Monday, 24 February 2020

iCE40 BRAM & SPRAM access: The need for speed

Until now, the central fetch-decode-execute cycle of the CPU contained a lot of wait cycles. It looked like this (where ip1 is r[15]+1):

case(state)
    FETCH1  :   begin
                    r[0] <= 0;
                    r[1] <= 1;
                    r[13][31] <= 1; // force the always on bit
                    mem_raddr <= ip;
                    state <= halted ? FETCH1 : FETCH2;
                end
    FETCH2  :   state <= FETCH3; // wait cycle
    FETCH3  :   begin
                    instruction[15:8] <= mem_data_out;
                    r[15] <= ip1;
                    state <= FETCH4;
                end
    FETCH4  :   begin
                    mem_raddr <= ip;
                    state <= FETCH5;
                end
    FETCH5  :   state <= FETCH6; // wait cycle
    FETCH6  :   begin
                    instruction[7:0] <= mem_data_out;
                    r[15] <= ip1;
                    ...
So between every assignment to the mem_raddr register (in states FETCH1 and FETCH4) and the retrieval of the byte from the mem_data_out register (in states FETCH3 and FETCH6) we had a wait cycle.

Now it is true that for the iCE40 BRAM there need to be two clock cycles between setting the address and reading the byte, but we can already set the next address in the following cycle. Once the initial address is set, this allows us to read a byte every clock cycle.

This alternative approach looks like this:


case(state)
    FETCH1  :   begin
                    r[0] <= 0;
                    r[1] <= 1;
                    r[13][31] <= 1; // force the always on bit
                    mem_raddr <= ip;
                    state <= halted ? FETCH1 : FETCH2;
                end
    FETCH2  :   begin
                    state <= FETCH3;
                    r[15] <= ip1;
                    mem_raddr <= ip1;
                end
    FETCH3  :   begin
                    instruction[15:8] <= mem_data_out;
                    state <= FETCH4;
                end
    FETCH4  :   begin
                    instruction[7:0] <= mem_data_out;
                    r[15] <= ip1;
                    ...
So we set the address in states FETCH1 and FETCH2 and read the corresponding bytes in states FETCH3 and FETCH4 respectively, saving us 2 clock cycles for every instruction. Since the most used instructions took 8 cycles and now take 6, this is a reduction of 25%. Not bad I think.

And although this is not very well documented (or documented at all, actually), this setup works for SPRAMs as well.
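To make the timing a bit more concrete, here is a minimal simulation sketch of the overlapped fetch. It assumes the memory behaves like a simple registered read port, which is my simplification and not the actual iCE40 primitive. The sync_ram module, the tb testbench, the memory contents and the numeric state values are all hypothetical and only there for illustration.

```verilog
// Hypothetical behavioural stand-in for a memory with a registered read port.
// Assumption: data_out is updated on the clock edge from the address that was
// stable before that edge. This is a simulation model, not a vendor primitive.
module sync_ram (
    input  wire       clk,
    input  wire [8:0] raddr,
    output reg  [7:0] data_out
);
    reg [7:0] mem [0:511];
    always @(posedge clk)
        data_out <= mem[raddr];
endmodule

// Hypothetical testbench: issues two addresses back to back and then
// collects the two bytes on the two following clock edges.
module tb;
    reg         clk = 0;
    reg  [8:0]  ip = 9'd4;          // arbitrary start address for the demo
    reg  [8:0]  mem_raddr = 0;
    reg  [2:0]  state = 0;
    reg  [15:0] instruction = 0;
    wire [7:0]  mem_data_out;

    sync_ram ram (.clk(clk), .raddr(mem_raddr), .data_out(mem_data_out));

    initial begin
        ram.mem[4] = 8'hAB;         // high byte of the made-up instruction
        ram.mem[5] = 8'hCD;         // low byte
    end

    always #5 clk = ~clk;

    always @(posedge clk)
        case (state)
            0: begin mem_raddr <= ip;     state <= 1; end  // like FETCH1
            1: begin mem_raddr <= ip + 1; state <= 2; end  // like FETCH2
            2: begin instruction[15:8] <= mem_data_out; state <= 3; end  // like FETCH3
            3: begin instruction[7:0]  <= mem_data_out; state <= 4; end  // like FETCH4
            default: begin
                // Self-check: both bytes should have arrived in order.
                if (instruction !== 16'hABCD) $display("FAIL: %h", instruction);
                else                          $display("OK: %h", instruction);
                $finish;
            end
        endcase
endmodule
```

Under these assumptions the first byte still arrives two cycles after its address was set, but the second byte follows in the very next cycle, which is where the two saved cycles per instruction come from. The sketch should run in any Verilog simulator that allows hierarchical references in a testbench (I would expect Icarus Verilog to be fine with it).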

CPU design

The CPU design as currently implemented largely follows the diagram shown below. It features a 16 x 32bit register file and 16 bit instructi...