Small FPGA things: spram

Until now the central fetch-decode-execute cycle of the cpu contained a lot of wait cycles. It looked like this (where ip1 is r[15]+1):


case(state)
    FETCH1  :   begin
                    r[0] <= 0;
                    r[1] <= 1;
                    r[13][31] <= 1; // force the always on bit
                    mem_raddr <= ip;
                    state <= halted ? FETCH1 : FETCH2;
                end
    FETCH2  :   state <= FETCH3;
    FETCH3  :   begin
                    instruction[15:8] <= mem_data_out;
                    r[15] <= ip1;
                    state <= FETCH4;
                end
    FETCH4  :   begin
                    mem_raddr <= ip;
                    state <= FETCH5;
                end
    FETCH5  :   state <= FETCH6;
    FETCH6  :   begin
                    instruction[7:0] <= mem_data_out;
                    r[15] <= ip1;
                    ...

So between every assignment to the mem_raddr register (in state FETCH1 and FETCH4) and the retrieval of the byte from the mem_data_out register (in state FETCH3 and FETCH6) we had a wait cycle.

Now it is true that for the ice40 BRAM there needs to be two clock cycles between setting the address and reading the byte, but we can already set the new address in the next cycle, allowing us to read a byte every clock cycle once we set the initial address.

This alternative approach looks like this:


case(state)
    FETCH1  :   begin
                    r[0] <= 0;
                    r[1] <= 1;
                    r[13][31] <= 1; // force the always on bit
                    mem_raddr <= ip;
                    state <= halted ? FETCH1 : FETCH2;
                end
    FETCH2  :   begin
                    state <= FETCH3;
                    r[15] <= ip1;
                    mem_raddr <= ip1;
                    end
    FETCH3  :   begin
                    instruction[15:8] <= mem_data_out;
                    state <= FETCH4;
                end
    FETCH4  :   begin
                    instruction[7:0] <= mem_data_out;
                    r[15] <= ip1;
                    ...

So we set the address in states FETCH1 and FETCH2 and read the corresponding bytes in states FETCH3 and FETCH4 respectively, saving us 2 clock cycles for every instruction. Since the most used instructions took 8 cycles and now 6, this is a reduction of 25%. Not bad I think.

And although not very well documented (or documented at all actually) this setup works for SPRAMS as well.

The iCE40 up5k that is used on the iCEBreaker board provides another type of memory besides the ubiquitous block ram: single port ram (spram).
No less than 128 Kbytes are provided and although it is a little bit unclear to me at the moment how fast they are, theY seem to function quite well with two clock cycle delay, so I can integrate them with my current design without a any changes to the CPU.

Implementation

The 128 Kbytes are provided as four blocks, each 16k x 16bits. As far as I know Yosys does not yet offer automatic inference, which means we have to use the iCE40 primitives directly. This may sound complicated but it is not as hard as it sounds.
The blocks take a 14 bit address (i.e. can address 16K words) and will read or write 16 bits at the time. Because we are interested in 8 bit bytes rather than words we need to make sure we return or write either the upper half or the lower half of a word depending on the address. For reading this means simply selecting, for writing this means setting a writemask that will limit which bits of a 16 bit word are actually written on receiving a write enable signal. Such a write mask itself is not 16 bit wide but just 4: 1 bit for each 4 bit nibble. We make this selection based on bit 14.

Code

The code below (GitHub) shows the implementation details. We use all four SB_SPRAM256KA blocks available on the up5k and use the top two bits of the 17 bit address to select a block. Bit 14 is then used to calculate the write mask (called nibble mask here). The same nibble mask is also used to select either the high or low byte from any 16 bit word we read from any of the four blocks. Note that our module's input data (wdata) is a byte but we always write 16 bit words. To this end we simply double the incoming byte; whether we actually write to high or low byte is determined by the write mask we construct and pass to the .MASKWREN input of the blocks.


// byte addressable spram
// uses all 128MB

module spram (
 input clk,
 input wen,
 input [16:0] addr,
 input [7:0] wdata,
 output [7:0] rdata
);

wire cs_0 = addr[16:15] == 0;
wire cs_1 = addr[16:15] == 1;
wire cs_2 = addr[16:15] == 2;
wire cs_3 = addr[16:15] == 3;

wire nibble_mask_hi = addr[14];
wire nibble_mask_lo = !addr[14];

wire [15:0] wdata16 = {wdata, wdata};

wire [15:0] rdata_0,rdata_1,rdata_2,rdata_3;
wire [7:0] rdata_0b = nibble_mask_hi ? rdata_0[15:8] : rdata_0[7:0];
wire [7:0] rdata_1b = nibble_mask_hi ? rdata_1[15:8] : rdata_1[7:0];
wire [7:0] rdata_2b = nibble_mask_hi ? rdata_2[15:8] : rdata_2[7:0];
wire [7:0] rdata_3b = nibble_mask_hi ? rdata_3[15:8] : rdata_3[7:0];

assign rdata = cs_0 ? rdata_0b : cs_1 ? rdata_1b : cs_2 ? rdata_2b : rdata_3b;

SB_SPRAM256KA ram0
  (
    .ADDRESS(addr[13:0]),
    .DATAIN(wdata16),
    .MASKWREN({nibble_mask_hi, nibble_mask_hi, nibble_mask_lo, nibble_mask_lo}),
    .WREN(wen),
    .CHIPSELECT(cs_0),
    .CLOCK(clk),
    .STANDBY(1'b0),
    .SLEEP(1'b0),
    .POWEROFF(1'b1),
    .DATAOUT(rdata_0)
  );

SB_SPRAM256KA ram1
  (
    .ADDRESS(addr[13:0]),
    .DATAIN(wdata16),
    .MASKWREN({nibble_mask_hi, nibble_mask_hi, nibble_mask_lo, nibble_mask_lo}),
    .WREN(wen),
    .CHIPSELECT(cs_1),
    .CLOCK(clk),
    .STANDBY(1'b0),
    .SLEEP(1'b0),
    .POWEROFF(1'b1),
    .DATAOUT(rdata_1)
  );

SB_SPRAM256KA ram2
  (
    .ADDRESS(addr[13:0]),
    .DATAIN(wdata16),
    .MASKWREN({nibble_mask_hi, nibble_mask_hi, nibble_mask_lo, nibble_mask_lo}),
    .WREN(wen),
    .CHIPSELECT(cs_2),
    .CLOCK(clk),
    .STANDBY(1'b0),
    .SLEEP(1'b0),
    .POWEROFF(1'b1),
    .DATAOUT(rdata_2)
  );

SB_SPRAM256KA ram3
  (
    .ADDRESS(addr[13:0]),
    .DATAIN(wdata16),
    .MASKWREN({nibble_mask_hi, nibble_mask_hi, nibble_mask_lo, nibble_mask_lo}),
    .WREN(wen),
    .CHIPSELECT(cs_3),
    .CLOCK(clk),
    .STANDBY(1'b0),
    .SLEEP(1'b0),
    .POWEROFF(1'b1),
    .DATAOUT(rdata_3)
  );

endmodule

Notes

Because of the values passed to the standby sleep and poweroff inputs we effectively keep the everything running full blast and presumably consuming quite a lot of power (relatively speaking). Since i have no idea at the moment hiw long it would take to resume from standby, i leave it at that for now.

Small FPGA things

Monday, 24 February 2020

iCE40 BRAM & SPRAM access: The need for speed

Saturday, 4 January 2020

More memory: spram

Implementation

Code

Notes

CPU design