Small FPGA things: Memory

When I started out with the design for this CPU the fetch decode execute cycle was a massive affair, resulting in 8 clock cycles for a MOVE instruction and 15 cycles for a LOADL (load long word from memory) instruction.

By closely looking at the timing diagrams for memory access we could reduce the number of cycles in the fetch part significantly. Meanwhile I implemented some additional optimizations and currently the MOVE and LOADL instruction clock in (pun intended) at 4 and 9 cycles respectively, a speedup of about 2x compared to the initial implementation.

The diagram below illustrates the different activities that take place in the various states (click to enlarge):

The important bit to understand here is that we do not read anything from memory in the decode and exec1 states. For some instructions this is inevitable because only after reading the second byte of an instruction (in fetch4) and adding the two source registers (available in decode, because adding those to registers needs a clock cycle) can we load the mem_raddr register and start loading two cycles later.
However, for instructions like LOADIL (load immediate ling word) and SETBRA, the data and offset respectively are located just after the actual instruction, so we could keep on incrementing the mem_raddr in states fetch 3 and fetch 4 so that the first two bytes would be available in the decode and exec 1 states as indicated by the highlighted 'gaps' in the table.

Even for the POP instruction we know what the address should be because we can refer to register 14 (the stackpointer). The only thing we have to keep in ind that we need to decide whether to keep on incrementing the mem_raddr register or to load it with the address in the stack pointer. We can make this decision in state fetch 3 already because there we read the high byte of the instruction which contains the intructions opcode.

So next on my agenda is to see whether we can indeed implement this idea. it would potentially shave of another 2 cycles from the the LOADIL, SETBRA and POP instructions so it is certainly worth the effort.

The iCE40 up5k that is used on the iCEBreaker board provides another type of memory besides the ubiquitous block ram: single port ram (spram).
No less than 128 Kbytes are provided and although it is a little bit unclear to me at the moment how fast they are, theY seem to function quite well with two clock cycle delay, so I can integrate them with my current design without a any changes to the CPU.

Implementation

The 128 Kbytes are provided as four blocks, each 16k x 16bits. As far as I know Yosys does not yet offer automatic inference, which means we have to use the iCE40 primitives directly. This may sound complicated but it is not as hard as it sounds.
The blocks take a 14 bit address (i.e. can address 16K words) and will read or write 16 bits at the time. Because we are interested in 8 bit bytes rather than words we need to make sure we return or write either the upper half or the lower half of a word depending on the address. For reading this means simply selecting, for writing this means setting a writemask that will limit which bits of a 16 bit word are actually written on receiving a write enable signal. Such a write mask itself is not 16 bit wide but just 4: 1 bit for each 4 bit nibble. We make this selection based on bit 14.

Code

The code below (GitHub) shows the implementation details. We use all four SB_SPRAM256KA blocks available on the up5k and use the top two bits of the 17 bit address to select a block. Bit 14 is then used to calculate the write mask (called nibble mask here). The same nibble mask is also used to select either the high or low byte from any 16 bit word we read from any of the four blocks. Note that our module's input data (wdata) is a byte but we always write 16 bit words. To this end we simply double the incoming byte; whether we actually write to high or low byte is determined by the write mask we construct and pass to the .MASKWREN input of the blocks.


// byte addressable spram
// uses all 128MB

module spram (
 input clk,
 input wen,
 input [16:0] addr,
 input [7:0] wdata,
 output [7:0] rdata
);

wire cs_0 = addr[16:15] == 0;
wire cs_1 = addr[16:15] == 1;
wire cs_2 = addr[16:15] == 2;
wire cs_3 = addr[16:15] == 3;

wire nibble_mask_hi = addr[14];
wire nibble_mask_lo = !addr[14];

wire [15:0] wdata16 = {wdata, wdata};

wire [15:0] rdata_0,rdata_1,rdata_2,rdata_3;
wire [7:0] rdata_0b = nibble_mask_hi ? rdata_0[15:8] : rdata_0[7:0];
wire [7:0] rdata_1b = nibble_mask_hi ? rdata_1[15:8] : rdata_1[7:0];
wire [7:0] rdata_2b = nibble_mask_hi ? rdata_2[15:8] : rdata_2[7:0];
wire [7:0] rdata_3b = nibble_mask_hi ? rdata_3[15:8] : rdata_3[7:0];

assign rdata = cs_0 ? rdata_0b : cs_1 ? rdata_1b : cs_2 ? rdata_2b : rdata_3b;

SB_SPRAM256KA ram0
  (
    .ADDRESS(addr[13:0]),
    .DATAIN(wdata16),
    .MASKWREN({nibble_mask_hi, nibble_mask_hi, nibble_mask_lo, nibble_mask_lo}),
    .WREN(wen),
    .CHIPSELECT(cs_0),
    .CLOCK(clk),
    .STANDBY(1'b0),
    .SLEEP(1'b0),
    .POWEROFF(1'b1),
    .DATAOUT(rdata_0)
  );

SB_SPRAM256KA ram1
  (
    .ADDRESS(addr[13:0]),
    .DATAIN(wdata16),
    .MASKWREN({nibble_mask_hi, nibble_mask_hi, nibble_mask_lo, nibble_mask_lo}),
    .WREN(wen),
    .CHIPSELECT(cs_1),
    .CLOCK(clk),
    .STANDBY(1'b0),
    .SLEEP(1'b0),
    .POWEROFF(1'b1),
    .DATAOUT(rdata_1)
  );

SB_SPRAM256KA ram2
  (
    .ADDRESS(addr[13:0]),
    .DATAIN(wdata16),
    .MASKWREN({nibble_mask_hi, nibble_mask_hi, nibble_mask_lo, nibble_mask_lo}),
    .WREN(wen),
    .CHIPSELECT(cs_2),
    .CLOCK(clk),
    .STANDBY(1'b0),
    .SLEEP(1'b0),
    .POWEROFF(1'b1),
    .DATAOUT(rdata_2)
  );

SB_SPRAM256KA ram3
  (
    .ADDRESS(addr[13:0]),
    .DATAIN(wdata16),
    .MASKWREN({nibble_mask_hi, nibble_mask_hi, nibble_mask_lo, nibble_mask_lo}),
    .WREN(wen),
    .CHIPSELECT(cs_3),
    .CLOCK(clk),
    .STANDBY(1'b0),
    .SLEEP(1'b0),
    .POWEROFF(1'b1),
    .DATAOUT(rdata_3)
  );

endmodule

Notes

Because of the values passed to the standby sleep and poweroff inputs we effectively keep the everything running full blast and presumably consuming quite a lot of power (relatively speaking). Since i have no idea at the moment hiw long it would take to resume from standby, i leave it at that for now.

Small FPGA things

Thursday, 5 March 2020

Optimizing the fetch decode execute cycle

Saturday, 4 January 2020

More memory: spram

Implementation

Code

Notes

CPU design