When I started out with the design for this CPU, the fetch-decode-execute cycle was a massive affair, resulting in 8 clock cycles for a MOVE instruction and 15 cycles for a LOADL (load long word from memory) instruction.
By looking closely at the timing diagrams for memory access we could reduce the number of cycles in the fetch part significantly. Meanwhile I implemented some additional optimizations, and currently the MOVE and LOADL instructions clock in (pun intended) at 4 and 9 cycles respectively, a speedup of 2x for MOVE and about 1.7x for LOADL compared to the initial implementation.
The diagram below illustrates the different activities that take place in the various states:
The important bit to understand here is that we do not read anything from memory in the decode and exec1 states. For some instructions this is inevitable: only after reading the second byte of the instruction (in fetch4) and adding the two source registers (the sum is available in decode, because adding those two registers needs a clock cycle) can we load the mem_raddr register and start reading two cycles later.
However, for instructions like LOADIL (load immediate long word) and SETBRA, the data and offset respectively are located just after the actual instruction, so we could keep on incrementing mem_raddr in states fetch3 and fetch4 so that the first two bytes would be available in the decode and exec1 states, as indicated by the highlighted 'gaps' in the table.
Even for the POP instruction we know what the address should be, because we can refer to register 14 (the stack pointer). The only thing we have to keep in mind is that we need to decide whether to keep on incrementing the mem_raddr register or to load it with the address in the stack pointer. We can make this decision in state fetch3 already, because there we read the high byte of the instruction, which contains the instruction's opcode.
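The choice described above could be sketched roughly as follows. This is only an illustration under assumptions: the opcode value OP_POP, the position of the opcode in the upper nibble of the high byte, the 32-bit address width and the module and port names are all made up for the example and are not taken from the actual CPU source.

```verilog
// Sketch only: OP_POP, the opcode position (upper nibble of the high byte)
// and the 32-bit address width are assumptions, not the CPU's real encoding.
module next_raddr_sketch(
    input  wire [7:0]  mem_data_out, // high instruction byte, as seen in fetch3
    input  wire [31:0] mem_raddr,    // current read address
    input  wire [31:0] sp,           // r[14], the stack pointer
    output wire [31:0] next_raddr    // value to load into mem_raddr in fetch3
);
    localparam [3:0] OP_POP = 4'hF;  // hypothetical opcode value

    wire is_pop = (mem_data_out[7:4] == OP_POP);

    // POP reads from the stack; every other instruction keeps streaming
    // the bytes that follow it in memory.
    assign next_raddr = is_pop ? sp : mem_raddr + 32'd1;
endmodule

// Minimal testbench that exercises both branches of the selection.
module next_raddr_sketch_tb;
    reg  [7:0]  mem_data_out;
    reg  [31:0] mem_raddr = 32'h100;
    reg  [31:0] sp        = 32'h2000;
    wire [31:0] next_raddr;

    next_raddr_sketch dut(mem_data_out, mem_raddr, sp, next_raddr);

    initial begin
        mem_data_out = 8'hF0; #1 $display("POP   -> %h", next_raddr);
        mem_data_out = 8'h30; #1 $display("other -> %h", next_raddr);
        $finish;
    end
endmodule
```

In the real state machine this would presumably collapse into a single conditional assignment to mem_raddr inside the fetch3 state rather than a separate module.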
So next on my agenda is to see whether we can indeed implement this idea. It would potentially shave off another 2 cycles from the LOADIL, SETBRA and POP instructions, so it is certainly worth the effort.
Showing posts with label decode. Show all posts
Thursday, 5 March 2020
Monday, 24 February 2020
iCE40 BRAM & SPRAM access: The need for speed
Until now the central fetch-decode-execute cycle of the CPU contained a lot of wait cycles. It looked like this (where ip1 is r[15]+1):
case(state)
    FETCH1 : begin
            r[0] <= 0;
            r[1] <= 1;
            r[13][31] <= 1; // force the always on bit
            mem_raddr <= ip;
            state <= halted ? FETCH1 : FETCH2;
        end
    FETCH2 : state <= FETCH3;
    FETCH3 : begin
            instruction[15:8] <= mem_data_out;
            r[15] <= ip1;
            state <= FETCH4;
        end
    FETCH4 : begin
            mem_raddr <= ip;
            state <= FETCH5;
        end
    FETCH5 : state <= FETCH6;
    FETCH6 : begin
            instruction[7:0] <= mem_data_out;
            r[15] <= ip1;
            ...
So between every assignment to the mem_raddr register (in states FETCH1 and FETCH4) and the retrieval of the byte from the mem_data_out register (in states FETCH3 and FETCH6) we had a wait cycle.
Now it is true that for the iCE40 BRAM there need to be two clock cycles between setting the address and reading the byte, but we can already set the next address in the following cycle, allowing us to read a byte every clock cycle once we have set the initial address.
This alternative approach looks like this:
case(state)
    FETCH1 : begin
            r[0] <= 0;
            r[1] <= 1;
            r[13][31] <= 1; // force the always on bit
            mem_raddr <= ip;
            state <= halted ? FETCH1 : FETCH2;
        end
    FETCH2 : begin
            state <= FETCH3;
            r[15] <= ip1;
            mem_raddr <= ip1;
        end
    FETCH3 : begin
            instruction[15:8] <= mem_data_out;
            state <= FETCH4;
        end
    FETCH4 : begin
            instruction[7:0] <= mem_data_out;
            r[15] <= ip1;
            ...
So we set the address in states FETCH1 and FETCH2 and read the corresponding bytes in states FETCH3 and FETCH4 respectively, saving us 2 clock cycles for every instruction. Since the most used instructions took 8 cycles and now take 6, this is a reduction of 25%. Not bad I think.
And although it is not very well documented (or documented at all, actually), this setup works for SPRAM as well.
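The back-to-back addressing described above can be tried out in simulation with a small stand-alone model. The sketch below is not the CPU's actual memory interface: the module name, the 16-byte array, the start address 4 and the cycle counter are all assumptions made for the example. It models a memory with a registered read port, presents four consecutive addresses on four consecutive clock edges, and captures each byte two edges after its address was written.

```verilog
// Hypothetical stand-alone model; names and sizes are illustrative only.
module pipelined_read_tb;
    reg        clk = 0;
    reg  [7:0] mem [0:15];        // small stand-in for the block RAM
    reg  [3:0] mem_raddr = 0;     // address register, written by the "CPU"
    reg  [7:0] mem_data_out = 0;  // registered read port of the memory
    reg  [7:0] captured [0:3];    // bytes picked up by the "CPU"
    reg  [3:0] cycle = 0;         // counts clock edges
    integer    i;

    always #5 clk = ~clk;

    // Synchronous read: the byte at the address seen on this edge
    // shows up on mem_data_out after the edge.
    always @(posedge clk) mem_data_out <= mem[mem_raddr];

    // "CPU" side: a new address on each of the first four edges,
    // and a captured byte on each edge from the third onwards.
    always @(posedge clk) begin
        if (cycle < 4)
            mem_raddr <= 4'd4 + cycle;             // addresses 4, 5, 6, 7
        if (cycle >= 2 && cycle < 6)
            captured[cycle - 2] <= mem_data_out;   // two edges after the address
        cycle <= cycle + 1;
    end

    initial begin
        for (i = 0; i < 16; i = i + 1) mem[i] = 8'hA0 + i;
        repeat (7) @(posedge clk);
        #1;
        for (i = 0; i < 4; i = i + 1)
            $display("captured[%0d] = %h", i, captured[i]);
        $finish;
    end
endmodule
```

Run in a simulator, the four captured bytes should be the contents of addresses 4 to 7 in order, one per clock, which is the behaviour the shortened fetch sequence relies on.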
