Small FPGA things
An fpga discovery along IceSticks, Icebreakers, Verilog and more
Sunday 15 March 2020
Count leading zeros in verilog
It documents (with code) the implementation of a new module to count leading zeros.
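The module from that article is not reproduced here. As a rough idea of what counting leading zeros involves, here is a minimal sketch. The module name clz_sketch, the 32-bit width and the port names are assumptions for illustration, not taken from the actual implementation.

```verilog
// Hypothetical sketch: count the leading zeros of a 32-bit word.
// Names and widths are assumptions, not the module from the article.
module clz_sketch(
    input  wire [31:0] a,
    output reg  [5:0]  c    // result ranges from 0 to 32, so 6 bits
);
    integer i;

    always @* begin
        c = 6'd32;                      // default: no bit set at all
        // Scan from bit 0 upwards; every set bit overwrites the result,
        // so the highest set bit determines the final value.
        for (i = 0; i <= 31; i = i + 1)
            if (a[i])
                c = 31 - i;
    end
endmodule
```

A synthesis tool unrolls the loop into a priority chain. A tree of smaller counters would likely be shallower, but this form is the easiest to check by eye.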
Saturday 14 March 2020
Verilog test benches
for(i=7; i>=0 ; i--) begin
...
end
The simple solution of course was to replace i-- with i=i-1, but it is still not completely clear to me what the exact differences are in Verilog versions supported by Yosys and Icarus.
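For reference, the decrement operator is part of SystemVerilog rather than Verilog-2001, which is probably why tools disagree about it. A minimal sketch of the portable form follows; the module name loop_sketch is made up for this example.

```verilog
// Hypothetical sketch: a count-down loop that avoids the
// SystemVerilog-only decrement operator.
module loop_sketch;
    integer i;   // integer is signed, so the test i >= 0 can become false

    initial begin
        for (i = 7; i >= 0; i = i - 1) begin
            $display("i = %0d", i);
        end
        $finish;
    end
endmodule
```

Note that the loop variable has to be signed: with an unsigned reg, i >= 0 is always true and the loop never terminates.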
clz_tb: ../clz.v clz_tb.v
	$(VERILOG) -o $@ $^ ; vvp $@ | awk "/FATAL/{exit(1)}"
Doubts about Verilog
I can't tell at this point whether VHDL or other hardware description languages are any better, but the longer I work with Verilog the more doubts I have: it does not clearly separate simulation from synthesis, its syntax (especially the scoping rules for variables) is illogical, a function body is limited to a single statement unless you wrap it in begin ... end (at least in Verilog-2001), and every implementation is allowed to diverge from the standard by choosing to implement some features or not. I am not sure why people in the hardware world accept this; I couldn't imagine this happening to Python implementations, for example.
Anyway, it works sort of, so we'll see where it gets us; maybe with a bit more experience it will be less awkward to work with.
Saturday 7 March 2020
Optimizing the fetch decode execute cycle II
Because we already know the opcode of any instruction in the FETCH3 cycle, we can set the mem_raddr register to the contents of the stack pointer if we are dealing with a pop instruction, or keep on incrementing mem_raddr for those instructions that are followed by some bytes after the instruction itself, like the two-byte offset for the branch instruction and the four bytes of the load immediate instruction. And if we set the mem_raddr register two cycles earlier, that means we can actually read those bytes two cycles earlier as well.
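The address choice described above could be sketched roughly as follows. This is not the CPU's actual code: the module name, the 4-bit opcode field and the value of OP_POP are assumptions made up for the example.

```verilog
// Hypothetical sketch of the early address selection.
// Opcode width and encoding are placeholders, not the real instruction set.
module raddr_select_sketch(
    input  wire [3:0]  opcode,     // assumed: taken from the high instruction byte
    input  wire [31:0] mem_raddr,  // current read address
    input  wire [31:0] sp,         // stack pointer (register 14)
    output reg  [31:0] next_raddr
);
    localparam OP_POP = 4'd14;     // placeholder value

    always @* begin
        if (opcode == OP_POP)
            next_raddr = sp;             // pop: read from the top of the stack
        else
            next_raddr = mem_raddr + 1;  // otherwise keep reading the bytes that follow
    end
endmodule
```

Incrementing the address for instructions that have no trailing bytes should be harmless as long as the next FETCH1 reloads mem_raddr from the instruction pointer.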
This newly implemented scenario is summed up in the table below.
Some more opportunities
Thursday 5 March 2020
Optimizing the fetch decode execute cycle
By closely looking at the timing diagrams for memory access we could reduce the number of cycles in the fetch part significantly. Meanwhile I implemented some additional optimizations, and currently the MOVE and LOADL instructions clock in (pun intended) at 4 and 9 cycles respectively, a speedup of about 2x compared to the initial implementation.
The diagram below illustrates the different activities that take place in the various states:
The important bit to understand here is that we do not read anything from memory in the decode and exec1 states. For some instructions this is inevitable because only after reading the second byte of an instruction (in fetch4) and adding the two source registers (available in decode, because adding those two registers needs a clock cycle) can we load the mem_raddr register and start loading two cycles later.
However, for instructions like LOADIL (load immediate long word) and SETBRA, the data and offset respectively are located just after the actual instruction, so we could keep on incrementing mem_raddr in states fetch3 and fetch4 so that the first two bytes would be available in the decode and exec1 states, as indicated by the highlighted 'gaps' in the table.
Even for the POP instruction we know what the address should be because we can refer to register 14 (the stack pointer). The only thing we have to keep in mind is that we need to decide whether to keep on incrementing the mem_raddr register or to load it with the address in the stack pointer. We can make this decision in state fetch3 already because there we read the high byte of the instruction, which contains the instruction's opcode.
So next on my agenda is to see whether we can indeed implement this idea. It would potentially shave off another 2 cycles from the LOADIL, SETBRA and POP instructions, so it is certainly worth the effort.
Saturday 29 February 2020
The Robin SoC has a dedicated website now
It is implemented as a GitHub site; check it out from time to time as articles get added.
Monday 24 February 2020
iCE40 BRAM & SPRAM access: The need for speed
(ip1 is r[15]+1):
case(state)
    FETCH1 : begin
        r[0] <= 0;
        r[1] <= 1;
        r[13][31] <= 1; // force the always on bit
        mem_raddr <= ip;
        state <= halted ? FETCH1 : FETCH2;
    end
    FETCH2 : state <= FETCH3;
    FETCH3 : begin
        instruction[15:8] <= mem_data_out;
        r[15] <= ip1;
        state <= FETCH4;
    end
    FETCH4 : begin
        mem_raddr <= ip;
        state <= FETCH5;
    end
    FETCH5 : state <= FETCH6;
    FETCH6 : begin
        instruction[7:0] <= mem_data_out;
        r[15] <= ip1;
        ...
So between every assignment to the mem_raddr register (in state FETCH1 and FETCH4) and the retrieval of the byte from the mem_data_out register (in state FETCH3 and FETCH6) we had a wait cycle.
Now it is true that for the iCE40 BRAM there need to be two clock cycles between setting the address and reading the byte, but we can already set the new address in the next cycle, allowing us to read a byte every clock cycle once we have set the initial address.
This alternative approach looks like this:
case(state)
    FETCH1 : begin
        r[0] <= 0;
        r[1] <= 1;
        r[13][31] <= 1; // force the always on bit
        mem_raddr <= ip;
        state <= halted ? FETCH1 : FETCH2;
    end
    FETCH2 : begin
        state <= FETCH3;
        r[15] <= ip1;
        mem_raddr <= ip1;
    end
    FETCH3 : begin
        instruction[15:8] <= mem_data_out;
        state <= FETCH4;
    end
    FETCH4 : begin
        instruction[7:0] <= mem_data_out;
        r[15] <= ip1;
        ...
So we set the address in states FETCH1 and FETCH2 and read the corresponding bytes in states FETCH3 and FETCH4 respectively, saving us 2 clock cycles for every instruction. Since the most used instructions took 8 cycles and now take 6, this is a reduction of 25%. Not bad I think.
And although not very well documented (or documented at all actually) this setup works for SPRAM as well.
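To see why back-to-back reads work, it can help to simulate a simple behavioural model. The sketch below is an assumption about how a synchronous read port behaves, not the Lattice primitive itself; the module name, the memory size and the test contents are made up.

```verilog
// Hypothetical sketch: a registered address feeding a synchronous read port.
// A new address is presented every clock, and a new byte comes out every clock.
module pipelined_read_sketch;
    reg        clk = 0;
    reg  [3:0] mem_raddr = 0;
    reg  [7:0] mem_data_out;
    reg  [7:0] mem [0:15];
    integer    k;

    always #5 clk = ~clk;

    // Synchronous read: the output register is loaded on the clock edge.
    always @(posedge clk) mem_data_out <= mem[mem_raddr];

    // Present the next address every cycle, without waiting for the data.
    always @(posedge clk) mem_raddr <= mem_raddr + 1;

    initial begin
        for (k = 0; k < 16; k = k + 1) mem[k] = 8'h10 + k;  // recognisable test pattern
        repeat (6) begin
            @(posedge clk);
            #1 $display("mem_raddr=%0d mem_data_out=%h", mem_raddr, mem_data_out);
        end
        $finish;
    end
endmodule
```

In the printed trace the data should lag the address by one step while still advancing every clock. That matches the fetch states above: the address is registered in one cycle, the memory output in the next, and the byte can be captured in the cycle after that.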
Sunday 23 February 2020
The Robin SoC on the iCEbreaker: current status
Simplification
The main decoding loop in the CPU and the ALU were rather convoluted, so both were redesigned a bit to improve the readability of the Verilog code as well as to reduce resource consumption. (The ALU code was updated in place; the CPU code got a new file.)
Because by now I also have some experience with the code that is being generated by the compiler, I was able to remove unused instructions and ALU operations. Previously the pop, push and setxxx instructions were considered sub-instructions within one special opcode; now they are individual instructions (in the case of pop and push) or rolled into a single set-and-branch instruction. The new instruction set architecture was highlighted in a separate article.
Less resources
All in all this redesign shrank the number of LUTs consumed from 5279 (yes, just one LUT short of 100%) to 4879 (92%), which is pretty neat because it leaves some room for additional functionality or tweaks. The biggest challenge by the way is Yosys: even slight changes in the design, like assigning different values to the labels of a case statement that is not full, may result in a different number of LUTs consumed. This is something that needs some more research; maybe Yosys offers additional optimization options that let me get the lowest resource count in a more predictable manner.
Better testing
A significant amount of effort was spent on designing more and better regression tests. Both for the SoC and the support programs (assembler, simulator, ...) regression tests and syntax checkers were added. Most of these were also added to GitHub push actions, with the exception of the actual hardware tests because I cannot run those on GitHub. And of course this was mainly done to show a few green banners on the repository home page 😀
Bug fixes
With a better testing framework in place it is far easier to check that changes don't inadvertently break something. This was put to work in fixing one of the more annoying bugs left in the ALU design: previously shift left by more than 31 and shift right by 0 did not give a proper result. This is now fixed.
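The two edge cases mentioned above are easy to get wrong when only the low bits of the shift amount are used. The following sketch shows one way to handle them explicitly. It is not the ALU's actual code: the module and port names are assumptions, and it covers logical shifts only.

```verilog
// Hypothetical sketch: a logical shifter that is explicit about the edge cases.
module shift_sketch(
    input  wire [31:0] a,
    input  wire [31:0] n,      // shift amount
    input  wire        right,  // 1 = shift right, 0 = shift left
    output wire [31:0] y
);
    wire        too_far = |n[31:5];    // amount is 32 or more
    wire [31:0] shl     = a << n[4:0];
    wire [31:0] shr     = a >> n[4:0]; // an amount of 0 returns a unchanged

    // Shifting by 32 or more clears every bit; otherwise pick the direction.
    assign y = too_far ? 32'b0 : (right ? shr : shl);
endmodule
```

A regression test for such a block would at least cover the amounts 0, 1, 31, 32 and one value well above 32, for both directions.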
Frustrations
The UP5K on the iCEbreaker board has 8 DSP cores. We currently use 4 of them to implement 32x32 bit multiplication. The SB_MAC16 primitives we use for this are inferred by Yosys from some multiplication statements we use in the ALU (i.e. we do not instantiate them directly) and work fine.
However, when I want to instantiate some of them directly and configure them to be used as 32 bit adders, these instantiations still multiply instead of add! No matter what I do, the result stays the same. I have to admit I have no idea how Yosys infers stuff, so it might very well be that my direct instantiation gets rewritten by some Yosys stage, and I will have to do some more research here.
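For contrast with direct instantiation, this is roughly what the inferred variant looks like: a plain multiplication statement with no vendor primitives. The module name and ports are assumptions for illustration, and whether (and how many) DSP blocks end up being used depends on the synthesis flow and its options.

```verilog
// Hypothetical sketch: a 32x32 multiplication written as plain Verilog,
// leaving it to the synthesis tool to map it onto DSP blocks.
module mul32_sketch(
    input  wire        clk,
    input  wire [31:0] a,
    input  wire [31:0] b,
    output reg  [63:0] p
);
    // The 64-bit target widens the expression, so the full product is kept.
    always @(posedge clk)
        p <= a * b;
endmodule
```

The advantage of this style is that the same source simulates in Icarus without any vendor cell models; the drawback is that control over how the DSP blocks are configured is left to the tool.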
What next?
I think next on the agenda is performance: I probably use too many read states for the fetch/decode/execute cycle. The Lattice technical documentation seems to imply we can read and write new data every clock cycle, at least for block RAM. Unfortunately the docs for the SPRAM are less clear. Anyway, this area for sure needs some attention.
CPU design
The CPU design as currently implemented largely follows the diagram shown below. It features a 16 x 32bit register file and 16 bit instructi...