Sunday, 15 March 2020

Count leading zeros in verilog

Added a new article on the website dedicated to documenting the Robin SoC/CPU.

It documents (with code) the implementation of a new module to count leading zeros.


Saturday, 14 March 2020

Verilog test benches

Because of my very limited experience with Verilog all the test cases until now have been 'real' tests, i.e. they automate the execution of programs running on the actual hardware and verify expected results.

In a way these tests are better than just simulations because they verify that the final product actually runs on the hardware as intended but they are also cumbersome: until you have a fair bit of infrastructure in place (like a UART and a monitor program) you cannot test at all and also these additional components contain modules that should better be tested themselves first.

I had a bit of luck so it wasn't too difficult to get something working and then proceed from there but i still wanted to have some proper simulations in place to test individual modules like the ALU for example.

So I finally got around this and managed to create my first test bench. This test bench is for a new module I am developing to count the number of leading zeros (probably more on that in a future article). I run the test bench with the Icarus Verilog compiler (iverilog).

While running the test bench I noted a couple of oddities:

Unlike yosys, iverilog does not like postfix operators (like i--), so the following generate block gave an error


for(i=7; i>=0 ; i--) begin

    ...

end

The simple solution of course was to replace i-- with i=i-1, but it is still not completely clear to me what the exact differences are in Verilog versions supported by Yosys and Icarus.
Also, even though the developers are aware of this, iverilog has no option to return a non-zero return code: errors and fatal conditions only write messages to stdout. This means we have to check for specific strings to appear in the output in order to stop a Makefile. This isn't difficult and easily done with awk:


clz_tb: ../clz.v clz_tb.v
 $(VERILOG) -o $@ $^ ; vvp $@ | awk "/FATAL/{exit(1)}"

Doubts about Verilog

I can't tell at this point if VHDL or other hardware definition languages are any better but the longer i work with Verilog the more doubts i have: It does not clearly separate simulation from synthesis, its syntax (especially scoping rules for variables) is illogical, you can't define functions with more than one statement (at least not in verilog-2001) and every implementation is allowed to diverge from the standard by chosing to implement some features or not. I am not sure why people in the hardware world accept this, couldn't imagine this happening to Python implementations for example.

Anyway, it works sort of, so we'll see where it gets us; maybe with a bit more experience it will be less awkward to work with.

Saturday, 7 March 2020

Optimizing the fetch decode execute cycle II

In the last article we identified a couple of opportunities to decrease the cycles needed to execute the branch, pop and load immediate instructions. The key issue hear was that we weren't reading the bytes until after we started setting the mem_raddr register in the DECODE cycle.

Because we know the opcode for any instruction already in the FETCH3 cycle we can set the mem_raddr register with the contents of the stackpointer if we are dealing with a pop instruction or keep on incrementing the mem_raddr for those instructions that are followed by some bytes after the instructions itself, like the two byte offset fir the branch instruction and the four bytes of the load immediate instruction. And if we set the mem_raddr register two cycles earlier that means that we can actually read those bytes two cycles earlier as well.

This newly implemented scenario is summed up in the table below (click to enlarge)


The highlighted areas show where the changes are. From the second column we can see that we are setting or updating the mem_raddr register for every cycle from FETCH1 to EXEC3, and reading a byte in every cycle from FETCH3 to EXEC5.

This means that for the load immediate and pop instructions we're done in EXEC3 and for the branch instruction even one cycle earlier (not two cycles because although we read two bytes less, we also need to add the offset to the program counter and that takes a cycle).

Some more opportunities


There are still a few opportunities left for optimization for the mover, load byte and the push instruction and i'll probably discuss that in a future article.

Thursday, 5 March 2020

Optimizing the fetch decode execute cycle

When I started out with the design for this CPU the fetch decode execute cycle was a massive affair, resulting in 8 clock cycles for a MOVE instruction and 15 cycles for a LOADL (load long word from memory) instruction.

By closely looking at the timing diagrams for memory access we could reduce the number of cycles in the fetch part significantly. Meanwhile I implemented some additional optimizations and currently the MOVE and LOADL instruction clock in (pun intended) at 4 and 9 cycles respectively, a speedup of about 2x compared to the initial implementation.

The diagram below illustrates the different activities that take place in the various states (click to enlarge):


The important bit to understand here is that we do not read anything from memory in the decode and exec1 states. For some instructions this is inevitable because only after reading the second byte of an instruction (in fetch4) and adding the two source registers (available in decode, because adding those to registers needs a clock cycle) can we load the mem_raddr register and start loading two cycles later.
However, for instructions like LOADIL (load immediate ling word) and SETBRA, the data and offset respectively are located just after the actual instruction, so we could keep on incrementing the mem_raddr in states fetch 3 and fetch 4 so that the first two bytes would be available in the decode and exec 1 states as indicated by the highlighted 'gaps' in the table.

Even for the POP instruction we know what the address should be because we can refer to register 14 (the stackpointer). The only thing we have to keep in ind that we need to decide whether to keep on incrementing the mem_raddr register or to load it with the address in the stack pointer. We can make this decision in state fetch 3 already because there we read the high byte of the instruction which contains the intructions opcode.

So next on my agenda is to see whether we can indeed implement this idea. it would potentially shave of another 2 cycles from the the LOADIL, SETBRA and POP instructions so it is certainly worth the effort.




CPU design

The CPU design as currently implemented largely follows the diagram shown below. It features a 16 x 32bit register file and 16 bit instructi...