Small FPGA things

Sunday 15 March 2020

Count leading zeros in verilog

Added a new article on the website dedicated to documenting the Robin SoC/CPU.

It documents (with code) the implementation of a new module to count leading zeros.

Saturday 14 March 2020

Verilog test benches

Because of my very limited experience with Verilog all the test cases until now have been 'real' tests, i.e. they automate the execution of programs running on the actual hardware and verify expected results.

In a way these tests are better than simulations alone, because they verify that the final product actually runs on the hardware as intended. But they are also cumbersome: until you have a fair bit of infrastructure in place (like a UART and a monitor program) you cannot test at all, and those additional components contain modules that should really be tested themselves first.

I had a bit of luck, so it wasn't too difficult to get something working and proceed from there, but I still wanted to have some proper simulations in place to test individual modules, like the ALU for example.

So I finally got around to this and managed to create my first test bench. This test bench is for a new module I am developing to count the number of leading zeros (probably more on that in a future article). I run the test bench with the Icarus Verilog compiler (iverilog).
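To give an idea of what such a self-checking test bench can look like, here is a minimal sketch. The module name clz8, its ports and the 8 bit width are assumptions made for this example, and the behavioural clz8 is only a stand-in so the file runs on its own; it is not the actual Robin module.

```verilog
// Hypothetical stand-in for the module under test (not the Robin implementation).
module clz8(
    input  wire [7:0] a,
    output wire [3:0] n      // 0..8, where 8 means a == 0
);
    function [3:0] count;
        input [7:0] v;
        integer k;
        begin
            count = 4'd8;
            // scan upward; the last set bit seen is the highest one
            for (k = 0; k < 8; k = k + 1)
                if (v[k]) count = 7 - k;
        end
    endfunction

    assign n = count(a);
endmodule

module clz8_tb;
    reg  [7:0] a;
    wire [3:0] n;
    integer v, k, expected, found, errors;

    clz8 dut(.a(a), .n(n));

    initial begin
        errors = 0;
        for (v = 0; v < 256; v = v + 1) begin
            a = v[7:0];
            #1;
            // independent reference: count zeros from the top down
            expected = 0;
            found = 0;
            for (k = 7; k >= 0; k = k - 1)
                if (!found) begin
                    if (a[k]) found = 1;
                    else expected = expected + 1;
                end
            if (n !== expected[3:0]) begin
                // the word FATAL is what a wrapper script can look for
                $display("FATAL: clz8(%b) = %0d, expected %0d", a, n, expected);
                errors = errors + 1;
            end
        end
        if (errors == 0) $display("all 256 cases passed");
        $finish;
    end
endmodule
```

The test bench walks through all 256 input values and compares the module against a deliberately different way of counting, printing a line containing FATAL for every mismatch.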

While running the test bench I noted a couple of oddities:

Unlike Yosys, iverilog does not like postfix operators (like i--), so the following generate block gave an error:


for(i=7; i>=0 ; i--) begin
    ...
end

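The simple solution of course was to replace i-- with i=i-1 (as far as I can tell the ++ and -- operators come from SystemVerilog rather than plain Verilog), but it is still not completely clear to me what exactly the differences are between the Verilog versions supported by Yosys and Icarus.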
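A minimal sketch of a descending generate loop written in the form iverilog accepts. The reverse8 module is made up purely for illustration (it is not part of the Robin code); only the loop header matters here.

```verilog
// Illustrative module: reverses the bit order of a byte.
module reverse8(
    input  wire [7:0] d,
    output wire [7:0] q
);
    genvar i;
    generate
        // i = i - 1 instead of i--, which plain Verilog does not have
        for (i = 7; i >= 0; i = i - 1) begin : rev
            assign q[7 - i] = d[i];
        end
    endgenerate
endmodule

module reverse8_tb;
    reg  [7:0] d;
    wire [7:0] q;

    reverse8 dut(.d(d), .q(q));

    initial begin
        d = 8'b1100_0001;
        #1;
        if (q !== 8'b1000_0011) $display("FATAL: got %b", q);
        else $display("ok: %b -> %b", d, q);
        $finish;
    end
endmodule
```

The block label (rev) is there because Verilog-2001 wants generate loops to be named.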

Also, even though the developers are aware of this, iverilog has no option to return a non-zero return code: errors and fatal conditions only write messages to stdout. This means we have to check for specific strings to appear in the output in order to stop a Makefile. This isn't difficult and easily done with awk:


clz_tb: ../clz.v clz_tb.v
	$(VERILOG) -o $@ $^ ; vvp $@ | awk "/FATAL/{exit(1)}"

Doubts about Verilog

I can't tell at this point if VHDL or other hardware description languages are any better, but the longer I work with Verilog the more doubts I have: it does not clearly separate simulation from synthesis, its syntax (especially the scoping rules for variables) is illogical, a function body is a single statement so anything longer has to be wrapped in a begin ... end block (at least in Verilog-2001), and every implementation is allowed to diverge from the standard by choosing to implement some features or not. I am not sure why people in the hardware world accept this; I couldn't imagine this happening to Python implementations, for example.

Anyway, it works sort of, so we'll see where it gets us; maybe with a bit more experience it will be less awkward to work with.

Saturday 7 March 2020

Optimizing the fetch decode execute cycle II

In the last article we identified a couple of opportunities to decrease the cycles needed to execute the branch, pop and load immediate instructions. The key issue here was that we weren't reading the bytes until after we started setting the mem_raddr register in the DECODE cycle.

Because we already know the opcode for any instruction in the FETCH3 cycle, we can set the mem_raddr register with the contents of the stack pointer if we are dealing with a pop instruction, or keep on incrementing mem_raddr for those instructions that are followed by some bytes after the instruction itself, like the two byte offset for the branch instruction and the four bytes of the load immediate instruction. And if we set the mem_raddr register two cycles earlier, that means we can actually read those bytes two cycles earlier as well.

This newly implemented scenario is summed up in the table below.

The highlighted areas show where the changes are. From the second column we can see that we are setting or updating the mem_raddr register for every cycle from FETCH1 to EXEC3, and reading a byte in every cycle from FETCH3 to EXEC5.

This means that for the load immediate and pop instructions we're done in EXEC3 and for the branch instruction even one cycle earlier (not two cycles because although we read two bytes less, we also need to add the offset to the program counter and that takes a cycle).

Some more opportunities

There are still a few opportunities left for optimization for the mover, load byte and the push instruction, and I'll probably discuss those in a future article.

Thursday 5 March 2020

Optimizing the fetch decode execute cycle

When I started out with the design for this CPU the fetch decode execute cycle was a massive affair, resulting in 8 clock cycles for a MOVE instruction and 15 cycles for a LOADL (load long word from memory) instruction.

By closely looking at the timing diagrams for memory access we could reduce the number of cycles in the fetch part significantly. Meanwhile I implemented some additional optimizations, and currently the MOVE and LOADL instructions clock in (pun intended) at 4 and 9 cycles respectively, a speedup of about 2x compared to the initial implementation.

The diagram below illustrates the different activities that take place in the various states:

The important bit to understand here is that we do not read anything from memory in the decode and exec1 states. For some instructions this is inevitable because only after reading the second byte of an instruction (in fetch4) and adding the two source registers (available in decode, because adding those two registers needs a clock cycle) can we load the mem_raddr register and start loading two cycles later.

However, for instructions like LOADIL (load immediate long word) and SETBRA, the data and offset respectively are located just after the actual instruction, so we could keep on incrementing mem_raddr in states fetch3 and fetch4 so that the first two bytes would be available in the decode and exec1 states, as indicated by the highlighted 'gaps' in the table.

Even for the POP instruction we know what the address should be because we can refer to register 14 (the stack pointer). The only thing we have to keep in mind is that we need to decide whether to keep on incrementing the mem_raddr register or to load it with the address in the stack pointer. We can make this decision in state fetch3 already, because there we read the high byte of the instruction, which contains the instruction's opcode.
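The decision described above could be sketched as a small piece of combinational logic. This is an illustration only: the OP_POP value, the position of the opcode within the high byte, the 32 bit address width and the module itself are all assumptions, not the real Robin encoding or code.

```verilog
// Sketch: pick the next read address once the high byte is known.
module next_raddr(
    input  wire [7:0]  high_byte,   // byte seen on mem_data_out in fetch3
    input  wire [31:0] mem_raddr,   // current read address
    input  wire [31:0] sp,          // r[14], the stack pointer
    output wire [31:0] raddr_next
);
    localparam [3:0] OP_POP = 4'hE;         // hypothetical opcode value
    wire [3:0] opcode = high_byte[7:4];     // hypothetical opcode position

    // a pop reads from the stack; everything else keeps streaming the
    // bytes that follow the instruction
    assign raddr_next = (opcode == OP_POP) ? sp : mem_raddr + 32'd1;
endmodule

module next_raddr_tb;
    reg  [7:0]  high_byte;
    reg  [31:0] mem_raddr, sp;
    wire [31:0] raddr_next;

    next_raddr dut(.high_byte(high_byte), .mem_raddr(mem_raddr),
                   .sp(sp), .raddr_next(raddr_next));

    initial begin
        mem_raddr = 32'h0000_0102;
        sp        = 32'h0000_8000;
        high_byte = 8'hE0; #1;      // the made-up pop opcode
        if (raddr_next !== sp)
            $display("FATAL: pop should read from the stack");
        high_byte = 8'h30; #1;      // any other opcode
        if (raddr_next !== 32'h0000_0103)
            $display("FATAL: expected the incremented address");
        $display("done");
        $finish;
    end
endmodule
```

For instructions that are not followed by extra bytes the incremented address is simply never used, so a two-way choice like this would be enough for the idea to work.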

So next on my agenda is to see whether we can indeed implement this idea. It would potentially shave off another 2 cycles from the LOADIL, SETBRA and POP instructions, so it is certainly worth the effort.

Saturday 29 February 2020

The Robin SoC has a dedicated website now

I started documenting the design of the Robin SoC (and in particular the CPU) in a more structured manner than just a Wiki.

It is implemented as a GitHub site; check it out from time to time as articles get added.

Monday 24 February 2020

iCE40 BRAM & SPRAM access: The need for speed

Until now the central fetch-decode-execute cycle of the cpu contained a lot of wait cycles. It looked like this (where ip1 is r[15]+1):


case(state)
    FETCH1  :   begin
                    r[0] <= 0;
                    r[1] <= 1;
                    r[13][31] <= 1; // force the always on bit
                    mem_raddr <= ip;
                    state <= halted ? FETCH1 : FETCH2;
                end
    FETCH2  :   state <= FETCH3;
    FETCH3  :   begin
                    instruction[15:8] <= mem_data_out;
                    r[15] <= ip1;
                    state <= FETCH4;
                end
    FETCH4  :   begin
                    mem_raddr <= ip;
                    state <= FETCH5;
                end
    FETCH5  :   state <= FETCH6;
    FETCH6  :   begin
                    instruction[7:0] <= mem_data_out;
                    r[15] <= ip1;
                    ...

So between every assignment to the mem_raddr register (in state FETCH1 and FETCH4) and the retrieval of the byte from the mem_data_out register (in state FETCH3 and FETCH6) we had a wait cycle.

Now it is true that for the iCE40 BRAM there need to be two clock cycles between setting the address and reading the byte, but we can already set the new address in the next cycle, allowing us to read a byte every clock cycle once we set the initial address.

This alternative approach looks like this:


case(state)
    FETCH1  :   begin
                    r[0] <= 0;
                    r[1] <= 1;
                    r[13][31] <= 1; // force the always on bit
                    mem_raddr <= ip;
                    state <= halted ? FETCH1 : FETCH2;
                end
    FETCH2  :   begin
                    state <= FETCH3;
                    r[15] <= ip1;
                    mem_raddr <= ip1;
                end
    FETCH3  :   begin
                    instruction[15:8] <= mem_data_out;
                    state <= FETCH4;
                end
    FETCH4  :   begin
                    instruction[7:0] <= mem_data_out;
                    r[15] <= ip1;
                    ...

So we set the address in states FETCH1 and FETCH2 and read the corresponding bytes in states FETCH3 and FETCH4 respectively, saving us 2 clock cycles for every instruction. Since the most used instructions took 8 cycles and now take 6, this is a reduction of 25%. Not bad I think.
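The timing can be tried out in simulation with a toy model. This is a sketch under assumptions: the four byte memory, the cycle counter and the got array are invented for the example, and the RAM is modelled as a plain registered read rather than an actual iCE40 primitive.

```verilog
// Toy model: a new address every cycle, data captured two cycles later.
module pipelined_read_tb;
    reg        clk = 0;
    reg  [7:0] mem [0:3];       // stand-in for the block RAM contents
    reg  [1:0] mem_raddr;
    reg  [7:0] mem_data_out;
    reg  [7:0] got [0:3];       // bytes as captured by the "CPU"
    integer    cycle;

    always #5 clk = ~clk;

    // the RAM: synchronous (registered) read
    always @(posedge clk) mem_data_out <= mem[mem_raddr];

    // the "CPU": set an address in cycles 0..3, capture data in cycles 2..5
    always @(posedge clk) begin
        if (cycle < 4)
            mem_raddr <= cycle[1:0];
        if (cycle >= 2 && cycle < 6)
            got[cycle - 2] <= mem_data_out;
        cycle <= cycle + 1;
    end

    initial begin
        mem[0] = 8'hDE; mem[1] = 8'hAD; mem[2] = 8'hBE; mem[3] = 8'hEF;
        cycle = 0;
        #70;    // six rising edges have passed by now
        if (got[0] !== 8'hDE || got[1] !== 8'hAD ||
            got[2] !== 8'hBE || got[3] !== 8'hEF)
            $display("FATAL: got %h %h %h %h", got[0], got[1], got[2], got[3]);
        else
            $display("read 4 bytes in 6 cycles");
        $finish;
    end
endmodule
```

With a wait state between every address and every read, the same four bytes would need three cycles each.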

And although not very well documented (or documented at all actually) this setup works for SPRAMS as well.

Sunday 23 February 2020

The Robin SoC on the iCEbreaker: current status

The main focus in the last couple of weeks has been on the simplification of the CPU and ALU.

Simplification

The main decoding loop in the CPU was rather convoluted, so both the CPU and the ALU were redesigned a bit to improve readability of the Verilog code as well as reduce resource consumption. (The ALU code was updated in place, the CPU code got a new file.)

Because by now I also have some experience with the code that is being generated by the compiler, I was able to remove unused instructions and ALU operations. Previously the pop, push and setxxx instructions were considered sub-instructions within one special opcode, now they are individual instructions (in case of pop and push) or rolled into a single set-and-branch instruction. The new instruction set architecture was highlighted in a separate article.

Less resources

All in all this redesign shrunk the number of LUTs consumed from 5279 (yes, just one LUT removed from 100%) to 4879 (92%), which is pretty neat because it leaves some room for additional functionality or tweaks. The biggest challenge by the way is Yosys: even slight changes in the design, like assigning different values to labels of a case statement that is not full, may result in a different number of LUTs consumed. This is something that needs some more research, maybe Yosys offers additional optimization options that let me get the lowest resource count in a more predictable manner.

Better testing

A significant amount of effort was spent on designing more and better regression tests. Both for the SoC and the support programs (assembler, simulator, ...) regression tests and syntax checkers were added. Most of these were also added to GitHub push actions, with the exception of the actual hardware tests because I cannot run those on GitHub. And of course this was mainly done to show a few green banners on the repository home page 😀

Bug fixes

With a better testing framework in place it is far easier to check that changes don't inadvertently break something. This was put to work in fixing one of the more annoying bugs left in the ALU design: previously shift left by more than 31 and shift right by 0 did not give a proper result. This is now fixed.
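A regression check for exactly these edge cases might look like the sketch below. The shifter module, its ports and the 6 bit shift amount are assumptions made for the example and not the Robin ALU; "proper result" is taken to mean that a left shift by 32 or more gives zero and a right shift by 0 leaves the value unchanged.

```verilog
// Hypothetical stand-in for the shift part of an ALU.
module shifter(
    input  wire [31:0] a,
    input  wire [5:0]  amount,
    input  wire        right,   // 1 = shift right, 0 = shift left
    output wire [31:0] y
);
    assign y = right ? (a >> amount) : (a << amount);
endmodule

module shifter_tb;
    reg  [31:0] a;
    reg  [5:0]  amount;
    reg         right;
    wire [31:0] y;

    shifter dut(.a(a), .amount(amount), .right(right), .y(y));

    initial begin
        a = 32'h8000_0001;
        right = 0; amount = 6'd32; #1;      // left by more than 31
        if (y !== 32'd0) $display("FATAL: << 32 gave %h", y);
        right = 1; amount = 6'd0; #1;       // right by 0
        if (y !== a) $display("FATAL: >> 0 gave %h", y);
        right = 0; amount = 6'd31; #1;      // largest in-range left shift
        if (y !== 32'h8000_0000) $display("FATAL: << 31 gave %h", y);
        $display("shift checks done");
        $finish;
    end
endmodule
```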

Frustrations

The up5k on the iCEbreaker board has 8 DSP cores. We currently use 4 of them to implement 32x32 bit multiplication. The SB_MAC16 primitives we use for this are inferred by Yosys from some multiplication statements we use in the ALU (i.e. we do not instantiate them directly) and work fine.

However, when I want to instantiate some of them directly and configure them to be used as 32 bit adders, these instantiations will still multiply instead of add! No matter what I do, the result stays the same. I have to admit I have no idea how Yosys infers stuff, so it might very well be that my direct instantiation gets rewritten by some Yosys stage, and I will have to do some more research here.
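For reference, the kind of multiplication statement that a synthesis tool can map onto DSP blocks is nothing special. The sketch below is a generic example and not the actual ALU code; the mul32 module and the 64 bit result width are assumptions.

```verilog
// Generic multiply; with DSP inference enabled the synthesis tool is
// free to build this from hard multiplier blocks.
module mul32(
    input  wire [31:0] a,
    input  wire [31:0] b,
    output wire [63:0] p
);
    assign p = a * b;   // the 64 bit target makes this the full product
endmodule

module mul32_tb;
    reg  [31:0] a, b;
    wire [63:0] p;

    mul32 dut(.a(a), .b(b), .p(p));

    initial begin
        a = 32'hFFFF_FFFF;
        b = 32'hFFFF_FFFF;
        #1;
        // (2^32 - 1)^2 = 2^64 - 2^33 + 1
        if (p !== 64'hFFFF_FFFE_0000_0001) $display("FATAL: got %h", p);
        else $display("ok: %h", p);
        $finish;
    end
endmodule
```

Nothing in this source mentions SB_MAC16; whether and how the primitive gets used is entirely up to the tool, which is what makes the behaviour of direct instantiations harder to reason about.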

What next?

I think next on the agenda is performance: I think I use too many read states for the fetch/decode/execute cycle. The Lattice technical documentation seems to imply we can read and write new data every clock cycle, at least for block ram. Unfortunately the docs for the SPRAM are less clear. Anyway, this area for sure needs some attention.