I started documenting the design of the Robin SoC (and in particular the CPU) in a more structured manner than just a Wiki.
It is implemented as a GitHub site, check it out from time to time as articles get added.
Saturday 29 February 2020
Monday 24 February 2020
iCE40 BRAM & SPRAM access: The need for speed
Until now the central fetch-decode-execute cycle of the cpu contained a lot of wait cycles. It looked like this (where
ip1
is r[15]+1
):
case(state)
FETCH1 : begin
r[0] <= 0;
r[1] <= 1;
r[13][31] <= 1; // force the always on bit
mem_raddr <= ip;
state <= halted ? FETCH1 : FETCH2;
end
FETCH2 : state <= FETCH3;
FETCH3 : begin
instruction[15:8] <= mem_data_out;
r[15] <= ip1;
state <= FETCH4;
end
FETCH4 : begin
mem_raddr <= ip;
state <= FETCH5;
end
FETCH5 : state <= FETCH6;
FETCH6 : begin
instruction[7:0] <= mem_data_out;
r[15] <= ip1;
...
So between every assignment to the mem_raddr
register (in state FETCH1 and FETCH4) and the retrieval of the byte from the mem_data_out
register (in state FETCH3 and FETCH6) we had a wait cycle.
Now it is true that for the ice40 BRAM there needs to be two clock cycles between setting the address and reading the byte, but we can already set the new address in the next cycle, allowing us to read a byte every clock cycle once we set the initial address.
This alternative approach looks like this:
case(state)
FETCH1 : begin
r[0] <= 0;
r[1] <= 1;
r[13][31] <= 1; // force the always on bit
mem_raddr <= ip;
state <= halted ? FETCH1 : FETCH2;
end
FETCH2 : begin
state <= FETCH3;
r[15] <= ip1;
mem_raddr <= ip1;
end
FETCH3 : begin
instruction[15:8] <= mem_data_out;
state <= FETCH4;
end
FETCH4 : begin
instruction[7:0] <= mem_data_out;
r[15] <= ip1;
...
So we set the address in states FETCH1 and FETCH2 and read the corresponding bytes in states FETCH3 and FETCH4 respectively, saving us 2 clock cycles for every instruction. Since the most used instructions took 8 cycles and now 6, this is a reduction of 25%. Not bad I think.
And although not very well documented (or documented at all actually) this setup works for SPRAMS as well.
Sunday 23 February 2020
The Robin SoC on the iCEbreaker: current status
The main focus in the last couple of weeks has been on the simplification of the CPU and ALU.
The main decoding loop in the CPU was rather convoluted so both were redesigned a bit to improve readability of the Verilog code as well as reduce resource consumption. (The ALU code was updated in place, the CPU code got a new file)
Because by now I also have some experience with the code that is being generated by the compiler, I was able to remove unused instructions and ALU operations. Previously the pop, push and setxxx instructions were considered sub-instructions within one special opcode, now they are individual instructions (in case of pop and push) or rolled into a single set-and-branch instruction. The new instruction set architecture was highlighted in a separate article.
All in all this redesign shrunk the number of LUTs consumed from 5279 (yes, just one LUT removed from 100%) to 4879 (92%), which is pretty neat because it leaves some room for additional functionality or tweaks. The biggest challenge by the way is Yosys: even slight changes in the design, like assigning different values to labels of a case statement that is not full, may result in a different number of LUTs consumed. This is something that needs some more research, maybe Yosys offers additional optimization options that let me get the lowest resource count in a more predictable manner.
A significant amount of effort was spent on designing more and better regression tests. Both for the SoC and the support programs (assembler, simulator, ...) regression tests and syntax checkers were added. Most of these were also added to GitHub push actions, with the exception of the actual hardware tests because I cannot run those on GitHub. And of course this mainly done to show a few green banners on the repository home page 😀
With a better testing framework in place it is far easier to check whether changes don't inadvertently break something. This was put to work in fixing one of the more annoying bugs left in the ALU design: previously shift left by more than 31 and shift right by 0 did not give a proper result. This is now fixed.
The up5k on the iCEbreaker board has 8 dsp cores. We currently use 4 of them to implement 32x32 bit multiplication. The SB_MAC16 primitives we use for this are inferred by Yosys from some multiplication statements we use in the ALU (i.e. we do not instantiate them directly) and work fine.
However, when I want to instantiate some of them directly and configure them to be used as 32 bit adders these instantiations will still multiply instead of add! No matter what I do, the result stays teh same. I have to admit I have no idea how Yosys infers stuff so it might very well be that my direct instantiation gets rewritten by some Yosys stage, so I will have to do some more research here.
I think next on the agenda is performance: I think I use too many read states for the fetch/decode/execute cycle. The Lattice technical documentation seems to imply we can read and write new data every clock cycle, at least for block ram. Unfortunately the docs for the SPRAM are less clear. Anyway, this area for sure needs some attention.
Simplification
The main decoding loop in the CPU was rather convoluted so both were redesigned a bit to improve readability of the Verilog code as well as reduce resource consumption. (The ALU code was updated in place, the CPU code got a new file)
Because by now I also have some experience with the code that is being generated by the compiler, I was able to remove unused instructions and ALU operations. Previously the pop, push and setxxx instructions were considered sub-instructions within one special opcode, now they are individual instructions (in case of pop and push) or rolled into a single set-and-branch instruction. The new instruction set architecture was highlighted in a separate article.
Less resources
All in all this redesign shrunk the number of LUTs consumed from 5279 (yes, just one LUT removed from 100%) to 4879 (92%), which is pretty neat because it leaves some room for additional functionality or tweaks. The biggest challenge by the way is Yosys: even slight changes in the design, like assigning different values to labels of a case statement that is not full, may result in a different number of LUTs consumed. This is something that needs some more research, maybe Yosys offers additional optimization options that let me get the lowest resource count in a more predictable manner.
Better testing
A significant amount of effort was spent on designing more and better regression tests. Both for the SoC and the support programs (assembler, simulator, ...) regression tests and syntax checkers were added. Most of these were also added to GitHub push actions, with the exception of the actual hardware tests because I cannot run those on GitHub. And of course this mainly done to show a few green banners on the repository home page 😀
Bug fixes
With a better testing framework in place it is far easier to check whether changes don't inadvertently break something. This was put to work in fixing one of the more annoying bugs left in the ALU design: previously shift left by more than 31 and shift right by 0 did not give a proper result. This is now fixed.
Frustrations
The up5k on the iCEbreaker board has 8 dsp cores. We currently use 4 of them to implement 32x32 bit multiplication. The SB_MAC16 primitives we use for this are inferred by Yosys from some multiplication statements we use in the ALU (i.e. we do not instantiate them directly) and work fine.
However, when I want to instantiate some of them directly and configure them to be used as 32 bit adders these instantiations will still multiply instead of add! No matter what I do, the result stays teh same. I have to admit I have no idea how Yosys infers stuff so it might very well be that my direct instantiation gets rewritten by some Yosys stage, so I will have to do some more research here.
What next?
I think next on the agenda is performance: I think I use too many read states for the fetch/decode/execute cycle. The Lattice technical documentation seems to imply we can read and write new data every clock cycle, at least for block ram. Unfortunately the docs for the SPRAM are less clear. Anyway, this area for sure needs some attention.
Saturday 22 February 2020
Instruction Set Architecture
The last couple of weeks I experimented a bit with details of the instruction set and the final design is shown below. We settled for just 13 instructions, although the alu supports about 15 different operations so 28 instructions would be an equally valid number.
All instructions are 2 bytes long, with the exception of loadil (load immediate long) which is 6 bytes and the set and branch instruction which is 4 bytes.
Instructions consist of 4 fields each 4 bits wide. The first field is the opcode, the other 3 fields typically specify registers and are called r2,r1 and r0 respectively, although there are exceptions. For loadi (load immediate) r1 and r0 together encode a byte value and for set and branch r2 encodes the condition.
All instructions are 2 bytes long, with the exception of loadil (load immediate long) which is 6 bytes and the set and branch instruction which is 4 bytes.
Encoding
Instructions consist of 4 fields each 4 bits wide. The first field is the opcode, the other 3 fields typically specify registers and are called r2,r1 and r0 respectively, although there are exceptions. For loadi (load immediate) r1 and r0 together encode a byte value and for set and branch r2 encodes the condition.
move | move r2,r1,r0 | r2 = r1 + r0 | move sum of source registers to destination |
pop | pop r2 | r2 = (sp) ; sp += 4 | pop top of stack to destination register |
push | push r2 | sp -= 4; (sp) = r2 | push register onto stack |
alu | alu r2,r1,r0 | r2 = r1 op r0 | perform alu operation (op = r[13][7:0]) |
mover | mover r2,r1,n | r2 = r1 + 4*n | add multiple of 4, n = [-8,7] |
stor | stor r2,r1,r0 | (r1 + r0) = r2[7:0] | store byte in memory |
storl | storl r2,r1,r0 | (r1 + r0) = r2 | store long word in memory |
load | load r2,r1,r0 | r2[7:0] = (r1 + r0) | load byte from memory |
loadl | loadl r2,r1,r0 | r2 = (r1 + r0) | load long word from memory |
loadi | loadi r2,#n | r2[7:0] = n | load byte immediate, n = 8b bit value |
loadil | loadil r2,#n | r2 = (pc +2); pc+=4 | load long word immediate |
jal | jal r2,r1,r0 | r2 = pc; pc = r1+r0 | jump and link |
halt | halt | halt execution | |
setbXX | setbXX r1,off | see below | set and branch on condition XX |
Notes
- Loading a byte from memory or immediately does not sign extend it. This means that a destinations register may need to be zeroed out before loading the byte.
- The alu operation performs the operation stored in the lower byte of R13. This means that choosing the operation and actually performing it are two separate steps. You can reuse this operation, if performing multiple additions for example there is no need to reload R13
- R13 is also the flags register: bits 31 is always set while bit 30 and 29 are the sign and zero bit respectively.
- The alu does not calculate a carry or borrow
Set and branch
The set and branch instruction acts on flags in R13 and can be used to set a destination register (selected by field r1) to zero or one based on whether a specific flag is set. If this condition is true, it then adds a 16 bit offset to the program counter (i.e. branches).
If you only want to branch r0 or r2 can be used as the destination register as these are immutable. Likewise if you only want to set a register without branching, a zero offset can be used. The assembler implements these variants with macros: bra, beq, bne, brp and brm for the (un)conditional branches and seteq, setne, setpos and setmin for the conditional set instructions.
The technical implementation of the branch condition is
R13[31:29] & cond[2:0] == cond[3] & cond[2:0]
Where cond is specified by the r2 field of the instruction.
This means that the lower 3 bits of cond select the flag(s) to test while the upper bit determines whether the flag should be set or unset. Because R13[31] is always on we also have the option to unconditionally execute the instruction (or never, if we have cond[3]==0)
R1 en r0 are immutable and hardcoded to hold 1 and 0 respectively.
R13 is the flags and alu operation register.
R14 is the stackpointer targeted by the pop and push instructions.
R15 is the program counter (PC, a.k.a. instruction pointer).
Add and Sub
And, Or and Xor
Not
Shift right, Shift left
Cmp and Test
Mul (high and low 32 bits)
Div and rem (signed and unsigned)
Special registers
R1 en r0 are immutable and hardcoded to hold 1 and 0 respectively.
R13 is the flags and alu operation register.
R14 is the stackpointer targeted by the pop and push instructions.
R15 is the program counter (PC, a.k.a. instruction pointer).
Supported alu operations
Add and Sub
And, Or and Xor
Not
Shift right, Shift left
Cmp and Test
Mul (high and low 32 bits)
Div and rem (signed and unsigned)
Sunday 16 February 2020
Regression tests
In a previous post I mentioned the importance of creating a proper framework for testing.
In order to properly debug more complex programs like some functions in the libc library, I created a simulator. This simulator allows you to set breakpoints and single step through a program but also contains features that come in handy for regression tests, like dumping the contents of registers and memory locations after running a program. This information can be checked against known values in a test.
The simulator is great for more complex programs and to verify that elements of the toolchain like the assembler keep working as designed and as such these tests can even be automated as a GitHub push triggered action. These tests do not test the actual hardware though.
Our monitor program is scriptable however, so I designed a couple of tests that execute all instructions and alu operations and verify the results against known values. These tests still don't cover all edge cases but are sufficient to verify proper performance once I start refactoring the cpu and alu. Also, the hardware tests are mirrored in a test for the simulator (in the makefile) , so can be used to test its behavior as well.
There are many things that can be improved in the design, especially when considering resource usage, instruction set design and performance.
I already started rewriting the very complex state machine into something simpler but without changing the instruction set (apart from removing the obsolete loadw and storw instructions). This already saves more than a hundred LUTs but I want to do more.
Things I have in mind:
Now that we have a somewhat decent test framework in place these changes can be more easily tested against regressions. The results of these refactorings will be reported on in future articles.
In order to properly debug more complex programs like some functions in the libc library, I created a simulator. This simulator allows you to set breakpoints and single step through a program but also contains features that come in handy for regression tests, like dumping the contents of registers and memory locations after running a program. This information can be checked against known values in a test.
The simulator is great for more complex programs and to verify that elements of the toolchain like the assembler keep working as designed and as such these tests can even be automated as a GitHub push triggered action. These tests do not test the actual hardware though.
Our monitor program is scriptable however, so I designed a couple of tests that execute all instructions and alu operations and verify the results against known values. These tests still don't cover all edge cases but are sufficient to verify proper performance once I start refactoring the cpu and alu. Also, the hardware tests are mirrored in a test for the simulator (in the makefile) , so can be used to test its behavior as well.
Refactoring the cpu and alu
There are many things that can be improved in the design, especially when considering resource usage, instruction set design and performance.
I already started rewriting the very complex state machine into something simpler but without changing the instruction set (apart from removing the obsolete loadw and storw instructions). This already saves more than a hundred LUTs but I want to do more.
Things I have in mind:
- Removing carry related functionality
- Moving pop/push from 'special' sub-opcodes to their own instruction opcodes
- Merging the setxxx and branch instructions into a combined instruction
Now that we have a somewhat decent test framework in place these changes can be more easily tested against regressions. The results of these refactorings will be reported on in future articles.
Saturday 8 February 2020
A simulator for the Robin cpu
When I started playing around with a SoC design that I wanted to implement on the iCEbreaker, I quickly realized that without proper testing tools even a moderately ambiteous design would quickly become too complex to change and improve.
There exist of course tools to simulate verilog designs and even perform formal verification but my skill level is not quite up to that yet. On top of that I am convinced that many changes that I want to try out in the cpu would benefit from regression tests that are based on real code, i.e. code generated by a compiler instead of artificial tiny bits of code: code that you do not directly implement yourself tends to expose issues in the instruction set or bugs in it hardware implementation quicker than when you deliberately try to construct tiny test cases.
For these more realistic tests an assembler and a C compiler were created and they were used to implement small string and floating point libraries mimicking some of the functions in the C standard library. And they proved their worth as they uncovered among other things bugs in the handling of conditional branches for example.
However, as we will use the assembler and compiler to perform regression tests on the cpu it is important that these tools themselves are as bug free as possible, even when we add new functionality or change implementation details. Ideally some contineous integration would be implemented using GitHub actions that would be triggered on every push.
There is one catch here though: we cannot perform the final test in our chain of dependencies simply because the GitHub machines do not have an iCEbreaker board attached ☺️
We can deal with this challenge by creating a program the will simulate the cpu we have implemented on our fpga. This way we should be able to perform the tests for the compiler/assembler toolchain against this simulator with the added benefit of having more debugging options available (because they are much easier to implement in a bit of Python that in our resource constrained hardware.
The first version of this simulator is now commited and i hope to create some contineous integration actions in the near future.
There exist of course tools to simulate verilog designs and even perform formal verification but my skill level is not quite up to that yet. On top of that I am convinced that many changes that I want to try out in the cpu would benefit from regression tests that are based on real code, i.e. code generated by a compiler instead of artificial tiny bits of code: code that you do not directly implement yourself tends to expose issues in the instruction set or bugs in it hardware implementation quicker than when you deliberately try to construct tiny test cases.
For these more realistic tests an assembler and a C compiler were created and they were used to implement small string and floating point libraries mimicking some of the functions in the C standard library. And they proved their worth as they uncovered among other things bugs in the handling of conditional branches for example.
However, as we will use the assembler and compiler to perform regression tests on the cpu it is important that these tools themselves are as bug free as possible, even when we add new functionality or change implementation details. Ideally some contineous integration would be implemented using GitHub actions that would be triggered on every push.
There is one catch here though: we cannot perform the final test in our chain of dependencies simply because the GitHub machines do not have an iCEbreaker board attached ☺️
We can deal with this challenge by creating a program the will simulate the cpu we have implemented on our fpga. This way we should be able to perform the tests for the compiler/assembler toolchain against this simulator with the added benefit of having more debugging options available (because they are much easier to implement in a bit of Python that in our resource constrained hardware.
The first version of this simulator is now commited and i hope to create some contineous integration actions in the near future.
Saturday 1 February 2020
The Robin SoC on the iCEbreaker: current status
It is perhaps a bit weird, i started first with the iCEstick and then with the iCEbreaker to play around with FPGAs but actually i spend more time tweaking the assembler and C compiler than working on the hardware.
This is a good thing really, for it means that the hardware is working quite well. The reason for writing an assembler and a compiler (besides being fun) is to be able to test the SoC in a more thorough manner than just poking some bytes to see what it does.
You can of course simulate and verify correct behaviour of individual components but i find it infeasible to simulate a complete CPU (at least not with my limited experience). This means that at least in my opinion simulations are the hardware equivalent of unit tests, important in their own right but insufficient to test a complete design in a way proper programs put a CPU through its motions. Another goal is to find out whether the Instruction Set Architecture (ISA) i have thrown together is workable from a compiler point of view.
Over the last couple of weeks i slowly expanded the functionality of the C compiler (although it is a far cry from being standards compliant and it likely will stay that way) and while this compiler currently just supports char and int types I started writing a soft float library. And indeed while doing this I encountered a serious bug in my hardware design: conditional branches were not always taken correctly.
This was exactly why i started writing actual programs that exercise the CPU in more realistic ways. However, most of the things i encountered where bugs in the compiler rather than in the hardware but all the work until now resulted in this provisional list of observations:
It would be nice to have the option to single step through a program on the hardware level. The CPU has a halt instruction and in its ROM i have implemented a small program that dumps all registers but there is no easy way to restart again. The need for single stepping is somewhat lessened now that i implemented print and itoa functions but it still would be nice to have.
I didn't design the UART myself because i wanted to implement a monitor first and my verilog skills weren't up to it at that point. Later I added FIFOs to the UART to prevent overruns and made it available to the CPU via memory mapping. However, currently the UART is a bottleneck: it doesn't perform reliably at speeds over 115200 baud and even then it is fairly easy to confuse it. Obviously this would be a prime candidate for a proper redesign and part of this would be dealing with interrupts so that the CPU would waste its time polling for an available character or making sure enough time has passed to send another one.
The current CPU implementation is quite complex i think: instead of calculating many control signals with a single meaning most of the logic is implemented in a rather long and deep state machine. The SoC as a whole (monitor, CPU and supporting components) currently eats up all but one (!) of the LUTs of the up5k on the iCEbreaker. On the other hand i noticed that even in the current RISC like design, some instructions are never used: the 16 bit move, load and store instructions for example. This of course because my compiler doesn't bother with anything that isn't a byte or a 4byte entity but it nevertheless shows that depending on the area of application we might reconsider the design.
Then I'll probably focus on simplifying the CPU design to free up hardware resources. At that point i also hope to start on properly documenting the design as currently it is a bit of a mess (like this rambling blog ☺️). After that I'll probably start on the difficult stuff, i.e improving the UART, but we'll see.
This is a good thing really, for it means that the hardware is working quite well. The reason for writing an assembler and a compiler (besides being fun) is to be able to test the SoC in a more thorough manner than just poking some bytes to see what it does.
You can of course simulate and verify correct behaviour of individual components but i find it infeasible to simulate a complete CPU (at least not with my limited experience). This means that at least in my opinion simulations are the hardware equivalent of unit tests, important in their own right but insufficient to test a complete design in a way proper programs put a CPU through its motions. Another goal is to find out whether the Instruction Set Architecture (ISA) i have thrown together is workable from a compiler point of view.
Over the last couple of weeks i slowly expanded the functionality of the C compiler (although it is a far cry from being standards compliant and it likely will stay that way) and while this compiler currently just supports char and int types I started writing a soft float library. And indeed while doing this I encountered a serious bug in my hardware design: conditional branches were not always taken correctly.
This was exactly why i started writing actual programs that exercise the CPU in more realistic ways. However, most of the things i encountered where bugs in the compiler rather than in the hardware but all the work until now resulted in this provisional list of observations:
- Single stepping
It would be nice to have the option to single step through a program on the hardware level. The CPU has a halt instruction and in its ROM i have implemented a small program that dumps all registers but there is no easy way to restart again. The need for single stepping is somewhat lessened now that i implemented print and itoa functions but it still would be nice to have.
- Interrupts and the UART
I didn't design the UART myself because i wanted to implement a monitor first and my verilog skills weren't up to it at that point. Later I added FIFOs to the UART to prevent overruns and made it available to the CPU via memory mapping. However, currently the UART is a bottleneck: it doesn't perform reliably at speeds over 115200 baud and even then it is fairly easy to confuse it. Obviously this would be a prime candidate for a proper redesign and part of this would be dealing with interrupts so that the CPU would waste its time polling for an available character or making sure enough time has passed to send another one.
- Complexity
The current CPU implementation is quite complex i think: instead of calculating many control signals with a single meaning most of the logic is implemented in a rather long and deep state machine. The SoC as a whole (monitor, CPU and supporting components) currently eats up all but one (!) of the LUTs of the up5k on the iCEbreaker. On the other hand i noticed that even in the current RISC like design, some instructions are never used: the 16 bit move, load and store instructions for example. This of course because my compiler doesn't bother with anything that isn't a byte or a 4byte entity but it nevertheless shows that depending on the area of application we might reconsider the design.
Next steps
The next couple of weeks i will stay focused on improving the C compiler and the test suite until i am confident that changes in the hardware design can be properly checked for regressions. And because compiler writing and especially code generation is fun I'll probably write about some interesting finds along the way.Then I'll probably focus on simplifying the CPU design to free up hardware resources. At that point i also hope to start on properly documenting the design as currently it is a bit of a mess (like this rambling blog ☺️). After that I'll probably start on the difficult stuff, i.e improving the UART, but we'll see.
Subscribe to:
Posts (Atom)
CPU design
The CPU design as currently implemented largely follows the diagram shown below. It features a 16 x 32bit register file and 16 bit instructi...