When I started playing around with a SoC design that I wanted to implement on the iCEbreaker, I quickly realized that without proper testing tools even a moderately ambitious design would quickly become too complex to change and improve.
There are of course tools to simulate Verilog designs and even perform formal verification, but my skill level is not quite up to that yet. On top of that, I am convinced that many changes that I want to try out in the cpu would benefit from regression tests based on real code, i.e. code generated by a compiler instead of artificial tiny bits of code. Code that you do not directly write yourself tends to expose issues in the instruction set or bugs in its hardware implementation quicker than deliberately constructed tiny test cases do.
For these more realistic tests I created an assembler and a C compiler and used them to implement small string and floating point libraries mimicking some of the functions in the C standard library. They proved their worth: among other things, they uncovered bugs in the handling of conditional branches.
However, because we will use the assembler and compiler to perform regression tests on the cpu, it is important that these tools themselves are as bug free as possible, even when we add new functionality or change implementation details. Ideally, some continuous integration would be implemented using GitHub Actions, triggered on every push.
There is one catch here though: we cannot perform the final test in our chain of dependencies simply because the GitHub machines do not have an iCEbreaker board attached ☺️
We can deal with this challenge by creating a program that will simulate the cpu we have implemented on our FPGA. This way we should be able to run the tests for the compiler/assembler toolchain against this simulator, with the added benefit of having more debugging options available (because they are much easier to implement in a bit of Python than in our resource constrained hardware).
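To give an idea of why debugging is cheap in a Python simulator, here is a minimal sketch of a fetch/execute loop that records a trace of every instruction. It is not the committed simulator and does not model the Robin instruction set: the class name `ToyCpu`, the three toy instructions (`loadi`, `add`, `halt`) and the choice of 16 registers of 32 bits are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

MASK32 = 0xFFFFFFFF  # registers are assumed to be 32 bits wide


@dataclass
class ToyCpu:
    """Hypothetical toy cpu, only meant to show the simulate-and-trace idea."""
    regs: list = field(default_factory=lambda: [0] * 16)  # assumed 16 registers
    pc: int = 0
    halted: bool = False
    trace: list = field(default_factory=list)

    def step(self, program):
        """Fetch, trace and execute one instruction tuple."""
        op, *args = program[self.pc]
        self.trace.append((self.pc, op, tuple(args)))  # debugging hook
        self.pc += 1
        if op == "loadi":      # loadi rd, value
            rd, value = args
            self.regs[rd] = value & MASK32
        elif op == "add":      # add rd, ra, rb
            rd, ra, rb = args
            self.regs[rd] = (self.regs[ra] + self.regs[rb]) & MASK32
        elif op == "halt":
            self.halted = True
        else:
            raise ValueError(f"unknown instruction {op!r}")

    def run(self, program, max_steps=1000):
        """Run until halt or until max_steps instructions were executed."""
        steps = 0
        while not self.halted and steps < max_steps:
            self.step(program)
            steps += 1
        return steps


program = [("loadi", 2, 40), ("loadi", 3, 2), ("add", 2, 2, 3), ("halt",)]
cpu = ToyCpu()
cpu.run(program)
print(cpu.regs[2], cpu.trace)
```

A regression test can then assert on register contents after a run, and the trace list shows exactly which instructions were executed when a test fails.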
The first version of this simulator is now committed and I hope to create some continuous integration actions in the near future.
Saturday, 8 February 2020
Tuesday, 28 January 2020
Progress
The toolchain (compiler, assembler) for the Robin SoC is really shaping up now, so I started creating a test suite consisting of several functions commonly found in libc.
Even though the C compiler isn't anywhere near completion, it now produces assembly that might be ugly but is good enough.
The goal of all this is to create a proper test suite of executable programs that can be used to check whether the cpu still performs as designed after future changes.
Compiler status
The compiler now supports most control structures (for, while, if/else, break, continue, return) except switch.
It supports char and int data types including pointers and arrays but not yet structs or unions. Some work to support floats is underway (see below).
Variables can be automatic (local) or static (file scope).
Most unary and binary operators are supported including the ternary ?: operator, pointer dereferencing (*) and function calls, but not the address-of operator (&).
Type checking however is weak (almost non-existent to be honest 😁) and the assembly code it produces is far from optimal, but it works. Storage specifiers like static and volatile are completely ignored.
Implemented functions
The C standard library is rather large and although I have no intention of implementing more than a small subset, its functions do provide a good example of realistic functionality, which is why I have chosen it as a test vehicle.
The implementation is done from scratch and of course targeted at just the Robin SoC, which makes life a lot easier because a full blown portable libc is humongous.
The current status (with links) is shown below; more functions will probably follow soon, especially low level functions to implement a (bare bones) soft float library.
From string.h
These functions are mainly implemented to support the integer and float conversion functions in stdlib.h but are of course useful on their own as well.
strlen.c
strchr.c
strreverse.c
From stdio.h
File based functions are still a long way off, but some basic output over the serial interface is provided here. Later some input functions will be provided as well.
putchar.c
print.c (this one is not actually in libc, it just prints a string)
From stdlib.h
In order to test the float functions later, we absolutely need some basic conversion functions, so I implemented those first. Note that the float functions currently support just basic decimal fractions; scientific notation (2.3E-9) is not (yet) supported. They do work with standard IEEE float32 numbers, but no rigorous compliance is attempted with regard to rounding etc.
atoi.c
ftoi.c
itoa.c
itof.c
Conclusion
These functions need to be thoroughly tested before they can actually be used as a proper test suite for the hardware, but I feel we have started quite well.
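One way to test the conversion functions is to compare their results against reference values computed on the host. The sketch below uses Python's struct module to produce IEEE float32 bit patterns. The function names are hypothetical and nothing here is part of the Robin toolchain. Because struct rounds to nearest even and the library makes no promise about rounding, results for values that are not exactly representable may differ in the last bit.

```python
import struct


def float32_bits(value: float) -> int:
    """Return the IEEE float32 bit pattern of value as an unsigned 32-bit int."""
    return struct.unpack(">I", struct.pack(">f", value))[0]


def bits_to_float32(bits: int) -> float:
    """Interpret an unsigned 32-bit int as an IEEE float32."""
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFFFFFF))[0]


def itof_reference(n: int) -> int:
    """Expected bit pattern for converting the integer n to float32."""
    return float32_bits(float(n))


def ftoi_reference(bits: int) -> int:
    """Expected integer for a float32 bit pattern (truncates toward zero)."""
    return int(bits_to_float32(bits))


print(hex(itof_reference(42)))
print(ftoi_reference(0x40200000))
```

A test program running on the simulator or the board could print the result of itof or ftoi, and a small script could then compare it with these reference values.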
Sunday, 19 January 2020
seteq and setne instructions
The C99 standard (and newer) requires [section 6.5.8 and following] comparison operators like <, >, <= and >= and logical (non-bitwise) operators like && and || to return either zero or one, even though any non-zero value in a logical expression is treated as true. As a result, the code that my C compiler generates for these operators is rather bulky, just to stay standard compliant.
The reason is that I have not implemented any convenient instruction to convert a non-zero value to one. So the return statement in the code below
int showcase_compare(int a){
    return a == 42;
}
is converted to the assembly snippet shown below (a is in r2, 42 in r3)
load flags,#alu_cmp ; binop(==)
alu r2,r3,r2
beq post_0003 ; equal
move r2,0,0
bra post_0004
post_0003:
move r2,0,1
post_0004:
So in order to get a proper one or zero we always have to branch.
Seteq and setne
To prevent this kind of unnecessary branching I added two new instructions to the Robin cpu: seteq and setne that set the contents of a register to either zero or one depending on the zero flag. The compiler can now use these instructions to simplify the code to:
load flags,#alu_cmp ; binop(==)
alu r2,r3,r2
seteq r2
This saves not only 3 instructions in code size, but also replaces the 2 or 3 instructions that were executed on the branching path (2 if equal, 3 if not equal) with a single seteq.
Setpos and setmin
To complete the set and make it easier to produce code for the < <= > and >= operators the setpos and setmin instructions are also implemented.
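The effect of seteq and setne can be modelled in a few lines of Python. This is only a sketch of the behaviour described in this post: the zero flag position (bit 29) and the always-one bit 31 are taken from the flags description in the "Additional instructions" post, while `compare_flags` is a hypothetical stand-in for the alu compare operation and is assumed to set the zero flag when both operands are equal.

```python
ALWAYS_ONE = 1 << 31  # bit 31 of the flags register is always 1
ZERO_FLAG = 1 << 29   # zero flag


def compare_flags(a: int, b: int) -> int:
    """Hypothetical model of alu_cmp: zero flag set when a == b (assumption)."""
    flags = ALWAYS_ONE
    if (a - b) & 0xFFFFFFFF == 0:
        flags |= ZERO_FLAG
    return flags


def seteq(flags: int) -> int:
    """One if the zero flag is set, zero otherwise."""
    return 1 if flags & ZERO_FLAG else 0


def setne(flags: int) -> int:
    """One if the zero flag is clear, zero otherwise."""
    return 0 if flags & ZERO_FLAG else 1


def showcase_compare(a: int) -> int:
    """Branch-free equivalent of 'return a == 42;'."""
    return seteq(compare_flags(a, 42))


print(showcase_compare(42), showcase_compare(7))
```

Whatever the operands are, the result is always exactly zero or one, which is what the C standard asks for.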
Thursday, 16 January 2020
Additional instructions
The Robin cpu/soc is coming along nicely, but when I started playing around with implementing a compiler it quickly became clear that code generation was hindered by not having relative branch instructions that could reach destinations beyond -128 or +127 bytes (an 8-bit signed integer).
Long branch
So I expanded the instruction set to take a full 32-bit signed offset. If the 8-bit offset is zero, the next 4 bytes will be used as the offset. The complete instruction now looks like this:
[15:12] opcode (13)
[11: 8] condition
[ 7: 0] offset
Optional: 4 bytes offset (if offset == 0)
The condition is checked against the flags register. The highest bit of the condition determines if a flag should be set or unset, and because bit 31 of the flags register is always 1 we even have an option for an unconditional branch (or even to never take the branch, which is rather useless).
if cond[2:0] & R13[31:29] == cond[3] then PC += offset ? offset : (PC)
Bits 30 and 29 of the flags register are the negative (sign) and zero bit respectively.
Stack instructions
I also added pop and push instructions to reduce code size, even though this is a bit at odds with the RISC philosophy. These always use R14 as the stack pointer and the opcode looks like this:
[15:12] opcode (15)
[11: 8] register
[ 7: 0] 1 = pop, 2 = push
Verilog observations
I have a few other instructions I wish to implement, for example to sign extend a byte to a long, but I am already using almost all available LUTs on the iCEbreaker.
There are a few options though: until now I have been using nextpnr's heap placer, which is quite fast (just a few seconds on my machine). The sa placer, however, is much slower (more than 60 seconds) but also generates a result that saves me about 250 LUTs!
The second option is to play around with the numerical values of the state labels. This may sound weird, but the current implementation of the cpu has 29 states, i.e. a 5-bit state register. If I number them consecutively from 0 to 28, yosys uses more LUTs than when I assign the last state the number 31. Apparently the huge multiplexer generated for this state machine benefits from gaps in the list of possible states.
In the end I intend to simplify and optimise this design, but for now I stick with the sa placer.
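Coming back to the branch and stack encodings described earlier in this post, the sketch below shows how they could be decoded in Python. It is my reading of the bit layouts and the pseudo-code, not the Verilog implementation. I assume the branch is taken when "any selected flag is set" equals cond[3], and I pass the optional 32-bit offset in as an already decoded integer, because the byte order of the extra 4 bytes is not stated. All function names are hypothetical.

```python
def branch_taken(cond: int, flags: int) -> bool:
    """Evaluate a 4-bit branch condition against the flags register (R13)."""
    mask = cond & 0b111              # cond[2:0] selects flags bits 31..29
    want_set = (cond >> 3) & 1       # cond[3]: selected flag must be set or unset
    top = (flags >> 29) & 0b111      # R13[31:29]
    return int((mask & top) != 0) == want_set


def decode_branch(instr: int, long_offset=None):
    """Return (condition, signed offset) for a 16-bit branch instruction."""
    if (instr >> 12) & 0xF != 13:
        raise ValueError("not a branch instruction")
    cond = (instr >> 8) & 0xF
    offset8 = instr & 0xFF
    if offset8 == 0:                 # zero means the long offset follows
        if long_offset is None:
            raise ValueError("long branch needs the 32-bit offset")
        return cond, long_offset
    # sign extend the 8-bit offset
    return cond, offset8 - 256 if offset8 & 0x80 else offset8


def decode_stack(instr: int):
    """Return ('pop' or 'push', register number) for a stack instruction."""
    if (instr >> 12) & 0xF != 15:
        raise ValueError("not a stack instruction")
    operation = {1: "pop", 2: "push"}[instr & 0xFF]
    return operation, (instr >> 8) & 0xF


flags = (1 << 31) | (1 << 29)        # always-one bit plus zero flag
print(branch_taken(0b1001, flags), decode_branch(0xD9FE), decode_stack(0xF302))
```

With this reading, condition 0b1100 always branches and 0b0100 never does, which matches the remark about bit 31 always being 1.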
Sunday, 12 January 2020
Compiler
Assembler is nice, but to get a feel for how well the SoC design fits day-to-day programming tasks I started crafting a small C compiler.
I probably should call it a compiler for a 'C-like language' because it implements a tiny subset of C, just enough to implement some basic functions. Currently it supports int and char as well as pointers, and you can define and call functions. Control structures are limited to while, if/else and return, but quite a few binary and unary operators have been implemented already.
Because the compiler is based on the pycparser module, which can recognize the full C99 spec, it will be rather straightforward to implement missing features.
Pain points
Even for the small string manipulation functions it quickly becomes clear that additional instructions for the CPU would be welcome. The biggest benefit would probably be to have:
- Conditional branch instructions with a larger offset than just one byte.
- Pop/push instructions.
Currently these are implemented as two instructions, one to change the stack pointer and another to load or store the register. This approach makes it possible to use any register as a stack pointer, but for compiled C we need just one.
- Better byte loading.
If we load a byte into a register we often have to zero the register out before loading it. The current behaviour makes it easy to change just the lower byte of the flags register, but otherwise it is less convenient.
- An ALU operation to convert an int to a boolean.
This would greatly reduce the overhead in expressions involving && and ||
Plenty of room for improvement here 😁
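The byte loading pain point is easy to illustrate with a small Python model. This is my reading of the behaviour described in the list above: a byte load replaces only the lowest 8 bits of the destination register and leaves the upper 24 bits alone. The function names are hypothetical.

```python
def load_byte(reg_value: int, byte: int) -> int:
    """Model of the current byte load: only the lowest 8 bits change."""
    return (reg_value & 0xFFFFFF00) | (byte & 0xFF)


def load_byte_zero_extended(byte: int) -> int:
    """What compiled C usually wants: the byte with the upper bits cleared."""
    return byte & 0xFF


stale = 0x12345678                       # whatever was left in the register
print(hex(load_byte(stale, 0x41)))       # upper bytes survive the load
print(hex(load_byte(0, 0x41)))           # zeroing first gives the plain byte
```

The extra instruction needed to zero the register is the overhead that a zero-extending byte load would remove.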