Sunday, 16 February 2020

Regression tests

In a previous post I mentioned the importance of creating a proper framework for testing.

In order to properly debug more complex programs like some functions in the libc library, I created a simulator. This simulator allows you to set breakpoints and single step through a program but also contains features that come in handy for regression tests, like dumping the contents of registers and memory locations after running a program. This information can be checked against known values in a test.

The simulator is great for more complex programs and to verify that elements of the toolchain like the assembler keep working as designed and as such these tests can even be automated as a GitHub push triggered action. These tests do not test the actual hardware though.

Our monitor program is scriptable however, so I designed a couple of tests that execute all instructions and alu operations and verify the results against known values. These tests still don't cover all edge cases but are sufficient to verify proper performance once I start refactoring the cpu and alu. Also, the hardware tests are mirrored in a test for the simulator (in the makefile) , so can be used to test its behavior as well.

Refactoring the cpu and alu


There are many things that can be improved in the design, especially when considering resource usage, instruction set design and performance.

I already started rewriting the very complex state machine into something simpler but without changing the instruction set (apart from removing the obsolete loadw and storw instructions). This already saves more than a hundred LUTs but I want to do more.

Things I have in mind:

  • Removing carry related functionality
  • Moving pop/push from 'special' sub-opcodes to their own instruction opcodes
  • Merging the setxxx and branch instructions into a combined instruction


Now that we have a somewhat decent test framework in place these changes can be more easily tested against regressions. The results of these refactorings will be reported on in future articles.

Saturday, 8 February 2020

A simulator for the Robin cpu

When I started playing around with a SoC design that I wanted to implement on the iCEbreaker, I quickly realized that without proper testing tools even a moderately ambiteous design would quickly become too complex to change and improve.

There exist of course tools to simulate verilog designs and even perform formal verification but my skill level is not quite up to that yet. On top of that I am convinced that many changes that I want to try out in the cpu would benefit from regression tests that are based on real code, i.e. code generated by a compiler instead of artificial tiny bits of code: code that you do not directly implement yourself tends to expose issues in the instruction set or bugs in it hardware implementation quicker than when you deliberately try to construct tiny test cases.

For these more realistic tests an assembler and a C compiler were created and they were used  to implement small string and floating point libraries mimicking some of the functions in the C standard library. And they proved their worth as they uncovered among other things bugs in the handling of conditional branches for example.

However, as we will use the assembler and compiler to perform regression tests on the cpu it is important that these tools themselves are as bug free as possible, even when we add new functionality or change implementation details. Ideally some contineous integration would be implemented using GitHub actions that would be triggered on every push.

There is one catch here though: we cannot perform the final test in our chain of dependencies simply because the GitHub machines do not have an iCEbreaker board attached ☺️

We can deal with this challenge by creating a program the will simulate the cpu we have implemented on our fpga. This way we should be able to perform the tests for the compiler/assembler toolchain against this simulator with the added benefit of having more debugging options available (because they are much easier to implement in a bit of Python that in our resource constrained hardware.

The first version of this simulator is now commited and i hope to create some contineous integration actions in the near future.

Saturday, 1 February 2020

The Robin SoC on the iCEbreaker: current status

It is perhaps a bit weird, i started first with the iCEstick and then with the iCEbreaker to play around with FPGAs but actually i spend more time tweaking the assembler and C compiler than working on the hardware.

This is a good thing really, for it means that the hardware is working quite well. The reason for writing an assembler and a compiler (besides being fun) is to be able to test the SoC in a more thorough manner than just poking some bytes to see what it does.

You can of course simulate and verify correct behaviour of individual components but i find it infeasible to simulate a complete CPU (at least not with my limited experience). This means that at least in my opinion simulations are the hardware equivalent of unit tests, important in their own right but insufficient to test a complete design in a way proper programs put a CPU through its motions. Another goal is to find out whether the Instruction Set Architecture (ISA) i have thrown together is workable from a compiler point of view.

Over the last couple of weeks i slowly expanded the functionality of the C compiler (although it is a far cry from being standards compliant and it likely will stay that way) and while this compiler currently just supports char and int types I started writing a soft float library. And indeed while doing this I encountered a serious bug in my hardware design: conditional branches were not always taken correctly.

This was exactly why i started writing actual programs that exercise the CPU in more realistic ways. However, most of the things i encountered where bugs in the compiler rather than in the hardware but all the work until now resulted in this provisional list of observations:


  • Single stepping

It would be nice to have the option to single step through a program on the hardware level. The CPU has a halt instruction and in its ROM i have implemented a small program that dumps all registers but there is no easy way to restart again. The need for single stepping is somewhat lessened now that i implemented print and itoa functions but it still would be nice to have.


  • Interrupts and the UART

I didn't design the UART myself because i wanted to implement a monitor first and my verilog skills weren't up to it at that point. Later I added FIFOs to the UART to prevent overruns and made it available to the CPU via memory mapping. However, currently the UART is a bottleneck: it doesn't perform reliably at speeds over 115200 baud and even then it is fairly easy to confuse it. Obviously this would be a prime candidate for a proper redesign and part of this would be dealing with interrupts so that the CPU would waste its time polling for an available character or making sure enough time has passed to send another one.


  • Complexity

The current CPU implementation is quite complex i think: instead of calculating many control signals with a single meaning most of the logic is implemented in a rather long and deep state machine. The SoC as a whole (monitor, CPU and supporting components) currently eats up all but one (!) of the LUTs of the up5k on the iCEbreaker. On the other hand i noticed that even in the current RISC like design, some instructions are never used: the 16 bit move, load and store instructions for example. This of course because my compiler doesn't bother with anything that isn't a byte or a 4byte entity but it nevertheless shows that depending on the area of application we might reconsider the design.

Next steps

The next couple of weeks i will stay focused on improving the C compiler and the test suite until i am confident that changes in the hardware design can be properly checked for regressions. And because compiler writing and especially code generation is fun I'll probably write about some interesting finds along the way.

Then I'll probably focus on simplifying the CPU design to free up hardware resources. At that point i also hope to start on properly documenting the design as currently it is a bit of a mess (like this rambling blog ☺️). After that I'll probably start on the difficult stuff, i.e improving the UART, but we'll see.

Tuesday, 28 January 2020

Progress

The toolchain (compiler, assembler) for the Robin SoC is really shaping up now, so i started creating a test suite consisting of several functions commonly found in libc.

Even though the C compiler isn't anywhere near completion, it does now produce code that might be ugly but is capable of producing good enough assembly.

The goal of this all is to create a proper test suite of executable programs that can be used to check if future changes in the cpu still perform as designed.

Compiler status


The compiler now supports most control structures (for, while, if/else, break, continue, return) except switch.

It supports char and int data types including pointers and arrays but not yet structs or unions. Some work to support floats is underway (see below).
Variables can be automatic (local) or static (file scope).

Most unary and binary operators are supported including the ternary  ?: operator, pointer dereferencing (*) and function calls, but not the address of operator (&).

Type checking however is weak (almost non existent to be honest 😁) and the assembly code it produces is far from optimal but it works. Storage specifiers like static and volatile are completely ignored.

Implemented functions


The C standard library is rather large and although I have no intention to implement more than a small subset, these functions do provide a good example of realistic functionality which is why I have chosen it as a test vehicle.

The implementation is done from scratch and of course targeted at just the Robin SoC, which makes life a lot easier because a full blown portable libc is humongous.

The current status (with links) is shown below; more functions will probably follow soon, especially low level functions to implement a (bare bones) soft float library.

From string.h


These functions are mainly implemented to support the integer and float conversion functions in stdlib.h but are of course useful on their own as well.

strlen.c
strchr.c
strreverse.c

From stdio.h


File based functions are a long way off still but some basic output over the serial interface is provided here. Later some input functions will be provided as well.

putchar.c
print.c (this one is not actually in libc, it just prints a string)

From stdlib.h


In order to test the float functions later, we absolutely need some basic conversion functions so I implemented those first. Note that the float functions currently just support basic decimal fractions, scientific notation (2.3E-9) is not (yet) supported. They do work with standard ieee float32 numbers but no rigorous compliance is attempted with regard to rounding etc.

atoi.c
ftoi.c
itoa.c
itof.c

Conclusion


These functions need to be thoroughly tested before they can actually be used as a proper test suite for the hardware but I feel we have started quite well.

Wednesday, 22 January 2020

Implementing right shift with left shift

In the previous article I showed an implementation of a left shift instruction that made use of multiplication instead of implementing the barrel shifter directly. Because on an iCEbreaker/up5k multiplication is fast but resources are scarce, this makes sense.

But with left shift available we can now also implement right shift because for a 32 bit register, right shift by N positions can be interpreted as a left shift by 32 - N positions and than looking at the upper 32 bits. This is visualized below


The verilog code needs to be changed only a little bit:



wire shiftq    = op[4:0] == 12;  // true if operaration is shift left
wire shiftqr   = op[4:0] == 13;  // true if operaration is shift right
wire doshift   = shiftq | shiftqr;
wire [5:0] invertshift = 6'd32 - {1'b0,b[4:0]};
wire [4:0] nshift = shiftqr ? invertshift[4:0] : b[4:0];
wire shiftlo   = doshift & ~nshift[4]; // true if shifting < 16 bits
wire shifthi   = doshift &  nshift[4]; // true if shifting >= 16 bits

...

// 4 16x16 bit partial multiplications
// the multiplier is either the b operand or a power of two for a shift
// note that b[31:16] for shift operations [31-0] is always zero
// so when shiftlo is true al_bh and ah_bh still result in zero
// the same is not true the other way around hence the extra shiftq check
// note that the behavior is undefined for shifts > 31
wire [31:0] mult_al_bl = a[15: 0] * (shiftlo ? shiftla16 : doshift ? 16'b0 : b[15: 0]);
wire [31:0] mult_al_bh = a[15: 0] * (shifthi ? shiftla16 : b[31:16]);
wire [31:0] mult_ah_bl = a[31:16] * (shiftlo ? shiftla16 : doshift ? 16'b0 : b[15: 0]);
wire [31:0] mult_ah_bh = a[31:16] * (shifthi ? shiftla16 : b[31:16]);

...

assign result = 
     ...
     shiftq  ? {1'b0, mult64[31:0]} :
     shiftqr ? {1'b0, mult64[63:32]} :
     ...
     ;

The only thing we do here is subtracting the number of positions to shift from 32 if we are dealing with a shift right instruction and also swap in the correct arguments for the multiplication for both the left and the right shift operation. Also, when selecting the final result we take care of selecting the uppermost 32 bits fro the right shift where a left shift would select the lower 32 bits.

Tuesday, 21 January 2020

Turning things around: Implementing shift instructions using multiplications

Often when cpu instruction sets lack a direct multiplication operation, people resort to implementing multiplication by using combinations of shift and add instructions. Even when a multiplication instruction is available, multiplication by simple powers of two might be faster when performed in a single shift operation that is executed in a single clock cycle than with a multiplication instruction that may take many cycles.

Implementing a shift instruction that can shift a 32 bit register by an arbitrary number of bits can consume a lot of resources though. On a the Lattice up5k i found that it could easily use hundreds of LUTs. (The exact number depends on various things other than register size because placement by next-pnr has some randomness and some additional LUTs might be consumed to meet fan-out and timing requirements, so a design size might change considerably even when changing just a few bits. The multiplexers alone already will consume 160 LUTs)

I didn't have that many resources left for my design so i either had to economize or devise a cunning plan 🙂

Turning things around

The up5k on the iCEbreaker board does have something the iCEstick hx1k didn't have: dsp cores, i.e. fast multipliers (at 12Mhz they operate in less than a clock cycle). In fact the up5k has eight dsp cores so i already implemented 32 x 32 bit multiplication using 4 of those, but I still want to have variable shift instructions because they might be needed in all sorts of bit twiddling operations used when implementing a soft floating point library for example.

The fun bit is that we can reuse the multiplication units here if we convert the variable shift amount into a power of two. Because calculating the power of two is simply setting a single bit in an otherwise empty register, this takes far less resources.

The verilog code for this part of the ALU is shown below (ALU code ob GitHub)

// first part: calculate a power of two
wire shiftq    = op[4:0] == 12;     // true if operaration is shift left
wire shiftlo   = shiftq & ~b[4];    // true if shifting < 16 bits
wire shifthi   = shiftq &  b[4];    // true if shifting >= 16 bits

// determine power of two
wire shiftla0  = b[3:0]  == 4'd0;   // 2^0 = 1
wire shiftla1  = b[3:0]  == 4'd1;   // 2^1 = 2
wire shiftla2  = b[3:0]  == 4'd2;   // 2^2 = 3
wire shiftla3  = b[3:0]  == 4'd3;   // ... etc 
...
wire shiftla15 = b[3:0]  == 4'd15;

// combine into 16 bit word
wire [15:0] shiftla16 = {shiftla15,shiftla14,shiftla13,shiftla12,
                         shiftla11,shiftla10,shiftla9 ,shiftla8 ,
                         shiftla7 ,shiftla6 ,shiftla5 ,shiftla4 ,
                         shiftla3 ,shiftla2 ,shiftla1 ,shiftla0};

// second part: reusing the multiplication code
// 4 16x16 bit partial multiplications
// the multiplier is either the b operand or a power of two for a shift
// note that b[31:16] for shift operations [31-0] is always zero
// so when shiftlo is true al_bh and ah_bh still result in zero
// the same is not true the other way around hence the extra shiftq check
// note that the behavior is undefined for shifts > 31

wire [31:0] mult_al_bl = a[15: 0] * (shiftlo ? shiftla16 : shiftq ? 16'b0 : b[15: 0]);
wire [31:0] mult_al_bh = a[15: 0] * (shifthi ? shiftla16 : b[31:16]);
wire [31:0] mult_ah_bl = a[31:16] * (shiftlo ? shiftla16 : shiftq ? 16'b0 : b[15: 0]);
wire [31:0] mult_ah_bh = a[31:16] * (shifthi ? shiftla16 : b[31:16]);

// combine the intermediate results into a 64 bit result
wire [63:0] mult64 = {32'b0,mult_al_bl} + {16'b0,mult_al_bh,16'b0}
                   + {16'b0,mult_ah_bl,16'b0} + {mult_ah_bh,32'b0};

// final part: compute the result of the whole ALU
wire [32:0] result;

assign result = 
            op[4:0] == 0 ? add :
            op[4:0] == 1 ? adc :
            ...
            shiftq ? {1'b0, mult64[31:0]} :
            ...
            ;

The first half constructs rather than computes the power of two by creating a single 16 bit word with just a single bit set.

The second half selects the proper multiplier parts based on the instruction (regular multiplication or shift left)

The final part is about returning the result: it will be in the lower 32 bits of the combined results. Note that shifting by 32 bits should return zero but selecting for this explicit situation will add more LUTs to my design than I have currently available (using 5181 out of 5280). So for this implementation the behavior for shifts outside the range [0-31] is not defined.

Implementation notes

The code is simple because we do not need all multiplication and addition steps of a full 32 x 32 bit multiplication because if a number is a power of two, only one of the two 16 bits of the multiplier will be non zero (for shift amounts < 32).

Multiplying two 32 bit numbers involves four 16 bit multiplications (of each combination of the 16 bit halves of the multiplier and multiplicand). The four intermediate 32 bit results are then added to a 64 bit result.

If one of the halves of the multiplier is zero then two multiplication steps are no longer necessary as their result will be zero and the corresponding addition steps will be redundant too.



LUT Usage


Just to give some idea about the resources used by a barrel shifter vs. this multiplication based implementation I have created bare bone implementations (shiftleft.v and shiftleft2.v) and checked those with yosys/next-pnr.

shiftleft.v (barrel)shiftleft2.v (multiplier)
ICESTORM_LC19967
ICESTORM_DSP03

(side note: the stand alone multiplier implementation only uses 3 DSPs compared to the 4 used by the full ALU but that is because yosys optimizes away the multiplication of both upper halves of the words as they can only end up in the upper 32 bits of the result which we do not use for the left shift)

Sunday, 19 January 2020

seteq and setne instructions

Because the C99 standard (and newer) requires [section 6.5.8] comparison operators like < > <= => and logical (non-bitwise) operators like && and || to return either zero or one even though any non-zero value in a logical expression will be treated as true the code that my C-compiler generates for the operators is rather bulky, just to stay standard compliant.

The reason for this is because I have not implemented any convenient instruction to convert a non-zero value to one. So the code for the return statement in the code below

void showcase_compare(int a){
 return a == 42;
}

is converted to the assembly snippet show below (a is R2, 42 in r3)

        load    flags,#alu_cmp      ; binop(==)
        alu     r2,r3,r2            
        beq     post_0003           ; equal
        move    r2,0,0              
        bra     post_0004           
post_0003:
        move    r2,0,1              
post_0004:
So in order to get a proper one or zero we always have to branch.

Seteq and setne

To prevent this kind of unnecessary branching I added two new instructions to the Robin cpu: seteq and setne that set the contents of a register to either zero or one depending on the zero flag. The compiler can now use these instructions to simplify the code to:

        load    flags,#alu_cmp      ; binop(==)
        alu     r2,r3,r2            
        seteq   r2
This saves not only 3 instructions in code size, but also 2 or 3 instructions being executed (2 if equal, 3 if not equal).

Setpos and setmin


To complete the set and make it easier to produce code for the < <= > and >= operators the setpos and setmin instructions are also implemented.

CPU design

The CPU design as currently implemented largely follows the diagram shown below. It features a 16 x 32bit register file and 16 bit instructi...