Sunday, 15 March 2020
Count leading zeros in verilog
It documents (with code) the implementation of a new module to count leading zeros.
Saturday, 14 March 2020
Verilog test benches
for(i=7; i>=0 ; i--) begin
...
end
The simple solution of course was to replace
i--
with i=i-1
, but it is still not completely clear to me what the exact differences are in Verilog versions supported by Yosys and Icarus.
clz_tb: ../clz.v clz_tb.v
$(VERILOG) -o $@ $^ ; vvp $@ | awk "/FATAL/{exit(1)}"
Doubts about Verilog
I can't tell at this point if VHDL or other hardware definition languages are any better but the longer i work with Verilog the more doubts i have: It does not clearly separate simulation from synthesis, its syntax (especially scoping rules for variables) is illogical, you can't define functions with more than one statement (at least not in verilog-2001) and every implementation is allowed to diverge from the standard by chosing to implement some features or not. I am not sure why people in the hardware world accept this, couldn't imagine this happening to Python implementations for example.
Anyway, it works sort of, so we'll see where it gets us; maybe with a bit more experience it will be less awkward to work with.
Saturday, 7 March 2020
Optimizing the fetch decode execute cycle II
Because we know the opcode for any instruction already in the FETCH3 cycle we can set the mem_raddr register with the contents of the stackpointer if we are dealing with a pop instruction or keep on incrementing the mem_raddr for those instructions that are followed by some bytes after the instructions itself, like the two byte offset fir the branch instruction and the four bytes of the load immediate instruction. And if we set the mem_raddr register two cycles earlier that means that we can actually read those bytes two cycles earlier as well.
This newly implemented scenario is summed up in the table below (click to enlarge)
Some more opportunities
Thursday, 5 March 2020
Optimizing the fetch decode execute cycle
By closely looking at the timing diagrams for memory access we could reduce the number of cycles in the fetch part significantly. Meanwhile I implemented some additional optimizations and currently the MOVE and LOADL instruction clock in (pun intended) at 4 and 9 cycles respectively, a speedup of about 2x compared to the initial implementation.
The diagram below illustrates the different activities that take place in the various states (click to enlarge):
The important bit to understand here is that we do not read anything from memory in the decode and exec1 states. For some instructions this is inevitable because only after reading the second byte of an instruction (in fetch4) and adding the two source registers (available in decode, because adding those to registers needs a clock cycle) can we load the mem_raddr register and start loading two cycles later.
However, for instructions like LOADIL (load immediate ling word) and SETBRA, the data and offset respectively are located just after the actual instruction, so we could keep on incrementing the mem_raddr in states fetch 3 and fetch 4 so that the first two bytes would be available in the decode and exec 1 states as indicated by the highlighted 'gaps' in the table.
Even for the POP instruction we know what the address should be because we can refer to register 14 (the stackpointer). The only thing we have to keep in ind that we need to decide whether to keep on incrementing the mem_raddr register or to load it with the address in the stack pointer. We can make this decision in state fetch 3 already because there we read the high byte of the instruction which contains the intructions opcode.
So next on my agenda is to see whether we can indeed implement this idea. it would potentially shave of another 2 cycles from the the LOADIL, SETBRA and POP instructions so it is certainly worth the effort.
Saturday, 29 February 2020
The Robin SoC has a dedicated website now
It is implemented as a GitHub site, check it out from time to time as articles get added.
Monday, 24 February 2020
iCE40 BRAM & SPRAM access: The need for speed
ip1
is r[15]+1
):
case(state)
FETCH1 : begin
r[0] <= 0;
r[1] <= 1;
r[13][31] <= 1; // force the always on bit
mem_raddr <= ip;
state <= halted ? FETCH1 : FETCH2;
end
FETCH2 : state <= FETCH3;
FETCH3 : begin
instruction[15:8] <= mem_data_out;
r[15] <= ip1;
state <= FETCH4;
end
FETCH4 : begin
mem_raddr <= ip;
state <= FETCH5;
end
FETCH5 : state <= FETCH6;
FETCH6 : begin
instruction[7:0] <= mem_data_out;
r[15] <= ip1;
...
So between every assignment to the mem_raddr
register (in state FETCH1 and FETCH4) and the retrieval of the byte from the mem_data_out
register (in state FETCH3 and FETCH6) we had a wait cycle.
Now it is true that for the ice40 BRAM there needs to be two clock cycles between setting the address and reading the byte, but we can already set the new address in the next cycle, allowing us to read a byte every clock cycle once we set the initial address.
This alternative approach looks like this:
case(state)
FETCH1 : begin
r[0] <= 0;
r[1] <= 1;
r[13][31] <= 1; // force the always on bit
mem_raddr <= ip;
state <= halted ? FETCH1 : FETCH2;
end
FETCH2 : begin
state <= FETCH3;
r[15] <= ip1;
mem_raddr <= ip1;
end
FETCH3 : begin
instruction[15:8] <= mem_data_out;
state <= FETCH4;
end
FETCH4 : begin
instruction[7:0] <= mem_data_out;
r[15] <= ip1;
...
So we set the address in states FETCH1 and FETCH2 and read the corresponding bytes in states FETCH3 and FETCH4 respectively, saving us 2 clock cycles for every instruction. Since the most used instructions took 8 cycles and now 6, this is a reduction of 25%. Not bad I think.
And although not very well documented (or documented at all actually) this setup works for SPRAMS as well.
Sunday, 23 February 2020
The Robin SoC on the iCEbreaker: current status
Simplification
The main decoding loop in the CPU was rather convoluted so both were redesigned a bit to improve readability of the Verilog code as well as reduce resource consumption. (The ALU code was updated in place, the CPU code got a new file)
Because by now I also have some experience with the code that is being generated by the compiler, I was able to remove unused instructions and ALU operations. Previously the pop, push and setxxx instructions were considered sub-instructions within one special opcode, now they are individual instructions (in case of pop and push) or rolled into a single set-and-branch instruction. The new instruction set architecture was highlighted in a separate article.
Less resources
All in all this redesign shrunk the number of LUTs consumed from 5279 (yes, just one LUT removed from 100%) to 4879 (92%), which is pretty neat because it leaves some room for additional functionality or tweaks. The biggest challenge by the way is Yosys: even slight changes in the design, like assigning different values to labels of a case statement that is not full, may result in a different number of LUTs consumed. This is something that needs some more research, maybe Yosys offers additional optimization options that let me get the lowest resource count in a more predictable manner.
Better testing
A significant amount of effort was spent on designing more and better regression tests. Both for the SoC and the support programs (assembler, simulator, ...) regression tests and syntax checkers were added. Most of these were also added to GitHub push actions, with the exception of the actual hardware tests because I cannot run those on GitHub. And of course this mainly done to show a few green banners on the repository home page 😀
Bug fixes
With a better testing framework in place it is far easier to check whether changes don't inadvertently break something. This was put to work in fixing one of the more annoying bugs left in the ALU design: previously shift left by more than 31 and shift right by 0 did not give a proper result. This is now fixed.
Frustrations
The up5k on the iCEbreaker board has 8 dsp cores. We currently use 4 of them to implement 32x32 bit multiplication. The SB_MAC16 primitives we use for this are inferred by Yosys from some multiplication statements we use in the ALU (i.e. we do not instantiate them directly) and work fine.
However, when I want to instantiate some of them directly and configure them to be used as 32 bit adders these instantiations will still multiply instead of add! No matter what I do, the result stays teh same. I have to admit I have no idea how Yosys infers stuff so it might very well be that my direct instantiation gets rewritten by some Yosys stage, so I will have to do some more research here.
What next?
I think next on the agenda is performance: I think I use too many read states for the fetch/decode/execute cycle. The Lattice technical documentation seems to imply we can read and write new data every clock cycle, at least for block ram. Unfortunately the docs for the SPRAM are less clear. Anyway, this area for sure needs some attention.
Saturday, 22 February 2020
Instruction Set Architecture
All instructions are 2 bytes long, with the exception of loadil (load immediate long) which is 6 bytes and the set and branch instruction which is 4 bytes.
Encoding
Instructions consist of 4 fields each 4 bits wide. The first field is the opcode, the other 3 fields typically specify registers and are called r2,r1 and r0 respectively, although there are exceptions. For loadi (load immediate) r1 and r0 together encode a byte value and for set and branch r2 encodes the condition.
move | move r2,r1,r0 | r2 = r1 + r0 | move sum of source registers to destination |
pop | pop r2 | r2 = (sp) ; sp += 4 | pop top of stack to destination register |
push | push r2 | sp -= 4; (sp) = r2 | push register onto stack |
alu | alu r2,r1,r0 | r2 = r1 op r0 | perform alu operation (op = r[13][7:0]) |
mover | mover r2,r1,n | r2 = r1 + 4*n | add multiple of 4, n = [-8,7] |
stor | stor r2,r1,r0 | (r1 + r0) = r2[7:0] | store byte in memory |
storl | storl r2,r1,r0 | (r1 + r0) = r2 | store long word in memory |
load | load r2,r1,r0 | r2[7:0] = (r1 + r0) | load byte from memory |
loadl | loadl r2,r1,r0 | r2 = (r1 + r0) | load long word from memory |
loadi | loadi r2,#n | r2[7:0] = n | load byte immediate, n = 8b bit value |
loadil | loadil r2,#n | r2 = (pc +2); pc+=4 | load long word immediate |
jal | jal r2,r1,r0 | r2 = pc; pc = r1+r0 | jump and link |
halt | halt | halt execution | |
setbXX | setbXX r1,off | see below | set and branch on condition XX |
Notes
- Loading a byte from memory or immediately does not sign extend it. This means that a destinations register may need to be zeroed out before loading the byte.
- The alu operation performs the operation stored in the lower byte of R13. This means that choosing the operation and actually performing it are two separate steps. You can reuse this operation, if performing multiple additions for example there is no need to reload R13
- R13 is also the flags register: bits 31 is always set while bit 30 and 29 are the sign and zero bit respectively.
- The alu does not calculate a carry or borrow
Set and branch
Special registers
R1 en r0 are immutable and hardcoded to hold 1 and 0 respectively.
R13 is the flags and alu operation register.
R14 is the stackpointer targeted by the pop and push instructions.
R15 is the program counter (PC, a.k.a. instruction pointer).
Supported alu operations
Add and Sub
And, Or and Xor
Not
Shift right, Shift left
Cmp and Test
Mul (high and low 32 bits)
Div and rem (signed and unsigned)
Sunday, 16 February 2020
Regression tests
In order to properly debug more complex programs like some functions in the libc library, I created a simulator. This simulator allows you to set breakpoints and single step through a program but also contains features that come in handy for regression tests, like dumping the contents of registers and memory locations after running a program. This information can be checked against known values in a test.
The simulator is great for more complex programs and to verify that elements of the toolchain like the assembler keep working as designed and as such these tests can even be automated as a GitHub push triggered action. These tests do not test the actual hardware though.
Our monitor program is scriptable however, so I designed a couple of tests that execute all instructions and alu operations and verify the results against known values. These tests still don't cover all edge cases but are sufficient to verify proper performance once I start refactoring the cpu and alu. Also, the hardware tests are mirrored in a test for the simulator (in the makefile) , so can be used to test its behavior as well.
Refactoring the cpu and alu
There are many things that can be improved in the design, especially when considering resource usage, instruction set design and performance.
I already started rewriting the very complex state machine into something simpler but without changing the instruction set (apart from removing the obsolete loadw and storw instructions). This already saves more than a hundred LUTs but I want to do more.
Things I have in mind:
- Removing carry related functionality
- Moving pop/push from 'special' sub-opcodes to their own instruction opcodes
- Merging the setxxx and branch instructions into a combined instruction
Now that we have a somewhat decent test framework in place these changes can be more easily tested against regressions. The results of these refactorings will be reported on in future articles.
Saturday, 8 February 2020
A simulator for the Robin cpu
There exist of course tools to simulate verilog designs and even perform formal verification but my skill level is not quite up to that yet. On top of that I am convinced that many changes that I want to try out in the cpu would benefit from regression tests that are based on real code, i.e. code generated by a compiler instead of artificial tiny bits of code: code that you do not directly implement yourself tends to expose issues in the instruction set or bugs in it hardware implementation quicker than when you deliberately try to construct tiny test cases.
For these more realistic tests an assembler and a C compiler were created and they were used to implement small string and floating point libraries mimicking some of the functions in the C standard library. And they proved their worth as they uncovered among other things bugs in the handling of conditional branches for example.
However, as we will use the assembler and compiler to perform regression tests on the cpu it is important that these tools themselves are as bug free as possible, even when we add new functionality or change implementation details. Ideally some contineous integration would be implemented using GitHub actions that would be triggered on every push.
There is one catch here though: we cannot perform the final test in our chain of dependencies simply because the GitHub machines do not have an iCEbreaker board attached ☺️
We can deal with this challenge by creating a program the will simulate the cpu we have implemented on our fpga. This way we should be able to perform the tests for the compiler/assembler toolchain against this simulator with the added benefit of having more debugging options available (because they are much easier to implement in a bit of Python that in our resource constrained hardware.
The first version of this simulator is now commited and i hope to create some contineous integration actions in the near future.
Saturday, 1 February 2020
The Robin SoC on the iCEbreaker: current status
This is a good thing really, for it means that the hardware is working quite well. The reason for writing an assembler and a compiler (besides being fun) is to be able to test the SoC in a more thorough manner than just poking some bytes to see what it does.
You can of course simulate and verify correct behaviour of individual components but i find it infeasible to simulate a complete CPU (at least not with my limited experience). This means that at least in my opinion simulations are the hardware equivalent of unit tests, important in their own right but insufficient to test a complete design in a way proper programs put a CPU through its motions. Another goal is to find out whether the Instruction Set Architecture (ISA) i have thrown together is workable from a compiler point of view.
Over the last couple of weeks i slowly expanded the functionality of the C compiler (although it is a far cry from being standards compliant and it likely will stay that way) and while this compiler currently just supports char and int types I started writing a soft float library. And indeed while doing this I encountered a serious bug in my hardware design: conditional branches were not always taken correctly.
This was exactly why i started writing actual programs that exercise the CPU in more realistic ways. However, most of the things i encountered where bugs in the compiler rather than in the hardware but all the work until now resulted in this provisional list of observations:
- Single stepping
It would be nice to have the option to single step through a program on the hardware level. The CPU has a halt instruction and in its ROM i have implemented a small program that dumps all registers but there is no easy way to restart again. The need for single stepping is somewhat lessened now that i implemented print and itoa functions but it still would be nice to have.
- Interrupts and the UART
I didn't design the UART myself because i wanted to implement a monitor first and my verilog skills weren't up to it at that point. Later I added FIFOs to the UART to prevent overruns and made it available to the CPU via memory mapping. However, currently the UART is a bottleneck: it doesn't perform reliably at speeds over 115200 baud and even then it is fairly easy to confuse it. Obviously this would be a prime candidate for a proper redesign and part of this would be dealing with interrupts so that the CPU would waste its time polling for an available character or making sure enough time has passed to send another one.
- Complexity
The current CPU implementation is quite complex i think: instead of calculating many control signals with a single meaning most of the logic is implemented in a rather long and deep state machine. The SoC as a whole (monitor, CPU and supporting components) currently eats up all but one (!) of the LUTs of the up5k on the iCEbreaker. On the other hand i noticed that even in the current RISC like design, some instructions are never used: the 16 bit move, load and store instructions for example. This of course because my compiler doesn't bother with anything that isn't a byte or a 4byte entity but it nevertheless shows that depending on the area of application we might reconsider the design.
Next steps
The next couple of weeks i will stay focused on improving the C compiler and the test suite until i am confident that changes in the hardware design can be properly checked for regressions. And because compiler writing and especially code generation is fun I'll probably write about some interesting finds along the way.Then I'll probably focus on simplifying the CPU design to free up hardware resources. At that point i also hope to start on properly documenting the design as currently it is a bit of a mess (like this rambling blog ☺️). After that I'll probably start on the difficult stuff, i.e improving the UART, but we'll see.
Tuesday, 28 January 2020
Progress
Even though the C compiler isn't anywhere near completion, it does now produce code that might be ugly but is capable of producing good enough assembly.
The goal of this all is to create a proper test suite of executable programs that can be used to check if future changes in the cpu still perform as designed.
Compiler status
The compiler now supports most control structures (for, while, if/else, break, continue, return) except switch.
It supports char and int data types including pointers and arrays but not yet structs or unions. Some work to support floats is underway (see below).
Variables can be automatic (local) or static (file scope).
Most unary and binary operators are supported including the ternary ?: operator, pointer dereferencing (*) and function calls, but not the address of operator (&).
Type checking however is weak (almost non existent to be honest 😁) and the assembly code it produces is far from optimal but it works. Storage specifiers like static and volatile are completely ignored.
Implemented functions
The implementation is done from scratch and of course targeted at just the Robin SoC, which makes life a lot easier because a full blown portable libc is humongous.
The current status (with links) is shown below; more functions will probably follow soon, especially low level functions to implement a (bare bones) soft float library.
From string.h
strlen.c
strchr.c
strreverse.c
From stdio.h
putchar.c
print.c (this one is not actually in libc, it just prints a string)
From stdlib.h
atoi.c
ftoi.c
itoa.c
itof.c
Conclusion
These functions need to be thoroughly tested before they can actually be used as a proper test suite for the hardware but I feel we have started quite well.
Wednesday, 22 January 2020
Implementing right shift with left shift
But with left shift available we can now also implement right shift because for a 32 bit register, right shift by N positions can be interpreted as a left shift by 32 - N positions and than looking at the upper 32 bits. This is visualized below
The verilog code needs to be changed only a little bit:
wire shiftq = op[4:0] == 12; // true if operaration is shift left
wire shiftqr = op[4:0] == 13; // true if operaration is shift right
wire doshift = shiftq | shiftqr;
wire [5:0] invertshift = 6'd32 - {1'b0,b[4:0]};
wire [4:0] nshift = shiftqr ? invertshift[4:0] : b[4:0];
wire shiftlo = doshift & ~nshift[4]; // true if shifting < 16 bits
wire shifthi = doshift & nshift[4]; // true if shifting >= 16 bits
...
// 4 16x16 bit partial multiplications
// the multiplier is either the b operand or a power of two for a shift
// note that b[31:16] for shift operations [31-0] is always zero
// so when shiftlo is true al_bh and ah_bh still result in zero
// the same is not true the other way around hence the extra shiftq check
// note that the behavior is undefined for shifts > 31
wire [31:0] mult_al_bl = a[15: 0] * (shiftlo ? shiftla16 : doshift ? 16'b0 : b[15: 0]);
wire [31:0] mult_al_bh = a[15: 0] * (shifthi ? shiftla16 : b[31:16]);
wire [31:0] mult_ah_bl = a[31:16] * (shiftlo ? shiftla16 : doshift ? 16'b0 : b[15: 0]);
wire [31:0] mult_ah_bh = a[31:16] * (shifthi ? shiftla16 : b[31:16]);
...
assign result =
...
shiftq ? {1'b0, mult64[31:0]} :
shiftqr ? {1'b0, mult64[63:32]} :
...
;
The only thing we do here is subtracting the number of positions to shift from 32 if we are dealing with a shift right instruction and also swap in the correct arguments for the multiplication for both the left and the right shift operation. Also, when selecting the final result we take care of selecting the uppermost 32 bits fro the right shift where a left shift would select the lower 32 bits.
Tuesday, 21 January 2020
Turning things around: Implementing shift instructions using multiplications
Turning things around
The fun bit is that we can reuse the multiplication units here if we convert the variable shift amount into a power of two. Because calculating the power of two is simply setting a single bit in an otherwise empty register, this takes far less resources.
The verilog code for this part of the ALU is shown below (ALU code ob GitHub)
// first part: calculate a power of two
wire shiftq = op[4:0] == 12; // true if operaration is shift left
wire shiftlo = shiftq & ~b[4]; // true if shifting < 16 bits
wire shifthi = shiftq & b[4]; // true if shifting >= 16 bits
// determine power of two
wire shiftla0 = b[3:0] == 4'd0; // 2^0 = 1
wire shiftla1 = b[3:0] == 4'd1; // 2^1 = 2
wire shiftla2 = b[3:0] == 4'd2; // 2^2 = 3
wire shiftla3 = b[3:0] == 4'd3; // ... etc
...
wire shiftla15 = b[3:0] == 4'd15;
// combine into 16 bit word
wire [15:0] shiftla16 = {shiftla15,shiftla14,shiftla13,shiftla12,
shiftla11,shiftla10,shiftla9 ,shiftla8 ,
shiftla7 ,shiftla6 ,shiftla5 ,shiftla4 ,
shiftla3 ,shiftla2 ,shiftla1 ,shiftla0};
// second part: reusing the multiplication code
// 4 16x16 bit partial multiplications
// the multiplier is either the b operand or a power of two for a shift
// note that b[31:16] for shift operations [31-0] is always zero
// so when shiftlo is true al_bh and ah_bh still result in zero
// the same is not true the other way around hence the extra shiftq check
// note that the behavior is undefined for shifts > 31
wire [31:0] mult_al_bl = a[15: 0] * (shiftlo ? shiftla16 : shiftq ? 16'b0 : b[15: 0]);
wire [31:0] mult_al_bh = a[15: 0] * (shifthi ? shiftla16 : b[31:16]);
wire [31:0] mult_ah_bl = a[31:16] * (shiftlo ? shiftla16 : shiftq ? 16'b0 : b[15: 0]);
wire [31:0] mult_ah_bh = a[31:16] * (shifthi ? shiftla16 : b[31:16]);
// combine the intermediate results into a 64 bit result
wire [63:0] mult64 = {32'b0,mult_al_bl} + {16'b0,mult_al_bh,16'b0}
+ {16'b0,mult_ah_bl,16'b0} + {mult_ah_bh,32'b0};
// final part: compute the result of the whole ALU
wire [32:0] result;
assign result =
op[4:0] == 0 ? add :
op[4:0] == 1 ? adc :
...
shiftq ? {1'b0, mult64[31:0]} :
...
;
The first half constructs rather than computes the power of two by creating a single 16 bit word with just a single bit set.
The second half selects the proper multiplier parts based on the instruction (regular multiplication or shift left)
The final part is about returning the result: it will be in the lower 32 bits of the combined results. Note that shifting by 32 bits should return zero but selecting for this explicit situation will add more LUTs to my design than I have currently available (using 5181 out of 5280). So for this implementation the behavior for shifts outside the range [0-31] is not defined.
Implementation notes
The code is simple because we do not need all multiplication and addition steps of a full 32 x 32 bit multiplication because if a number is a power of two, only one of the two 16 bits of the multiplier will be non zero (for shift amounts < 32).Multiplying two 32 bit numbers involves four 16 bit multiplications (of each combination of the 16 bit halves of the multiplier and multiplicand). The four intermediate 32 bit results are then added to a 64 bit result.
If one of the halves of the multiplier is zero then two multiplication steps are no longer necessary as their result will be zero and the corresponding addition steps will be redundant too.
LUT Usage
Just to give some idea about the resources used by a barrel shifter vs. this multiplication based implementation I have created bare bone implementations (shiftleft.v and shiftleft2.v) and checked those with yosys/next-pnr.
shiftleft.v (barrel) | shiftleft2.v (multiplier) | |
---|---|---|
ICESTORM_LC | 199 | 67 |
ICESTORM_DSP | 0 | 3 |
Sunday, 19 January 2020
seteq and setne instructions
The reason for this is because I have not implemented any convenient instruction to convert a non-zero value to one. So the code for the return statement in the code below
void showcase_compare(int a){
return a == 42;
}
is converted to the assembly snippet show below (a is R2, 42 in r3)
load flags,#alu_cmp ; binop(==)
alu r2,r3,r2
beq post_0003 ; equal
move r2,0,0
bra post_0004
post_0003:
move r2,0,1
post_0004:
So in order to get a proper one or zero we always have to branch.
Seteq and setne
To prevent this kind of unnecessary branching I added two new instructions to the Robin cpu: seteq and setne that set the contents of a register to either zero or one depending on the zero flag. The compiler can now use these instructions to simplify the code to:
load flags,#alu_cmp ; binop(==)
alu r2,r3,r2
seteq r2
This saves not only 3 instructions in code size, but also 2 or 3 instructions being executed (2 if equal, 3 if not equal).Setpos and setmin
To complete the set and make it easier to produce code for the < <= > and >= operators the setpos and setmin instructions are also implemented.
Thursday, 16 January 2020
Additional instructions
Long branch
So I expanded the instruction set to take a full 32-bit signed offset. If the 8bit offset is zero, the next 4 bytes will be used as a the offset. The complete instruction now looks like this:[15:12] opcode (13) [11: 8] condition [ 7: 0] offset Optional: 4 bytes offset (if offset == 0)
The condition is used to check against the flags register. The highest bit of the condition determines if a flag should be set or unset and because bit 31 of the flags register is always 1 we even have an option for an unconditional branch (or even to never take the branch, which is rather useless)
if cond[2:0] & R13[31:29] == cond[3] then PC += offset ? offset : (PC)
Bit 30 and 29 of the flags register are the negative (sign) and zero bit respectively.
Stack instructions
[15:12] opcode (15) [11: 8] register [ 7: 0] 1 = pop, 2 = push
Verilog observations
There are a few options though: until now i have been using next-pnr's heap placer which is quite fast (just a few seconds on my machine). The sa placer however is much slower (more than 60 seconds) but also generates a result that saves me about 250 LUTs!
The second option is to play around with the numerical values of the state labels. This may sound weird but the current implementation of the cpu has 29 states, i.e. a 5 bit state register. If i number them consecutively from 0 - 28 yosys uses more LUTs than when I assign the last state the number 31. Apparently the huge multiplexer generated for this state machine benefits from gaps in the list of possible states.
In the end I intend to simplify and optimise this design but for now I stick with the sa placer.
Sunday, 12 January 2020
Compiler
I probably should call it a compiler for a 'C-like language' because it implements a tiny subset of C, just enough to implement some basic functions. Currently it supports int and char as well as pointers and you can define and call functions. Control structures are limited to while, if/else and return but quite a few binary and unary operators have been implemented already.
Because the compiler is based on the pycparser module that can recognize the full C99 spec it will be rather straight forward to implement missing features.
Pain points
Even for the small string manipulation functions it quickly becomes clear that additional instructions for the CPU would be welcome. The biggest benefit would probably be to have:
- Conditional branch instructions with a larger offset than just one byte.
- Pop/push instructions.
Currently implemented as two instructions, one to change the stack pointer and another to load or store the register. This approach makes it possible to use any register as a stack pointer but for compiled c we need just one.
- Better byte loading.
If we load a byte into a register we often have to zero it out before load it. This way we can easily change just the lower byte of the flags register but otherwise it is less convenient.
- alu operation to convert an int to a boolean
This would greatly reduce the overhead in expressiins involving && and ||
Plenty of room for improvement here 😁
Saturday, 4 January 2020
More memory: spram
No less than 128 Kbytes are provided and although it is a little bit unclear to me at the moment how fast they are, theY seem to function quite well with two clock cycle delay, so I can integrate them with my current design without a any changes to the CPU.
Implementation
The 128 Kbytes are provided as four blocks, each 16k x 16bits. As far as I know Yosys does not yet offer automatic inference, which means we have to use the iCE40 primitives directly. This may sound complicated but it is not as hard as it sounds.The blocks take a 14 bit address (i.e. can address 16K words) and will read or write 16 bits at the time. Because we are interested in 8 bit bytes rather than words we need to make sure we return or write either the upper half or the lower half of a word depending on the address. For reading this means simply selecting, for writing this means setting a writemask that will limit which bits of a 16 bit word are actually written on receiving a write enable signal. Such a write mask itself is not 16 bit wide but just 4: 1 bit for each 4 bit nibble. We make this selection based on bit 14.
Code
The code below (GitHub) shows the implementation details. We use all four SB_SPRAM256KA blocks available on the up5k and use the top two bits of the 17 bit address to select a block. Bit 14 is then used to calculate the write mask (called nibble mask here). The same nibble mask is also used to select either the high or low byte from any 16 bit word we read from any of the four blocks. Note that our module's input data (wdata) is a byte but we always write 16 bit words. To this end we simply double the incoming byte; whether we actually write to high or low byte is determined by the write mask we construct and pass to the .MASKWREN input of the blocks.
// byte addressable spram
// uses all 128MB
module spram (
input clk,
input wen,
input [16:0] addr,
input [7:0] wdata,
output [7:0] rdata
);
wire cs_0 = addr[16:15] == 0;
wire cs_1 = addr[16:15] == 1;
wire cs_2 = addr[16:15] == 2;
wire cs_3 = addr[16:15] == 3;
wire nibble_mask_hi = addr[14];
wire nibble_mask_lo = !addr[14];
wire [15:0] wdata16 = {wdata, wdata};
wire [15:0] rdata_0,rdata_1,rdata_2,rdata_3;
wire [7:0] rdata_0b = nibble_mask_hi ? rdata_0[15:8] : rdata_0[7:0];
wire [7:0] rdata_1b = nibble_mask_hi ? rdata_1[15:8] : rdata_1[7:0];
wire [7:0] rdata_2b = nibble_mask_hi ? rdata_2[15:8] : rdata_2[7:0];
wire [7:0] rdata_3b = nibble_mask_hi ? rdata_3[15:8] : rdata_3[7:0];
assign rdata = cs_0 ? rdata_0b : cs_1 ? rdata_1b : cs_2 ? rdata_2b : rdata_3b;
SB_SPRAM256KA ram0
(
.ADDRESS(addr[13:0]),
.DATAIN(wdata16),
.MASKWREN({nibble_mask_hi, nibble_mask_hi, nibble_mask_lo, nibble_mask_lo}),
.WREN(wen),
.CHIPSELECT(cs_0),
.CLOCK(clk),
.STANDBY(1'b0),
.SLEEP(1'b0),
.POWEROFF(1'b1),
.DATAOUT(rdata_0)
);
SB_SPRAM256KA ram1
(
.ADDRESS(addr[13:0]),
.DATAIN(wdata16),
.MASKWREN({nibble_mask_hi, nibble_mask_hi, nibble_mask_lo, nibble_mask_lo}),
.WREN(wen),
.CHIPSELECT(cs_1),
.CLOCK(clk),
.STANDBY(1'b0),
.SLEEP(1'b0),
.POWEROFF(1'b1),
.DATAOUT(rdata_1)
);
SB_SPRAM256KA ram2
(
.ADDRESS(addr[13:0]),
.DATAIN(wdata16),
.MASKWREN({nibble_mask_hi, nibble_mask_hi, nibble_mask_lo, nibble_mask_lo}),
.WREN(wen),
.CHIPSELECT(cs_2),
.CLOCK(clk),
.STANDBY(1'b0),
.SLEEP(1'b0),
.POWEROFF(1'b1),
.DATAOUT(rdata_2)
);
SB_SPRAM256KA ram3
(
.ADDRESS(addr[13:0]),
.DATAIN(wdata16),
.MASKWREN({nibble_mask_hi, nibble_mask_hi, nibble_mask_lo, nibble_mask_lo}),
.WREN(wen),
.CHIPSELECT(cs_3),
.CLOCK(clk),
.STANDBY(1'b0),
.SLEEP(1'b0),
.POWEROFF(1'b1),
.DATAOUT(rdata_3)
);
endmodule
Notes
Friday, 3 January 2020
Divider module
Calling the divider module
Therefore the divider module needs to be able to signal to the cpu that it is done (that is, that the output reflects the final result) and also needs to be told to start. The code snippet below shows how the main CPU state machine deals with those div_go and div_available signals when the alu operation signifies that the divider module should be used.
DECODE : begin
state <= EXECUTE;
if(alu_op[5]) div_go <= 1; // start the divider module if we have a divider operation
end
EXECUTE : begin
state <= WAIT;
div_go <= 0;
case(cmd)
CMD_MOVEP: begin
if(writable_destination) r[R2] <= sumr1r0;
end
CMD_ALU: begin
if(~alu_op[5]) begin // regular alu operation (single cycle)
if(writable_destination) r[R2] <= alu_c;
r[13][28] <= alu_carry_out;
r[13][29] <= alu_is_zero;
r[13][30] <= alu_is_negative;
end else begin // divider operation (multiple cycles)
if(div_is_available) begin
if(writable_destination) r[R2] <= div_c;
r[13][29] <= div_is_zero;
r[13][30] <= div_is_negative;
end else
state <= EXECUTE;
end
end
Divider module implementation
module divider(
input clk,
input reset,
input [31:0] a,
input [31:0] b,
input go,
input divs,
input remainder,
output [31:0] c,
output is_zero,
output is_negative,
output reg available
);
localparam DIV_SHIFTL = 2'd0;
localparam DIV_SUBTRACT = 2'd1;
localparam DIV_AVAILABLE = 2'd2;
localparam DIV_DONE = 2'd3;
reg [1:0] step;
reg [32:0] dividend;
reg [32:0] divisor;
reg [32:0] quotient, quotient_part;
wire overshoot = divisor > dividend;
wire division_by_zero = (b == 0);
// for signed division the sign of the remainder is always equal
// to the sign of the dividend (a) while the sign of the quotient
// is equal to the product of the sign of dividend and divisor
// this to keep the following realation true
// quotient * divisor + remainder == dividend
wire signq = a[31] ^ b[31];
wire sign = remainder ? a[31] : signq ;
reg [31:0] result;
wire [31:0] abs_a = a[31] ? -a : a;
wire [31:0] abs_b = b[31] ? -b : b;
always @(posedge clk) begin
if(go) begin
// on receiving the go signal we initializer all registers
// we take care of taking the absolute values for
// dividend and divisor. We skip any calculations of a
// quotient if the divisor is zero.
step <= division_by_zero ? DIV_AVAILABLE : DIV_SHIFTL;
available <= 0;
dividend <= divs ? {1'b0, abs_a} : {1'b0, a};
divisor <= divs ? {1'b0, abs_b} : {1'b0, b};
quotient <= 0;
quotient_part <= 1;
end else
case(step)
// as long as the divisor is smaller than the dividend
// we multiply the divisor and the quotient_part by 2
// If no longer true, we correct by shifting everything
// back. This means registers should by 33 bit instead
// of 32 to accommodate the shifts.
DIV_SHIFTL : begin
if(~overshoot) begin
divisor <= divisor << 1;
quotient_part <= quotient_part << 1;
end else begin
divisor <= divisor >> 1;
quotient_part <= quotient_part >> 1;
step <= DIV_SUBTRACT;
end
end
// the next state is all about subtracting the divisor
// if it is smaller than the dividend. If it is, we
// perform the subtraction and or in the quotient_part
// into the quotient. Then divisor and quotient_part
// are halved again until the quotient_part is zero, in
// which case we are done.
DIV_SUBTRACT: begin
if(quotient_part == 0)
step <= DIV_AVAILABLE;
else begin
if(~overshoot) begin
dividend <= dividend - divisor;
quotient <= quotient | quotient_part;
end
divisor <= divisor >> 1;
quotient_part <= quotient_part >> 1;
end
end
// we signal availability of the result (for one clock)
// to the cpu and set the result to the chosen option.
DIV_AVAILABLE: begin
step <= DIV_DONE;
available <= 1;
result <= remainder ? dividend[31:0] : quotient[31:0];
end
default : available <= 0;
endcase
end
// these wires make sure that the correct sign correction is applied
// and the relevant flags are returned.
assign c = divs ? (sign ? -result : result) : result;
assign is_zero = (c == 0);
assign is_negative = c[31];
endmodule
Performance test
Code availability
Thursday, 2 January 2020
Monitor program, reset button
To this end I wired a debounced signal from the iCEbreaker user button (the one near the usb connector) to the reset wire we already have in place for all the modules.
The debounce module was copied from Nandland and adapted to be be usable for negative logic buttons as well (the use button is high when not being pressed).
We don't want to fire reset_button events every clock cycle as long as the button is pressed so we keep some state and only generate a reset_button event when the state changes from not pressed to pressed. The code is committed to the repository and shown below. The idea is to make the other buttons on the iCEbreaker board available to the cpu via a memory mapped interface.
// button wires
wire user_button, button1, button2,button3;
debounce #(.INITIAL_STATE(1'b1)) debounce_ubutton(CLK, BTN_N, user_button);
reg reset_button = 0;
reg user_button_pressed = 0;
always @(posedge CLK) begin
reset_button <= 0;
if(~ user_button & ~user_button_pressed) begin // pressed (negative logic)
user_button_pressed <= 1;
reset_button <= 1; // a one clock strobe on pressing
end else begin
user_button_pressed <= 0;
end
end
The u_error signal from the UART is now integrated in the the reset logic which makes it also possible to reset things by sending a break over the serial connection. (the UART core used does no provide separate break signals but sets u_error on a received break)
// global reset active low, cleared after startup, set on serial break or user button press
reg reset = 0;
always @(posedge CLK) begin
reset <= 1;
if(u_error | reset_button) reset <= 0;
end
Wednesday, 1 January 2020
ALU
Nevertheless i present the current implementation as is (mainly to test the verilog syntax highlighting capabilities of highlight.js :-) )
module alu(
input [31:0] a,
input [31:0] b,
input carry_in,
input [7:0] op,
output [31:0] c,
output carry_out,
output is_zero,
output is_negative
);
wire [32:0] add = {0, a} + {0, b};
wire [32:0] adc = add + { 32'd0, carry_in};
wire [32:0] sub = {0, a} - {0, b};
wire [32:0] sbc = sub - { 32'd0, carry_in};
wire [32:0] b_and = {0, a & b};
wire [32:0] b_or = {0, a | b};
wire [32:0] b_xor = {0, a ^ b};
wire [32:0] b_not = {0,~a };
wire [32:0] extend = {a[31],a};
wire [32:0] min_a = -extend;
wire [32:0] cmp = sub[32] ? 33'h1ffff_ffff : sub == 0 ? 0 : 1;
wire [32:0] shiftl = {a[31:0],1'b0};
wire [32:0] shiftr = {a[0],1'b0,a[31:1]};
wire [31:0] mult_al_bl = a[15: 0] * b[15: 0];
wire [31:0] mult_al_bh = a[15: 0] * b[31:16];
wire [31:0] mult_ah_bl = a[31:16] * b[15: 0];
wire [31:0] mult_ah_bh = a[31:16] * b[31:16];
wire [63:0] mult64 = {32'b0,mult_al_bl} + {16'b0,mult_al_bh,16'b0}
+ {16'b0,mult_ah_bl,16'b0} + {mult_ah_bh,32'b0};
wire [32:0] result;
always @(*) begin
result= op == 0 ? add :
op == 1 ? adc :
op == 2 ? sub :
op == 3 ? sbc :
op == 4 ? b_or :
op == 5 ? b_and :
op == 6 ? b_not :
op == 7 ? b_xor :
op == 8 ? cmp :
op == 9 ? {1'b0, a} :
op == 12 ? shiftl :
op == 13 ? shiftr :
op == 16 ? {17'b0, mult_al_bl} :
op == 17 ? {1'b0, mult64[31:0]} :
op == 18 ? {1'b0, mult64[63:32]} :
33'b0;
end
assign c = result[31:0];
assign carry_out = result[32];
assign is_zero = (c == 0);
assign is_negative = c[31];
endmodule
Rotating blinkenlights
Now this is convenient when loading the alu operation into the lower byte of the flags register without clearing the flags but in most other situations I am starting to doubt this implementation decision. That is one thing I want to think about.
CPU design
The CPU design as currently implemented largely follows the diagram shown below. It features a 16 x 32bit register file and 16 bit instructi...