Sunday, 15 March 2020
Count leading zeros in verilog
It documents (with code) the implementation of a new module to count leading zeros.
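As a point of reference for what such a module computes, here is a minimal Python sketch of a count-leading-zeros function. The 32 bit width and the name `clz32` are assumptions for illustration, not taken from the Verilog module itself; a model like this could serve as the expected value in a test bench.

```python
def clz32(value: int) -> int:
    """Count the leading zero bits of a 32 bit value.

    Assumption: the operand is 32 bits wide and an all-zero
    input yields 32.
    """
    value &= 0xFFFFFFFF            # keep only the lower 32 bits
    return 32 - value.bit_length() # bit_length() is 0 for value == 0
```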
Saturday, 14 March 2020
Verilog test benches
for(i=7; i>=0 ; i--) begin
...
end
The simple solution of course was to replace i-- with i=i-1, but it is still not completely clear to me what the exact differences are in Verilog versions supported by Yosys and Icarus.
clz_tb: ../clz.v clz_tb.v
$(VERILOG) -o $@ $^ ; vvp $@ | awk "/FATAL/{exit(1)}"
Doubts about Verilog
I can't tell at this point if VHDL or other hardware description languages are any better but the longer I work with Verilog the more doubts I have: It does not clearly separate simulation from synthesis, its syntax (especially scoping rules for variables) is illogical, you can't define functions with more than one statement (at least not in verilog-2001) and every implementation is allowed to diverge from the standard by choosing to implement some features or not. I am not sure why people in the hardware world accept this; I couldn't imagine this happening to Python implementations for example.
Anyway, it works sort of, so we'll see where it gets us; maybe with a bit more experience it will be less awkward to work with.
Saturday, 7 March 2020
Optimizing the fetch decode execute cycle II
Because we know the opcode for any instruction already in the FETCH3 cycle we can set the mem_raddr register with the contents of the stackpointer if we are dealing with a pop instruction or keep on incrementing the mem_raddr for those instructions that are followed by some bytes after the instruction itself, like the two byte offset for the branch instruction and the four bytes of the load immediate instruction. And if we set the mem_raddr register two cycles earlier that means that we can actually read those bytes two cycles earlier as well.
This newly implemented scenario is summed up in the table below.
Some more opportunities
Saturday, 29 February 2020
The Robin SoC has a dedicated website now
It is implemented as a GitHub site, check it out from time to time as articles get added.
Sunday, 23 February 2020
The Robin SoC on the iCEbreaker: current status
Simplification
The ALU and the main decoding loop in the CPU were rather convoluted so both were redesigned a bit to improve readability of the Verilog code as well as reduce resource consumption. (The ALU code was updated in place, the CPU code got a new file)
Because by now I also have some experience with the code that is being generated by the compiler, I was able to remove unused instructions and ALU operations. Previously the pop, push and setxxx instructions were considered sub-instructions within one special opcode, now they are individual instructions (in case of pop and push) or rolled into a single set-and-branch instruction. The new instruction set architecture was highlighted in a separate article.
Less resources
All in all this redesign shrunk the number of LUTs consumed from 5279 (yes, just one LUT removed from 100%) to 4879 (92%), which is pretty neat because it leaves some room for additional functionality or tweaks. The biggest challenge by the way is Yosys: even slight changes in the design, like assigning different values to labels of a case statement that is not full, may result in a different number of LUTs consumed. This is something that needs some more research, maybe Yosys offers additional optimization options that let me get the lowest resource count in a more predictable manner.
Better testing
A significant amount of effort was spent on designing more and better regression tests. Both for the SoC and the support programs (assembler, simulator, ...) regression tests and syntax checkers were added. Most of these were also added to GitHub push actions, with the exception of the actual hardware tests because I cannot run those on GitHub. And of course this was mainly done to show a few green banners on the repository home page 😀
Bug fixes
With a better testing framework in place it is far easier to check whether changes don't inadvertently break something. This was put to work in fixing one of the more annoying bugs left in the ALU design: previously shift left by more than 31 and shift right by 0 did not give a proper result. This is now fixed.
Frustrations
The up5k on the iCEbreaker board has 8 dsp cores. We currently use 4 of them to implement 32x32 bit multiplication. The SB_MAC16 primitives we use for this are inferred by Yosys from some multiplication statements we use in the ALU (i.e. we do not instantiate them directly) and work fine.
However, when I want to instantiate some of them directly and configure them to be used as 32 bit adders these instantiations will still multiply instead of add! No matter what I do, the result stays the same. I have to admit I have no idea how Yosys infers stuff so it might very well be that my direct instantiation gets rewritten by some Yosys stage, so I will have to do some more research here.
What next?
I think next on the agenda is performance: I think I use too many read states for the fetch/decode/execute cycle. The Lattice technical documentation seems to imply we can read and write new data every clock cycle, at least for block ram. Unfortunately the docs for the SPRAM are less clear. Anyway, this area for sure needs some attention.
Saturday, 8 February 2020
A simulator for the Robin cpu
There exist of course tools to simulate verilog designs and even perform formal verification but my skill level is not quite up to that yet. On top of that I am convinced that many changes that I want to try out in the cpu would benefit from regression tests that are based on real code, i.e. code generated by a compiler instead of artificial tiny bits of code: code that you do not directly implement yourself tends to expose issues in the instruction set or bugs in its hardware implementation quicker than when you deliberately try to construct tiny test cases.
For these more realistic tests an assembler and a C compiler were created and they were used to implement small string and floating point libraries mimicking some of the functions in the C standard library. And they proved their worth as they uncovered, among other things, bugs in the handling of conditional branches.
However, as we will use the assembler and compiler to perform regression tests on the cpu it is important that these tools themselves are as bug free as possible, even when we add new functionality or change implementation details. Ideally some continuous integration would be implemented using GitHub actions that would be triggered on every push.
There is one catch here though: we cannot perform the final test in our chain of dependencies simply because the GitHub machines do not have an iCEbreaker board attached ☺️
We can deal with this challenge by creating a program that will simulate the cpu we have implemented on our fpga. This way we should be able to perform the tests for the compiler/assembler toolchain against this simulator with the added benefit of having more debugging options available (because they are much easier to implement in a bit of Python than in our resource constrained hardware).
The first version of this simulator is now committed and I hope to create some continuous integration actions in the near future.
Tuesday, 21 January 2020
Turning things around: Implementing shift instructions using multiplications
Turning things around
The fun bit is that we can reuse the multiplication units here if we convert the variable shift amount into a power of two. Because calculating the power of two is simply setting a single bit in an otherwise empty register, this takes far less resources.
The verilog code for this part of the ALU is shown below (ALU code on GitHub)
// first part: calculate a power of two
wire shiftq = op[4:0] == 12; // true if operation is shift left
wire shiftlo = shiftq & ~b[4]; // true if shifting < 16 bits
wire shifthi = shiftq & b[4]; // true if shifting >= 16 bits
// determine power of two
wire shiftla0 = b[3:0] == 4'd0; // 2^0 = 1
wire shiftla1 = b[3:0] == 4'd1; // 2^1 = 2
wire shiftla2 = b[3:0] == 4'd2; // 2^2 = 4
wire shiftla3 = b[3:0] == 4'd3; // ... etc
...
wire shiftla15 = b[3:0] == 4'd15;
// combine into 16 bit word
wire [15:0] shiftla16 = {shiftla15,shiftla14,shiftla13,shiftla12,
shiftla11,shiftla10,shiftla9 ,shiftla8 ,
shiftla7 ,shiftla6 ,shiftla5 ,shiftla4 ,
shiftla3 ,shiftla2 ,shiftla1 ,shiftla0};
// second part: reusing the multiplication code
// 4 16x16 bit partial multiplications
// the multiplier is either the b operand or a power of two for a shift
// note that b[31:16] for shift operations [31-0] is always zero
// so when shiftlo is true al_bh and ah_bh still result in zero
// the same is not true the other way around hence the extra shiftq check
// note that the behavior is undefined for shifts > 31
wire [31:0] mult_al_bl = a[15: 0] * (shiftlo ? shiftla16 : shiftq ? 16'b0 : b[15: 0]);
wire [31:0] mult_al_bh = a[15: 0] * (shifthi ? shiftla16 : b[31:16]);
wire [31:0] mult_ah_bl = a[31:16] * (shiftlo ? shiftla16 : shiftq ? 16'b0 : b[15: 0]);
wire [31:0] mult_ah_bh = a[31:16] * (shifthi ? shiftla16 : b[31:16]);
// combine the intermediate results into a 64 bit result
wire [63:0] mult64 = {32'b0,mult_al_bl} + {16'b0,mult_al_bh,16'b0}
+ {16'b0,mult_ah_bl,16'b0} + {mult_ah_bh,32'b0};
// final part: compute the result of the whole ALU
wire [32:0] result;
assign result =
op[4:0] == 0 ? add :
op[4:0] == 1 ? adc :
...
shiftq ? {1'b0, mult64[31:0]} :
...
;
The first half constructs rather than computes the power of two by creating a single 16 bit word with just a single bit set.
The second half selects the proper multiplier parts based on the instruction (regular multiplication or shift left)
The final part is about returning the result: it will be in the lower 32 bits of the combined results. Note that shifting by 32 bits should return zero but selecting for this explicit situation will add more LUTs to my design than I have currently available (using 5181 out of 5280). So for this implementation the behavior for shifts outside the range [0-31] is not defined.
Implementation notes
The code is simple because we do not need all multiplication and addition steps of a full 32 x 32 bit multiplication. Multiplying two 32 bit numbers involves four 16 bit multiplications (one for each combination of the 16 bit halves of the multiplier and multiplicand). The four intermediate 32 bit results are then added to a 64 bit result. If the multiplier is a power of two, however, only one of its two 16 bit halves will be non zero (for shift amounts < 32).
If one of the halves of the multiplier is zero then two multiplication steps are no longer necessary as their result will be zero and the corresponding addition steps will be redundant too.
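The arithmetic behind this can be checked with a small Python model. It is only a sketch of the idea, not the ALU itself: the function name `shift_left_via_mult` is made up for illustration, and it assumes shift amounts in the range 0 to 31, the only range for which the behavior is defined.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF

def shift_left_via_mult(a: int, b: int) -> int:
    """Model a << b (0 <= b <= 31) as a multiplication by a power of two,
    built from four 16x16 bit partial products."""
    a &= MASK32
    one_hot = 1 << (b & 0xF)            # 16 bit word with a single bit set (b[3:0])
    high = (b >> 4) & 1                 # b[4]: shift amount >= 16
    mult_lo = 0 if high else one_hot    # low half of the multiplier
    mult_hi = one_hot if high else 0    # high half of the multiplier
    al, ah = a & MASK16, a >> 16
    # one of the two multiplier halves is always zero, so two of
    # these four products are zero for any given shift amount
    al_bl = al * mult_lo
    al_bh = al * mult_hi
    ah_bl = ah * mult_lo
    ah_bh = ah * mult_hi
    mult64 = al_bl + (al_bh << 16) + (ah_bl << 16) + (ah_bh << 32)
    return mult64 & MASK32              # the shift result is the lower 32 bits
```

For every shift amount from 0 to 31 this should agree with `(a << b) & 0xFFFFFFFF`, which is an easy property to put in a regression test.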
LUT Usage
Just to give some idea about the resources used by a barrel shifter vs. this multiplication based implementation I have created bare-bones implementations (shiftleft.v and shiftleft2.v) and checked those with yosys/nextpnr.
| | shiftleft.v (barrel) | shiftleft2.v (multiplier) |
|---|---|---|
| ICESTORM_LC | 199 | 67 |
| ICESTORM_DSP | 0 | 3 |
Thursday, 16 January 2020
Additional instructions
Long branch
So I expanded the instruction set to take a full 32-bit signed offset. If the 8-bit offset is zero, the next 4 bytes will be used as the offset. The complete instruction now looks like this:
[15:12] opcode (13)
[11: 8] condition
[ 7: 0] offset
Optional: 4 bytes offset (if offset == 0)
The condition is used to check against the flags register. The highest bit of the condition determines if a flag should be set or unset and because bit 31 of the flags register is always 1 we even have an option for an unconditional branch (or even to never take the branch, which is rather useless)
if cond[2:0] & R13[31:29] == cond[3] then PC += offset ? offset : (PC)
Bit 30 and 29 of the flags register are the negative (sign) and zero bit respectively.
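To make the condition check concrete, here is a small Python sketch of one reading of the rule above. The interpretation is an assumption: the lower three condition bits select flags 31, 30 and 29, and the branch is taken when "at least one selected flag is set" matches the highest condition bit. The function name `branch_taken` is hypothetical.

```python
def branch_taken(cond: int, flags: int) -> bool:
    """Decide whether a branch is taken.

    cond  : 4 bit condition field from the instruction
    flags : 32 bit flags register (R13); bit 31 is always 1,
            bit 30 is negative, bit 29 is zero

    Assumed interpretation: cond[2:0] masks flags[31:29] and
    cond[3] says whether a selected flag must be set or unset.
    """
    selected = (cond & 0b111) & ((flags >> 29) & 0b111)
    want_set = bool(cond & 0b1000)
    return (selected != 0) == want_set
```

Under this reading a condition of 0b1100 always branches (bit 31 is always 1), 0b0100 never branches, 0b1001 branches when the zero flag is set and 0b0001 branches when it is not.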
Stack instructions
[15:12] opcode (15)
[11: 8] register
[ 7: 0] 1 = pop, 2 = push
Verilog observations
There are a few options though: until now I have been using nextpnr's heap placer which is quite fast (just a few seconds on my machine). The sa placer however is much slower (more than 60 seconds) but also generates a result that saves me about 250 LUTs!
The second option is to play around with the numerical values of the state labels. This may sound weird but the current implementation of the cpu has 29 states, i.e. a 5 bit state register. If I number them consecutively from 0 - 28 yosys uses more LUTs than when I assign the last state the number 31. Apparently the huge multiplexer generated for this state machine benefits from gaps in the list of possible states.
In the end I intend to simplify and optimise this design but for now I stick with the sa placer.
Friday, 3 January 2020
Divider module
Calling the divider module
Therefore the divider module needs to be able to signal to the cpu that it is done (that is, that the output reflects the final result) and also needs to be told to start. The code snippet below shows how the main CPU state machine deals with those div_go and div_is_available signals when the alu operation signifies that the divider module should be used.
DECODE : begin
state <= EXECUTE;
if(alu_op[5]) div_go <= 1; // start the divider module if we have a divider operation
end
EXECUTE : begin
state <= WAIT;
div_go <= 0;
case(cmd)
CMD_MOVEP: begin
if(writable_destination) r[R2] <= sumr1r0;
end
CMD_ALU: begin
if(~alu_op[5]) begin // regular alu operation (single cycle)
if(writable_destination) r[R2] <= alu_c;
r[13][28] <= alu_carry_out;
r[13][29] <= alu_is_zero;
r[13][30] <= alu_is_negative;
end else begin // divider operation (multiple cycles)
if(div_is_available) begin
if(writable_destination) r[R2] <= div_c;
r[13][29] <= div_is_zero;
r[13][30] <= div_is_negative;
end else
state <= EXECUTE;
end
end
Divider module implementation
module divider(
input clk,
input reset,
input [31:0] a,
input [31:0] b,
input go,
input divs,
input remainder,
output [31:0] c,
output is_zero,
output is_negative,
output reg available
);
localparam DIV_SHIFTL = 2'd0;
localparam DIV_SUBTRACT = 2'd1;
localparam DIV_AVAILABLE = 2'd2;
localparam DIV_DONE = 2'd3;
reg [1:0] step;
reg [32:0] dividend;
reg [32:0] divisor;
reg [32:0] quotient, quotient_part;
wire overshoot = divisor > dividend;
wire division_by_zero = (b == 0);
// for signed division the sign of the remainder is always equal
// to the sign of the dividend (a) while the sign of the quotient
// is equal to the product of the sign of dividend and divisor
// this to keep the following relation true
// quotient * divisor + remainder == dividend
wire signq = a[31] ^ b[31];
wire sign = remainder ? a[31] : signq ;
reg [31:0] result;
wire [31:0] abs_a = a[31] ? -a : a;
wire [31:0] abs_b = b[31] ? -b : b;
always @(posedge clk) begin
if(go) begin
// on receiving the go signal we initialize all registers
// we take care of taking the absolute values for
// dividend and divisor. We skip any calculations of a
// quotient if the divisor is zero.
step <= division_by_zero ? DIV_AVAILABLE : DIV_SHIFTL;
available <= 0;
dividend <= divs ? {1'b0, abs_a} : {1'b0, a};
divisor <= divs ? {1'b0, abs_b} : {1'b0, b};
quotient <= 0;
quotient_part <= 1;
end else
case(step)
// as long as the divisor is smaller than the dividend
// we multiply the divisor and the quotient_part by 2
// If no longer true, we correct by shifting everything
// back. This means registers should be 33 bit instead
// of 32 to accommodate the shifts.
DIV_SHIFTL : begin
if(~overshoot) begin
divisor <= divisor << 1;
quotient_part <= quotient_part << 1;
end else begin
divisor <= divisor >> 1;
quotient_part <= quotient_part >> 1;
step <= DIV_SUBTRACT;
end
end
// the next state is all about subtracting the divisor
// if it is smaller than the dividend. If it is, we
// perform the subtraction and or in the quotient_part
// into the quotient. Then divisor and quotient_part
// are halved again until the quotient_part is zero, in
// which case we are done.
DIV_SUBTRACT: begin
if(quotient_part == 0)
step <= DIV_AVAILABLE;
else begin
if(~overshoot) begin
dividend <= dividend - divisor;
quotient <= quotient | quotient_part;
end
divisor <= divisor >> 1;
quotient_part <= quotient_part >> 1;
end
end
// we signal availability of the result (for one clock)
// to the cpu and set the result to the chosen option.
DIV_AVAILABLE: begin
step <= DIV_DONE;
available <= 1;
result <= remainder ? dividend[31:0] : quotient[31:0];
end
default : available <= 0;
endcase
end
// these wires make sure that the correct sign correction is applied
// and the relevant flags are returned.
assign c = divs ? (sign ? -result : result) : result;
assign is_zero = (c == 0);
assign is_negative = c[31];
endmodule
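The algorithm in the module can be followed more easily in a software model. The Python sketch below mirrors the two main states (shift left, then subtract) as plain loops; it is an illustration written for this purpose, the function name `divide` and its keyword arguments are made up, and it does not model the clock cycles or the go/available handshake.

```python
MASK32 = 0xFFFFFFFF
MASK33 = 0x1FFFFFFFF

def divide(a: int, b: int, signed: bool = False, want_remainder: bool = False) -> int:
    """Shift-and-subtract division on 32 bit operands.

    Returns the quotient (or the remainder) as a 32 bit pattern.
    Division by zero skips the calculation: quotient 0, remainder a.
    """
    a &= MASK32
    b &= MASK32
    neg_a = signed and bool(a >> 31)
    neg_b = signed and bool(b >> 31)
    dividend = (-a & MASK32) if neg_a else a   # absolute values for signed division
    divisor = (-b & MASK32) if neg_b else b
    quotient, part = 0, 1
    if b != 0:
        # DIV_SHIFTL: double the divisor until it overshoots the dividend
        while divisor <= dividend:
            divisor = (divisor << 1) & MASK33  # 33 bits accommodate the last shift
            part = (part << 1) & MASK33
        divisor >>= 1                          # correct the overshoot
        part >>= 1
        # DIV_SUBTRACT: subtract where possible, halving on every step
        while part:
            if divisor <= dividend:
                dividend -= divisor
                quotient |= part
            divisor >>= 1
            part >>= 1
    result = (dividend if want_remainder else quotient) & MASK32
    # the remainder takes the sign of the dividend, the quotient the product of the signs
    negate = neg_a if want_remainder else (neg_a != neg_b)
    return (-result & MASK32) if negate else result
```

For example, `divide(100, 7)` gives 14 and `divide(100, 7, want_remainder=True)` gives 2, so quotient * divisor + remainder == dividend holds.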
Performance test
Code availability
Wednesday, 1 January 2020
ALU
Nevertheless I present the current implementation as is (mainly to test the verilog syntax highlighting capabilities of highlight.js :-) )
module alu(
input [31:0] a,
input [31:0] b,
input carry_in,
input [7:0] op,
output [31:0] c,
output carry_out,
output is_zero,
output is_negative
);
// constants inside a concatenation need an explicit size, hence 1'b0
wire [32:0] add = {1'b0, a} + {1'b0, b};
wire [32:0] adc = add + { 32'd0, carry_in};
wire [32:0] sub = {1'b0, a} - {1'b0, b};
wire [32:0] sbc = sub - { 32'd0, carry_in};
wire [32:0] b_and = {1'b0, a & b};
wire [32:0] b_or = {1'b0, a | b};
wire [32:0] b_xor = {1'b0, a ^ b};
wire [32:0] b_not = {1'b0,~a };
wire [32:0] extend = {a[31],a};
wire [32:0] min_a = -extend;
wire [32:0] cmp = sub[32] ? 33'h1ffff_ffff : sub == 0 ? 0 : 1;
wire [32:0] shiftl = {a[31:0],1'b0};
wire [32:0] shiftr = {a[0],1'b0,a[31:1]};
wire [31:0] mult_al_bl = a[15: 0] * b[15: 0];
wire [31:0] mult_al_bh = a[15: 0] * b[31:16];
wire [31:0] mult_ah_bl = a[31:16] * b[15: 0];
wire [31:0] mult_ah_bh = a[31:16] * b[31:16];
wire [63:0] mult64 = {32'b0,mult_al_bl} + {16'b0,mult_al_bh,16'b0}
+ {16'b0,mult_ah_bl,16'b0} + {mult_ah_bh,32'b0};
reg [32:0] result; // assigned inside an always block, so it must be a reg
always @(*) begin
result= op == 0 ? add :
op == 1 ? adc :
op == 2 ? sub :
op == 3 ? sbc :
op == 4 ? b_or :
op == 5 ? b_and :
op == 6 ? b_not :
op == 7 ? b_xor :
op == 8 ? cmp :
op == 9 ? {1'b0, a} :
op == 12 ? shiftl :
op == 13 ? shiftr :
op == 16 ? {1'b0, mult_al_bl} :
op == 17 ? {1'b0, mult64[31:0]} :
op == 18 ? {1'b0, mult64[63:32]} :
33'b0;
end
assign c = result[31:0];
assign carry_out = result[32];
assign is_zero = (c == 0);
assign is_negative = c[31];
endmodule
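The role of the extra 33rd bit is perhaps easiest to see in a software model. The Python sketch below models only the add, sub and cmp results and the way the flags are derived from them; the function names are made up for illustration and the remaining operations are left out.

```python
MASK32 = 0xFFFFFFFF
MASK33 = 0x1FFFFFFFF

def alu_add(a: int, b: int) -> int:
    """33 bit sum of two 32 bit operands; bit 32 is the carry."""
    return ((a & MASK32) + (b & MASK32)) & MASK33

def alu_sub(a: int, b: int) -> int:
    """33 bit difference; bit 32 is set when a borrow occurred (a < b)."""
    return ((a & MASK32) - (b & MASK32)) & MASK33

def alu_cmp(a: int, b: int) -> int:
    """All ones if a < b, 0 if a == b, 1 if a > b (unsigned compare)."""
    sub = alu_sub(a, b)
    if sub >> 32:
        return MASK33
    return 0 if sub == 0 else 1

def split_result(result: int):
    """Split a 33 bit result into (c, carry_out, is_zero, is_negative)."""
    c = result & MASK32
    return c, result >> 32, c == 0, bool(c >> 31)
```

Adding 1 to 0xFFFFFFFF, for instance, gives a 32 bit result of zero with both the carry and the zero flag set.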
Rotating blinkenlights
Now this is convenient when loading the alu operation into the lower byte of the flags register without clearing the flags but in most other situations I am starting to doubt this implementation decision. That is one thing I want to think about.
Tuesday, 31 December 2019
CPU design
(The opcodes and alu operations implemented are documented in this sheet)
Address operations (basically adding any two registers) are done by a separate adder. The verilog implementation of the current cpu can be found in the GitHub repo (cpu.v, alu.v).
CPU design
The CPU design as currently implemented largely follows the diagram shown below. It features a 16 x 32bit register file and 16 bit instructi...





