Wednesday, 22 January 2020

Implementing right shift with left shift

In the previous article I showed an implementation of a left shift instruction that made use of multiplication instead of implementing the barrel shifter directly. Because on an iCEbreaker/up5k multiplication is fast but resources are scarce, this makes sense.

But with left shift available we can now also implement right shift because for a 32 bit register, right shift by N positions can be interpreted as a left shift by 32 - N positions and than looking at the upper 32 bits. This is visualized below


The verilog code needs to be changed only a little bit:



wire shiftq    = op[4:0] == 12;  // true if operaration is shift left
wire shiftqr   = op[4:0] == 13;  // true if operaration is shift right
wire doshift   = shiftq | shiftqr;
wire [5:0] invertshift = 6'd32 - {1'b0,b[4:0]};
wire [4:0] nshift = shiftqr ? invertshift[4:0] : b[4:0];
wire shiftlo   = doshift & ~nshift[4]; // true if shifting < 16 bits
wire shifthi   = doshift &  nshift[4]; // true if shifting >= 16 bits

...

// 4 16x16 bit partial multiplications
// the multiplier is either the b operand or a power of two for a shift
// note that b[31:16] for shift operations [31-0] is always zero
// so when shiftlo is true al_bh and ah_bh still result in zero
// the same is not true the other way around hence the extra shiftq check
// note that the behavior is undefined for shifts > 31
wire [31:0] mult_al_bl = a[15: 0] * (shiftlo ? shiftla16 : doshift ? 16'b0 : b[15: 0]);
wire [31:0] mult_al_bh = a[15: 0] * (shifthi ? shiftla16 : b[31:16]);
wire [31:0] mult_ah_bl = a[31:16] * (shiftlo ? shiftla16 : doshift ? 16'b0 : b[15: 0]);
wire [31:0] mult_ah_bh = a[31:16] * (shifthi ? shiftla16 : b[31:16]);

...

assign result = 
     ...
     shiftq  ? {1'b0, mult64[31:0]} :
     shiftqr ? {1'b0, mult64[63:32]} :
     ...
     ;

The only thing we do here is subtracting the number of positions to shift from 32 if we are dealing with a shift right instruction and also swap in the correct arguments for the multiplication for both the left and the right shift operation. Also, when selecting the final result we take care of selecting the uppermost 32 bits fro the right shift where a left shift would select the lower 32 bits.

Tuesday, 21 January 2020

Turning things around: Implementing shift instructions using multiplications

Often when cpu instruction sets lack a direct multiplication operation, people resort to implementing multiplication by using combinations of shift and add instructions. Even when a multiplication instruction is available, multiplication by simple powers of two might be faster when performed in a single shift operation that is executed in a single clock cycle than with a multiplication instruction that may take many cycles.

Implementing a shift instruction that can shift a 32 bit register by an arbitrary number of bits can consume a lot of resources though. On a the Lattice up5k i found that it could easily use hundreds of LUTs. (The exact number depends on various things other than register size because placement by next-pnr has some randomness and some additional LUTs might be consumed to meet fan-out and timing requirements, so a design size might change considerably even when changing just a few bits. The multiplexers alone already will consume 160 LUTs)

I didn't have that many resources left for my design so i either had to economize or devise a cunning plan 🙂

Turning things around

The up5k on the iCEbreaker board does have something the iCEstick hx1k didn't have: dsp cores, i.e. fast multipliers (at 12Mhz they operate in less than a clock cycle). In fact the up5k has eight dsp cores so i already implemented 32 x 32 bit multiplication using 4 of those, but I still want to have variable shift instructions because they might be needed in all sorts of bit twiddling operations used when implementing a soft floating point library for example.

The fun bit is that we can reuse the multiplication units here if we convert the variable shift amount into a power of two. Because calculating the power of two is simply setting a single bit in an otherwise empty register, this takes far less resources.

The verilog code for this part of the ALU is shown below (ALU code ob GitHub)

// first part: calculate a power of two
wire shiftq    = op[4:0] == 12;     // true if operaration is shift left
wire shiftlo   = shiftq & ~b[4];    // true if shifting < 16 bits
wire shifthi   = shiftq &  b[4];    // true if shifting >= 16 bits

// determine power of two
wire shiftla0  = b[3:0]  == 4'd0;   // 2^0 = 1
wire shiftla1  = b[3:0]  == 4'd1;   // 2^1 = 2
wire shiftla2  = b[3:0]  == 4'd2;   // 2^2 = 3
wire shiftla3  = b[3:0]  == 4'd3;   // ... etc 
...
wire shiftla15 = b[3:0]  == 4'd15;

// combine into 16 bit word
wire [15:0] shiftla16 = {shiftla15,shiftla14,shiftla13,shiftla12,
                         shiftla11,shiftla10,shiftla9 ,shiftla8 ,
                         shiftla7 ,shiftla6 ,shiftla5 ,shiftla4 ,
                         shiftla3 ,shiftla2 ,shiftla1 ,shiftla0};

// second part: reusing the multiplication code
// 4 16x16 bit partial multiplications
// the multiplier is either the b operand or a power of two for a shift
// note that b[31:16] for shift operations [31-0] is always zero
// so when shiftlo is true al_bh and ah_bh still result in zero
// the same is not true the other way around hence the extra shiftq check
// note that the behavior is undefined for shifts > 31

wire [31:0] mult_al_bl = a[15: 0] * (shiftlo ? shiftla16 : shiftq ? 16'b0 : b[15: 0]);
wire [31:0] mult_al_bh = a[15: 0] * (shifthi ? shiftla16 : b[31:16]);
wire [31:0] mult_ah_bl = a[31:16] * (shiftlo ? shiftla16 : shiftq ? 16'b0 : b[15: 0]);
wire [31:0] mult_ah_bh = a[31:16] * (shifthi ? shiftla16 : b[31:16]);

// combine the intermediate results into a 64 bit result
wire [63:0] mult64 = {32'b0,mult_al_bl} + {16'b0,mult_al_bh,16'b0}
                   + {16'b0,mult_ah_bl,16'b0} + {mult_ah_bh,32'b0};

// final part: compute the result of the whole ALU
wire [32:0] result;

assign result = 
            op[4:0] == 0 ? add :
            op[4:0] == 1 ? adc :
            ...
            shiftq ? {1'b0, mult64[31:0]} :
            ...
            ;

The first half constructs rather than computes the power of two by creating a single 16 bit word with just a single bit set.

The second half selects the proper multiplier parts based on the instruction (regular multiplication or shift left)

The final part is about returning the result: it will be in the lower 32 bits of the combined results. Note that shifting by 32 bits should return zero but selecting for this explicit situation will add more LUTs to my design than I have currently available (using 5181 out of 5280). So for this implementation the behavior for shifts outside the range [0-31] is not defined.

Implementation notes

The code is simple because we do not need all multiplication and addition steps of a full 32 x 32 bit multiplication because if a number is a power of two, only one of the two 16 bits of the multiplier will be non zero (for shift amounts < 32).

Multiplying two 32 bit numbers involves four 16 bit multiplications (of each combination of the 16 bit halves of the multiplier and multiplicand). The four intermediate 32 bit results are then added to a 64 bit result.

If one of the halves of the multiplier is zero then two multiplication steps are no longer necessary as their result will be zero and the corresponding addition steps will be redundant too.



LUT Usage


Just to give some idea about the resources used by a barrel shifter vs. this multiplication based implementation I have created bare bone implementations (shiftleft.v and shiftleft2.v) and checked those with yosys/next-pnr.

shiftleft.v (barrel)shiftleft2.v (multiplier)
ICESTORM_LC19967
ICESTORM_DSP03

(side note: the stand alone multiplier implementation only uses 3 DSPs compared to the 4 used by the full ALU but that is because yosys optimizes away the multiplication of both upper halves of the words as they can only end up in the upper 32 bits of the result which we do not use for the left shift)

Sunday, 19 January 2020

seteq and setne instructions

Because the C99 standard (and newer) requires [section 6.5.8] comparison operators like < > <= => and logical (non-bitwise) operators like && and || to return either zero or one even though any non-zero value in a logical expression will be treated as true the code that my C-compiler generates for the operators is rather bulky, just to stay standard compliant.

The reason for this is because I have not implemented any convenient instruction to convert a non-zero value to one. So the code for the return statement in the code below

void showcase_compare(int a){
 return a == 42;
}

is converted to the assembly snippet show below (a is R2, 42 in r3)

        load    flags,#alu_cmp      ; binop(==)
        alu     r2,r3,r2            
        beq     post_0003           ; equal
        move    r2,0,0              
        bra     post_0004           
post_0003:
        move    r2,0,1              
post_0004:
So in order to get a proper one or zero we always have to branch.

Seteq and setne

To prevent this kind of unnecessary branching I added two new instructions to the Robin cpu: seteq and setne that set the contents of a register to either zero or one depending on the zero flag. The compiler can now use these instructions to simplify the code to:

        load    flags,#alu_cmp      ; binop(==)
        alu     r2,r3,r2            
        seteq   r2
This saves not only 3 instructions in code size, but also 2 or 3 instructions being executed (2 if equal, 3 if not equal).

Setpos and setmin


To complete the set and make it easier to produce code for the < <= > and >= operators the setpos and setmin instructions are also implemented.

Thursday, 16 January 2020

Additional instructions

The Robin cpu/soc is coming along nicely but when i started playing around with implementing a compiler it became quickly clear that code generation was hindered by not having relative branch instructions that could reach destinations beyond -128 or +127 bytes (a 8-bit signed integer).

Long branch

So I expanded the instruction set to take a full 32-bit signed offset. If the 8bit offset is zero, the next 4 bytes will be used as a the offset. The complete instruction now looks like this:

[15:12] opcode (13)
[11: 8] condition
[ 7: 0] offset
Optional: 4 bytes offset (if offset == 0)

The condition is used to check against the flags register. The highest bit of the condition determines if a flag should be set or unset and because bit 31 of the flags register is always 1 we even have an option for an unconditional branch (or even to never take the branch, which is rather useless)

if cond[2:0] & R13[31:29] == cond[3]  then PC += offset  ? offset : (PC)

Bit 30 and 29 of the flags register are the negative (sign) and zero bit respectively.

Stack instructions


I also added pop and push instructions to reduce code size, even though it is a bit at odds with the RISC philosophy. These always use R14 as the stack pointer and the opcode looks like this:

[15:12] opcode (15)
[11: 8] register
[ 7: 0] 1 = pop, 2 = push

Verilog observations


I have a few other instructions I wish to implement, for example to sign extend a byte to a long, but already i am using almost all available LUTs on the iCEbreaker.
There are a few options though: until now i have been using next-pnr's heap placer which is quite fast (just a few seconds on my machine). The sa placer however is much slower (more than 60 seconds) but also generates a result that saves me about 250 LUTs!
The second option is to play around with the numerical values of the state labels. This may sound weird but the current implementation of the cpu has 29 states, i.e. a 5 bit state register. If i number them consecutively from 0 - 28 yosys uses more LUTs than when I assign the last state the number 31. Apparently the huge multiplexer generated for this state machine benefits from gaps in the list of possible states.
In the end I intend to simplify and optimise this design but for now I stick with the sa placer.

Sunday, 12 January 2020

Compiler

Assembler is nice but to get a feel how well the SoC design fits day to day programming tasks I started crafting a small C compiler.

I probably should call it a compiler for a 'C-like language' because it implements a tiny subset of C, just enough to implement some basic functions. Currently it supports int and char as well as pointers and you can define and call functions. Control structures are limited to while, if/else and return but quite a few binary and unary operators have been implemented already.

Because the compiler is based on the pycparser module that can recognize the full C99 spec it will be rather straight forward to implement missing features.

Pain points


Even for the small string manipulation functions it quickly becomes clear that additional instructions for the CPU would be welcome. The biggest benefit would probably be to have:

  • Conditional branch instructions with a larger offset than just one byte.  
Even small functions may exceed offsets of just -128 to 127 bytes so this is a must have.

  • Pop/push instructions.

Currently implemented as two instructions, one to change the stack pointer and another to load or store the register. This approach makes it possible to use any register as a stack pointer but for compiled c we need just one.

  • Better byte loading.

If we load a byte into a register we often have to zero it out before load it. This way we can easily change just the lower byte of the flags register but otherwise it is less convenient.

  • alu operation to convert an int to a boolean

This would greatly reduce the overhead in expressiins involving && and ||

Plenty of room for improvement here 😁

Saturday, 4 January 2020

More memory: spram

The iCE40 up5k that is used on the iCEBreaker board provides another type of memory besides the ubiquitous block ram: single port ram (spram).
No less than 128 Kbytes are provided and although it is a little bit unclear to me at the moment how fast they are, theY seem to function quite well with two clock cycle delay, so I can integrate them with my current design without a any changes to the CPU.

Implementation

The 128 Kbytes are provided as four blocks, each 16k x 16bits. As far as I know Yosys does not yet offer automatic inference, which means we have to use the iCE40 primitives directly. This may sound complicated but it is not as hard as it sounds.
The blocks take a 14 bit address (i.e. can address 16K words) and will read or write 16 bits at the time. Because we are interested in 8 bit bytes rather than words we need to make sure we return or write either the upper half or the lower half of a word depending on the address. For reading this means simply selecting, for writing this means setting a writemask that will limit which bits of a 16 bit word are actually written on receiving a write enable signal. Such a write mask itself is not 16 bit wide but just 4: 1 bit for each 4 bit nibble. We make this selection based on bit 14.


Code

The code below (GitHub) shows the implementation details. We use all four SB_SPRAM256KA blocks available on the up5k and use the top two bits of the 17 bit address to select a block. Bit 14 is then used to calculate the write mask (called nibble mask here). The same nibble mask is also used to select either the high or low byte from any 16 bit word we read from any of the four blocks. Note that our module's input data (wdata) is a byte but we always write 16 bit words. To this end we simply double the incoming byte; whether we actually write to high or low byte is determined by the write mask we construct and pass to the .MASKWREN input of the blocks.

// byte addressable spram
// uses all 128MB

module spram (
 input clk,
 input wen,
 input [16:0] addr,
 input [7:0] wdata,
 output [7:0] rdata
);

wire cs_0 = addr[16:15] == 0;
wire cs_1 = addr[16:15] == 1;
wire cs_2 = addr[16:15] == 2;
wire cs_3 = addr[16:15] == 3;

wire nibble_mask_hi = addr[14];
wire nibble_mask_lo = !addr[14];

wire [15:0] wdata16 = {wdata, wdata};

wire [15:0] rdata_0,rdata_1,rdata_2,rdata_3;
wire [7:0] rdata_0b = nibble_mask_hi ? rdata_0[15:8] : rdata_0[7:0];
wire [7:0] rdata_1b = nibble_mask_hi ? rdata_1[15:8] : rdata_1[7:0];
wire [7:0] rdata_2b = nibble_mask_hi ? rdata_2[15:8] : rdata_2[7:0];
wire [7:0] rdata_3b = nibble_mask_hi ? rdata_3[15:8] : rdata_3[7:0];

assign rdata = cs_0 ? rdata_0b : cs_1 ? rdata_1b : cs_2 ? rdata_2b : rdata_3b;

SB_SPRAM256KA ram0
  (
    .ADDRESS(addr[13:0]),
    .DATAIN(wdata16),
    .MASKWREN({nibble_mask_hi, nibble_mask_hi, nibble_mask_lo, nibble_mask_lo}),
    .WREN(wen),
    .CHIPSELECT(cs_0),
    .CLOCK(clk),
    .STANDBY(1'b0),
    .SLEEP(1'b0),
    .POWEROFF(1'b1),
    .DATAOUT(rdata_0)
  );

SB_SPRAM256KA ram1
  (
    .ADDRESS(addr[13:0]),
    .DATAIN(wdata16),
    .MASKWREN({nibble_mask_hi, nibble_mask_hi, nibble_mask_lo, nibble_mask_lo}),
    .WREN(wen),
    .CHIPSELECT(cs_1),
    .CLOCK(clk),
    .STANDBY(1'b0),
    .SLEEP(1'b0),
    .POWEROFF(1'b1),
    .DATAOUT(rdata_1)
  );

SB_SPRAM256KA ram2
  (
    .ADDRESS(addr[13:0]),
    .DATAIN(wdata16),
    .MASKWREN({nibble_mask_hi, nibble_mask_hi, nibble_mask_lo, nibble_mask_lo}),
    .WREN(wen),
    .CHIPSELECT(cs_2),
    .CLOCK(clk),
    .STANDBY(1'b0),
    .SLEEP(1'b0),
    .POWEROFF(1'b1),
    .DATAOUT(rdata_2)
  );

SB_SPRAM256KA ram3
  (
    .ADDRESS(addr[13:0]),
    .DATAIN(wdata16),
    .MASKWREN({nibble_mask_hi, nibble_mask_hi, nibble_mask_lo, nibble_mask_lo}),
    .WREN(wen),
    .CHIPSELECT(cs_3),
    .CLOCK(clk),
    .STANDBY(1'b0),
    .SLEEP(1'b0),
    .POWEROFF(1'b1),
    .DATAOUT(rdata_3)
  );

endmodule

Notes

Because of the values passed to the standby sleep and poweroff inputs we effectively keep the everything running full blast and presumably consuming quite a lot of power (relatively speaking). Since i have no idea at the moment hiw long it would take to resume from standby, i leave it at that for now.

Friday, 3 January 2020

Divider module

Because software division is rather slow a hardware division implementation might be nice to have, even though it can eat lots of resources on your fpga (think hundreds of LUTs for a 32 bit implementation).
Also, unlike the regular operations in the ALU that can be performed completely combinatorial and therefore deliver a result instantly (i.e. in one cycle after fetching and decoding an instruction), a divider needs to perform a number of shifts and subtracts to calculate the quotient or the remainder.

Calling the divider module

Therefore the divider module needs to be able to signal to the cpu that it is done (that is, that the output reflects the final result) and also needs to be told to start. The code snippet below shows how the main CPU state machine deals with those div_go and div_available signals when the alu operation signifies that the divider module should be used.


DECODE  : begin
        state <= EXECUTE;
        if(alu_op[5]) div_go <= 1; // start the divider module if we have a divider operation
      end
EXECUTE : begin
        state <= WAIT;
        div_go <= 0;
        case(cmd)
          CMD_MOVEP:  begin
                  if(writable_destination) r[R2] <= sumr1r0;
                end
          CMD_ALU:  begin
                  if(~alu_op[5]) begin // regular alu operation (single cycle)
                    if(writable_destination) r[R2] <= alu_c;
                    r[13][28] <= alu_carry_out;
                    r[13][29] <= alu_is_zero;
                    r[13][30] <= alu_is_negative;
                  end else begin // divider operation (multiple cycles)
                    if(div_is_available) begin
                      if(writable_destination) r[R2] <= div_c;
                      r[13][29] <= div_is_zero;
                      r[13][30] <= div_is_negative;
                    end else
                      state <= EXECUTE; 
                  end
                end

Divider module implementation

The divider module is fairly large (and therefore resource heavy) because among other things it needs to be able to deal with the signs of the operands so there are multiple negations that take exclusive ors and additions over the full register width when implemented in hardware. I have annotated the source code below so it should be fairly straight forward to read. Note that the actual division part is a slightly adapted form of long division, sometimes referred to as "Kenyan division".

 module divider(
    input clk,
    input reset,
  input [31:0] a,
  input [31:0] b,
  input go,
  input divs,
  input remainder,
  output [31:0] c,
  output is_zero,
  output is_negative,
  output reg available
  );

  localparam DIV_SHIFTL    = 2'd0;
  localparam DIV_SUBTRACT  = 2'd1;
  localparam DIV_AVAILABLE = 2'd2;
  localparam DIV_DONE      = 2'd3;
  reg [1:0] step;

  reg [32:0] dividend;
  reg [32:0] divisor;
  reg [32:0] quotient, quotient_part;
  wire overshoot = divisor > dividend;
  wire division_by_zero = (b == 0);
  // for signed division the sign of the remainder is always equal 
  // to the sign of the dividend (a) while the sign of the quotient
  // is equal to the product of the sign of dividend and divisor
  // this to keep the following realation true
  // quotient * divisor + remainder == dividend
  wire signq = a[31] ^ b[31];
  wire sign = remainder ? a[31] : signq ;
  reg [31:0] result;
  wire [31:0] abs_a = a[31] ? -a : a;
  wire [31:0] abs_b = b[31] ? -b : b;

  always @(posedge clk) begin
    if(go) begin
      // on receiving the go signal we initializer all registers
      // we take care of taking the absolute values for
      // dividend and divisor. We skip any calculations of a
      // quotient if the divisor is zero.
      step <= division_by_zero ? DIV_AVAILABLE : DIV_SHIFTL;
      available <= 0;
      dividend  <= divs ? {1'b0, abs_a} : {1'b0, a};
      divisor   <= divs ? {1'b0, abs_b} : {1'b0, b};
      quotient  <= 0;
      quotient_part <= 1;
    end else
      case(step)
        // as long as the divisor is smaller than the dividend
        // we multiply the divisor and the quotient_part by 2
        // If no longer true, we correct by shifting everything
        // back. This means registers should by 33 bit instead
        // of 32 to accommodate the shifts.
        DIV_SHIFTL  :   begin
                  if(~overshoot) begin
                    divisor <= divisor << 1;
                    quotient_part <= quotient_part << 1;
                  end else begin
                    divisor <= divisor >> 1;
                    quotient_part <= quotient_part >> 1;
                    step <= DIV_SUBTRACT;
                  end
                end
        // the next state is all about subtracting the divisor
        // if it is smaller than the dividend. If it is, we
        // perform the subtraction and or in the quotient_part
        // into the quotient. Then divisor and quotient_part
        // are halved again until the quotient_part is zero, in
        // which case we are done.
        DIV_SUBTRACT: begin
                  if(quotient_part == 0)
                    step <= DIV_AVAILABLE;
                  else begin
                    if(~overshoot) begin
                      dividend <= dividend - divisor;
                      quotient <= quotient | quotient_part;
                    end 
                    divisor <= divisor >> 1;
                    quotient_part <= quotient_part >> 1;
                  end
                end
        // we signal availability of the result (for one clock)
        // to the cpu and set the result to the chosen option.
        DIV_AVAILABLE:  begin
                  step <= DIV_DONE;
                  available <= 1;
                  result <= remainder ? dividend[31:0] : quotient[31:0];
                end
        default   :   available <= 0;
      endcase
  end

  // these wires make sure that the correct sign correction is applied
  // and the relevant flags are returned.
  assign c = divs ? (sign ? -result : result) : result;
  assign is_zero = (c == 0);
  assign is_negative = c[31];

endmodule

Performance test

Because the Robin CPU provides a mark instruction to get the current clock counter, it is pretty easy to compare the number of clock cycles it takes to calculate a signed division and remainder in software versus a hardware instruction. The software implementation could probably be optimized a bit, although it already returns both quotient and remainder in one go, whereas this needs two instructions in hardware, but the difference is enormous:
It is interesting to note that less cycles are needed for bigger divisors. This is mainly due to needing less shifts of the divisor to match it up with the dividend. The hardware implementation could probably be made even faster if we would explicitly add shortcuts for small divisors (less than 256 perhaps), something extra worthwhile because dividing by small numbers is pretty common.

Code availability

The divider is part of the GitHub repository for the Robin SoC, the file is named divider.v

CPU design

The CPU design as currently implemented largely follows the diagram shown below. It features a 16 x 32bit register file and 16 bit instructi...