Small FPGA things: fpga

Friday, 3 January 2020

Divider module

Because software division is rather slow, a hardware divider is nice to have, even though it can eat a lot of resources on your FPGA (think hundreds of LUTs for a 32-bit implementation).

Also, unlike the regular ALU operations, which are purely combinational and therefore deliver a result right away (i.e. in one cycle after fetching and decoding an instruction), a divider needs to perform a number of shifts and subtractions to calculate the quotient or the remainder, and that takes multiple clock cycles.

Calling the divider module

Therefore the divider module needs to be told when to start, and it needs to be able to signal to the CPU that it is done (that is, that the output reflects the final result). The code snippet below shows how the main CPU state machine deals with the div_go and div_is_available signals when the ALU operation (bit 5 of alu_op set) signifies that the divider module should be used.


DECODE  : begin
        state <= EXECUTE;
        if(alu_op[5]) div_go <= 1; // start the divider module if we have a divider operation
      end
EXECUTE : begin
        state <= WAIT;
        div_go <= 0;
        case(cmd)
          CMD_MOVEP:  begin
                  if(writable_destination) r[R2] <= sumr1r0;
                end
          CMD_ALU:  begin
                  if(~alu_op[5]) begin // regular alu operation (single cycle)
                    if(writable_destination) r[R2] <= alu_c;
                    r[13][28] <= alu_carry_out;
                    r[13][29] <= alu_is_zero;
                    r[13][30] <= alu_is_negative;
                  end else begin // divider operation (multiple cycles)
                    if(div_is_available) begin
                      if(writable_destination) r[R2] <= div_c;
                      r[13][29] <= div_is_zero;
                      r[13][30] <= div_is_negative;
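                    end else
                      state <= EXECUTE; // result not ready yet: stay in EXECUTE and check again next cycle
                  end
                end
          // ... remaining commands omitted
        endcase
      end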

Divider module implementation

The divider module is fairly large (and therefore resource heavy) because, among other things, it needs to deal with the signs of the operands. That takes multiple negations, and in hardware each negation costs exclusive ors and an addition over the full register width. I have annotated the source code below, so it should be fairly straightforward to read. Note that the actual division part is a slightly adapted form of long division, sometimes referred to as "Kenyan division".


 module divider(
    input clk,
    input reset,
  input [31:0] a,
  input [31:0] b,
  input go,
  input divs,
  input remainder,
  output [31:0] c,
  output is_zero,
  output is_negative,
  output reg available
  );

  localparam DIV_SHIFTL    = 2'd0;
  localparam DIV_SUBTRACT  = 2'd1;
  localparam DIV_AVAILABLE = 2'd2;
  localparam DIV_DONE      = 2'd3;
  reg [1:0] step;

  reg [32:0] dividend;
  reg [32:0] divisor;
  reg [32:0] quotient, quotient_part;
  wire overshoot = divisor > dividend;
  wire division_by_zero = (b == 0);
  // for signed division the sign of the remainder is always equal
  // to the sign of the dividend (a), while the sign of the quotient
  // is the product of the signs of dividend and divisor.
  // This keeps the following relation true:
  // quotient * divisor + remainder == dividend
  wire signq = a[31] ^ b[31];
  wire sign = remainder ? a[31] : signq ;
  reg [31:0] result;
  wire [31:0] abs_a = a[31] ? -a : a;
  wire [31:0] abs_b = b[31] ? -b : b;

  always @(posedge clk) begin
    if(go) begin
      // on receiving the go signal we initialize all registers
      // and take the absolute values of dividend and divisor
      // for a signed division. We skip any calculation of a
      // quotient if the divisor is zero.
      step <= division_by_zero ? DIV_AVAILABLE : DIV_SHIFTL;
      available <= 0;
      dividend  <= divs ? {1'b0, abs_a} : {1'b0, a};
      divisor   <= divs ? {1'b0, abs_b} : {1'b0, b};
      quotient  <= 0;
      quotient_part <= 1;
    end else
      case(step)
        // as long as the divisor is not larger than the dividend
        // we multiply the divisor and the quotient_part by 2.
        // Once it is larger, we correct by shifting everything
        // back once. This means registers should be 33 bits instead
        // of 32 to accommodate the shifts.
        DIV_SHIFTL  :   begin
                  if(~overshoot) begin
                    divisor <= divisor << 1;
                    quotient_part <= quotient_part << 1;
                  end else begin
                    divisor <= divisor >> 1;
                    quotient_part <= quotient_part >> 1;
                    step <= DIV_SUBTRACT;
                  end
                end
        // the next state is all about subtracting the divisor
        // if it is smaller than the dividend. If it is, we
        // perform the subtraction and or in the quotient_part
        // into the quotient. Then divisor and quotient_part
        // are halved again until the quotient_part is zero, in
        // which case we are done.
        DIV_SUBTRACT: begin
                  if(quotient_part == 0)
                    step <= DIV_AVAILABLE;
                  else begin
                    if(~overshoot) begin
                      dividend <= dividend - divisor;
                      quotient <= quotient | quotient_part;
                    end 
                    divisor <= divisor >> 1;
                    quotient_part <= quotient_part >> 1;
                  end
                end
        // we signal availability of the result (for one clock)
        // to the cpu and set the result to the chosen option.
        DIV_AVAILABLE:  begin
                  step <= DIV_DONE;
                  available <= 1;
                  result <= remainder ? dividend[31:0] : quotient[31:0];
                end
        default   :   available <= 0;
      endcase
  end

  // these wires make sure that the correct sign correction is applied
  // and the relevant flags are returned.
  assign c = divs ? (sign ? -result : result) : result;
  assign is_zero = (c == 0);
  assign is_negative = c[31];

endmodule
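
To try the go/available handshake in a simulator, a small self-checking testbench along the following lines could be used. This is only a sketch: the testbench name, the check task and the chosen test values are my own assumptions and not part of the repository, and it expects the divider module listed above to be compiled alongside it. The inputs a, b, divs and remainder are held stable until the result is read, because the sign correction on c is derived from them combinationally.

```verilog
`timescale 1ns/1ps
// Hypothetical self-checking testbench (not part of the Robin repository).
// Compile it together with the divider module shown above.
module divider_tb;
  reg clk = 0;
  reg go = 0, divs = 0, remainder = 0;
  reg [31:0] a = 0, b = 0;
  wire [31:0] c;
  wire is_zero, is_negative, available;
  integer errors = 0;

  divider dut(
    .clk(clk), .reset(1'b1), .a(a), .b(b), .go(go), .divs(divs),
    .remainder(remainder), .c(c), .is_zero(is_zero),
    .is_negative(is_negative), .available(available));

  always #5 clk = ~clk;

  // Start one operation, wait for available, then compare c
  // with the expected value.
  task check(input [31:0] x, input [31:0] y, input is_signed,
             input want_remainder, input [31:0] expected);
    begin
      @(negedge clk);
      a = x; b = y; divs = is_signed; remainder = want_remainder;
      go = 1;                // sampled by the divider on the next rising edge
      @(negedge clk);
      go = 0;
      @(posedge available);  // high for one clock when the result is ready
      @(negedge clk);        // sample c while available is still high
      if(c !== expected) begin
        errors = errors + 1;
        $display("FAIL: a=%h b=%h c=%h expected=%h", x, y, c, expected);
      end
    end
  endtask

  initial begin
    check(32'd13,        32'd3, 0, 0, 32'd4);         // unsigned quotient
    check(32'd13,        32'd3, 0, 1, 32'd1);         // unsigned remainder
    check(32'hFFFF_FFF9, 32'd2, 1, 0, 32'hFFFF_FFFD); // -7 / 2 = -3 (truncates toward zero)
    check(32'hFFFF_FFF9, 32'd2, 1, 1, 32'hFFFF_FFFF); // -7 rem 2 = -1 (sign of the dividend)
    check(32'd5,         32'd0, 0, 0, 32'd0);         // division by zero: quotient reads as 0
    if(errors == 0) $display("all checks passed");
    $finish;
  end
endmodule
```

The signed cases illustrate the relation from the comments in the module: (-3) * 2 + (-1) == -7.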

Performance test

Because the Robin CPU provides a mark instruction to get the current clock counter, it is pretty easy to compare the number of clock cycles needed to calculate a signed division and remainder in software with the number needed by the hardware instructions. The software implementation could probably be optimized a bit, and it already returns both quotient and remainder in one go where the hardware needs two instructions, but the difference is enormous:

It is interesting to note that fewer cycles are needed for bigger divisors. This is mainly because fewer shifts of the divisor are needed to line it up with the dividend. The hardware implementation could probably be made even faster by explicitly adding shortcuts for small divisors (less than 256 perhaps), which would be extra worthwhile because dividing by small numbers is pretty common.
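
The dependence on the divisor can be made a bit more concrete by counting states in the divider module listed earlier. The estimate below is my own reading of that state machine, not a measured figure, and it only counts clock edges inside the divider (the surrounding fetch and decode cycles of the CPU are not included). Here D and d are the magnitudes of dividend and divisor (d > 0), and n is the number of left shifts needed before the divisor exceeds the dividend.

```latex
n =
\begin{cases}
  0 & \text{if } D < d \\
  \lfloor \log_2 (D / d) \rfloor + 1 & \text{if } D \ge d
\end{cases}
\qquad
\text{cycles after go}
  = \underbrace{(n + 1)}_{\texttt{DIV\_SHIFTL}}
  + \underbrace{(n + 1)}_{\texttt{DIV\_SUBTRACT}}
  + \underbrace{1}_{\texttt{DIV\_AVAILABLE}}
  = 2n + 3
```

For example, 13 / 3 gives n = 3 and therefore 9 clock edges after the one that samples go, while 13 / 13 gives n = 1 and only 5. In the worst case for unsigned 32-bit operands (D = 2^32 - 1, d = 1) n is 32, which is also why the registers are 33 bits wide.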

Code availability

The divider is part of the GitHub repository for the Robin SoC; the file is named divider.v.

Thursday, 2 January 2020

Monitor program, reset button

Sometimes you make a stupid mistake like creating an endless loop or programming a delay that takes 500 seconds instead of 500 ms. In those cases a reset button would be convenient 😃

To this end I wired a debounced signal from the iCEbreaker user button (the one near the USB connector) to the reset wire we already have in place for all the modules.

The debounce module was copied from Nandland and adapted to be usable for negative logic buttons as well (the user button is high when not pressed).

We don't want to fire reset_button events every clock cycle as long as the button is pressed so we keep some state and only generate a reset_button event when the state changes from not pressed to pressed. The code is committed to the repository and shown below. The idea is to make the other buttons on the iCEbreaker board available to the cpu via a memory mapped interface.


// button wires
wire user_button, button1, button2,button3;
debounce #(.INITIAL_STATE(1'b1)) debounce_ubutton(CLK, BTN_N, user_button);

reg reset_button = 0;
reg user_button_pressed = 0;
always @(posedge CLK) begin
 reset_button <= 0;
 if(~ user_button & ~user_button_pressed) begin // pressed (negative logic)
  user_button_pressed <= 1;
  reset_button <= 1; // a one clock strobe on pressing
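 end else if(user_button) begin // released: re-arm for the next press
  user_button_pressed <= 0;      // (clearing this while the button is still held would re-trigger the strobe)
 end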
end
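
A common alternative way to get a single strobe per press is to remember whether the button was pressed in the previous clock cycle and compare that with the current sample. The sketch below illustrates this; the module name press_strobe, its port names and the testbench are hypothetical and not taken from the repository. It assumes the button input is already debounced and active low, like the user button above.

```verilog
`timescale 1ns/1ps
// Hypothetical helper (not from the repository): emits a one-clock strobe
// when an already debounced, active-low button goes from released to pressed.
module press_strobe(
  input clk,
  input button_n,          // 1 = released, 0 = pressed
  output reg strobe
  );
  reg was_pressed = 0;     // button state seen on the previous clock
  initial strobe = 0;
  always @(posedge clk) begin
    strobe      <= ~button_n & ~was_pressed; // pressed now, released before
    was_pressed <= ~button_n;
  end
endmodule

// Minimal testbench: hold the button for 10 clocks and count the strobes.
module press_strobe_tb;
  reg clk = 0;
  reg button_n = 1;
  wire strobe;
  integer strobes = 0;

  press_strobe dut(.clk(clk), .button_n(button_n), .strobe(strobe));

  always #5 clk = ~clk;
  always @(posedge clk) if(strobe) strobes = strobes + 1;

  initial begin
    repeat(3)  @(negedge clk);
    button_n = 0;            // press and hold
    repeat(10) @(negedge clk);
    button_n = 1;            // release
    repeat(3)  @(negedge clk);
    if(strobes == 1) $display("ok: exactly one strobe for one press");
    else             $display("FAIL: %0d strobes", strobes);
    $finish;
  end
endmodule
```

Holding the button for many clocks should still produce just one strobe, which is the property we want for a reset button.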

The u_error signal from the UART is now integrated in the reset logic, which makes it possible to also reset things by sending a break over the serial connection. (The UART core used does not provide a separate break signal but sets u_error on a received break.)


// global reset, active low: asserted (0) at startup, released (1) after the first clock,
// asserted again for one clock on a serial break or a user button press
reg reset = 0;
always @(posedge CLK) begin
 reset <= 1;
 if(u_error | reset_button) reset <= 0;
end

Wednesday, 1 January 2020

ALU

Currently the ALU is a pretty straightforward, purely combinational design. That isn't something we can keep up forever: although the Lattice UP5K on the iCEbreaker has DSP cores that provide fast multiplication, we will have to implement division ourselves.

Nevertheless I present the current implementation as is (mainly to test the Verilog syntax highlighting capabilities of highlight.js :-) )


module alu(
 input [31:0] a,
 input [31:0] b,
 input carry_in,
 input [7:0] op,
 output [31:0] c,
 output carry_out,
 output is_zero,
 output is_negative
 );

 // note: constants inside a concatenation must be sized, hence 1'b0 instead of 0
 wire [32:0] add = {1'b0, a} + {1'b0, b};
 wire [32:0] adc = add + { 32'd0, carry_in};
 wire [32:0] sub = {1'b0, a} - {1'b0, b};
 wire [32:0] sbc = sub - { 32'd0, carry_in};
 wire [32:0] b_and = {1'b0, a & b};
 wire [32:0] b_or  = {1'b0, a | b};
 wire [32:0] b_xor = {1'b0, a ^ b};
 wire [32:0] b_not = {1'b0,~a    };
 wire [32:0] extend = {a[31],a};
 wire [32:0] min_a = -extend;
 wire [32:0] cmp = sub[32] ? 33'h1ffff_ffff : sub == 0 ? 0 : 1;
 wire [32:0] shiftl = {a[31:0],1'b0};
 wire [32:0] shiftr = {a[0],1'b0,a[31:1]};
 wire [31:0] mult_al_bl = a[15: 0] * b[15: 0];
 wire [31:0] mult_al_bh = a[15: 0] * b[31:16];
 wire [31:0] mult_ah_bl = a[31:16] * b[15: 0];
 wire [31:0] mult_ah_bh = a[31:16] * b[31:16];
 wire [63:0] mult64 = {32'b0,mult_al_bl} + {16'b0,mult_al_bh,16'b0} 
                    + {16'b0,mult_ah_bl,16'b0} + {mult_ah_bh,32'b0};

 reg [32:0] result; // assigned in the always block below, so it must be a reg

 always @(*) begin
  result= op == 0 ? add :
    op == 1 ? adc :
    op == 2 ? sub :
    op == 3 ? sbc :

    op == 4 ? b_or :
    op == 5 ? b_and :
    op == 6 ? b_not :
    op == 7 ? b_xor :

    op == 8 ? cmp :
    op == 9 ? {1'b0, a} :

    op == 12 ? shiftl :
    op == 13 ? shiftr :

    op == 16 ? {17'b0, mult_al_bl} :
    op == 17 ? {1'b0, mult64[31:0]} :
    op == 18 ? {1'b0, mult64[63:32]} :
    33'b0;
 end

 assign c = result[31:0];
 assign carry_out = result[32];
 assign is_zero = (c == 0);
 assign is_negative = c[31];

endmodule

Rotating blinkenlights

As everybody knows, no FPGA design is worth anything unless it can blink your on-board LEDs :-)

Now I am still a long way from fully documenting what I have implemented, but the current implementation of the CPU is fully functional and is capable of running a small program that lights the LEDs on the iCEbreaker board in a rotating manner until a key is pressed.

The actual code can be found in blinkenlights.S. When writing the program I noticed that, when working with bytes, getting a byte into the low order bits of a register almost always requires two instructions: one to clear the register and a second one to actually load an immediate byte value.

Now this is convenient when loading the alu operation into the lower byte of the flags register without clearing the flags but in most other situations I am starting to doubt this implementation decision. That is one thing I want to think about.

Tuesday, 31 December 2019

CPU design

The CPU design as currently implemented largely follows the diagram shown below. It features a 16 x 32-bit register file and 16-bit instructions. It has an ALU that operates on any two input registers and can write the result back to a register. The actual ALU operation is encoded in the low byte of R13 (the flags register). This means choosing an ALU operation and performing it are two instructions. This does keep the instruction size down and allows for applying the same operation to different combinations of registers without an extra instruction. (How useful this is, is something we will have to see when we start writing real code.)
(The opcodes and alu operations implemented are documented in this sheet)
Address operations (basically adding any two registers) are done by a separate adder. The verilog implementation of the current cpu can be found in the GitHub repo (cpu.v, alu.v).