Friday 3 January 2020

Divider module

Because software division is rather slow a hardware division implementation might be nice to have, even though it can eat lots of resources on your fpga (think hundreds of LUTs for a 32 bit implementation).
Also, unlike the regular operations in the ALU that can be performed completely combinatorial and therefore deliver a result instantly (i.e. in one cycle after fetching and decoding an instruction), a divider needs to perform a number of shifts and subtracts to calculate the quotient or the remainder.

Calling the divider module

Therefore the divider module needs to be able to signal to the cpu that it is done (that is, that the output reflects the final result) and also needs to be told to start. The code snippet below shows how the main CPU state machine deals with those div_go and div_available signals when the alu operation signifies that the divider module should be used.


DECODE  : begin
        state <= EXECUTE;
        if(alu_op[5]) div_go <= 1; // start the divider module if we have a divider operation
      end
EXECUTE : begin
        state <= WAIT;
        div_go <= 0;
        case(cmd)
          CMD_MOVEP:  begin
                  if(writable_destination) r[R2] <= sumr1r0;
                end
          CMD_ALU:  begin
                  if(~alu_op[5]) begin // regular alu operation (single cycle)
                    if(writable_destination) r[R2] <= alu_c;
                    r[13][28] <= alu_carry_out;
                    r[13][29] <= alu_is_zero;
                    r[13][30] <= alu_is_negative;
                  end else begin // divider operation (multiple cycles)
                    if(div_is_available) begin
                      if(writable_destination) r[R2] <= div_c;
                      r[13][29] <= div_is_zero;
                      r[13][30] <= div_is_negative;
                    end else
                      state <= EXECUTE; 
                  end
                end

Divider module implementation

The divider module is fairly large (and therefore resource heavy) because among other things it needs to be able to deal with the signs of the operands so there are multiple negations that take exclusive ors and additions over the full register width when implemented in hardware. I have annotated the source code below so it should be fairly straight forward to read. Note that the actual division part is a slightly adapted form of long division, sometimes referred to as "Kenyan division".

 module divider(
    input clk,
    input reset,
  input [31:0] a,
  input [31:0] b,
  input go,
  input divs,
  input remainder,
  output [31:0] c,
  output is_zero,
  output is_negative,
  output reg available
  );

  localparam DIV_SHIFTL    = 2'd0;
  localparam DIV_SUBTRACT  = 2'd1;
  localparam DIV_AVAILABLE = 2'd2;
  localparam DIV_DONE      = 2'd3;
  reg [1:0] step;

  reg [32:0] dividend;
  reg [32:0] divisor;
  reg [32:0] quotient, quotient_part;
  wire overshoot = divisor > dividend;
  wire division_by_zero = (b == 0);
  // for signed division the sign of the remainder is always equal 
  // to the sign of the dividend (a) while the sign of the quotient
  // is equal to the product of the sign of dividend and divisor
  // this to keep the following realation true
  // quotient * divisor + remainder == dividend
  wire signq = a[31] ^ b[31];
  wire sign = remainder ? a[31] : signq ;
  reg [31:0] result;
  wire [31:0] abs_a = a[31] ? -a : a;
  wire [31:0] abs_b = b[31] ? -b : b;

  always @(posedge clk) begin
    if(go) begin
      // on receiving the go signal we initializer all registers
      // we take care of taking the absolute values for
      // dividend and divisor. We skip any calculations of a
      // quotient if the divisor is zero.
      step <= division_by_zero ? DIV_AVAILABLE : DIV_SHIFTL;
      available <= 0;
      dividend  <= divs ? {1'b0, abs_a} : {1'b0, a};
      divisor   <= divs ? {1'b0, abs_b} : {1'b0, b};
      quotient  <= 0;
      quotient_part <= 1;
    end else
      case(step)
        // as long as the divisor is smaller than the dividend
        // we multiply the divisor and the quotient_part by 2
        // If no longer true, we correct by shifting everything
        // back. This means registers should by 33 bit instead
        // of 32 to accommodate the shifts.
        DIV_SHIFTL  :   begin
                  if(~overshoot) begin
                    divisor <= divisor << 1;
                    quotient_part <= quotient_part << 1;
                  end else begin
                    divisor <= divisor >> 1;
                    quotient_part <= quotient_part >> 1;
                    step <= DIV_SUBTRACT;
                  end
                end
        // the next state is all about subtracting the divisor
        // if it is smaller than the dividend. If it is, we
        // perform the subtraction and or in the quotient_part
        // into the quotient. Then divisor and quotient_part
        // are halved again until the quotient_part is zero, in
        // which case we are done.
        DIV_SUBTRACT: begin
                  if(quotient_part == 0)
                    step <= DIV_AVAILABLE;
                  else begin
                    if(~overshoot) begin
                      dividend <= dividend - divisor;
                      quotient <= quotient | quotient_part;
                    end 
                    divisor <= divisor >> 1;
                    quotient_part <= quotient_part >> 1;
                  end
                end
        // we signal availability of the result (for one clock)
        // to the cpu and set the result to the chosen option.
        DIV_AVAILABLE:  begin
                  step <= DIV_DONE;
                  available <= 1;
                  result <= remainder ? dividend[31:0] : quotient[31:0];
                end
        default   :   available <= 0;
      endcase
  end

  // these wires make sure that the correct sign correction is applied
  // and the relevant flags are returned.
  assign c = divs ? (sign ? -result : result) : result;
  assign is_zero = (c == 0);
  assign is_negative = c[31];

endmodule

Performance test

Because the Robin CPU provides a mark instruction to get the current clock counter, it is pretty easy to compare the number of clock cycles it takes to calculate a signed division and remainder in software versus a hardware instruction. The software implementation could probably be optimized a bit, although it already returns both quotient and remainder in one go, whereas this needs two instructions in hardware, but the difference is enormous:
It is interesting to note that less cycles are needed for bigger divisors. This is mainly due to needing less shifts of the divisor to match it up with the dividend. The hardware implementation could probably be made even faster if we would explicitly add shortcuts for small divisors (less than 256 perhaps), something extra worthwhile because dividing by small numbers is pretty common.

Code availability

The divider is part of the GitHub repository for the Robin SoC, the file is named divider.v

CPU design

The CPU design as currently implemented largely follows the diagram shown below. It features a 16 x 32bit register file and 16 bit instructi...