Also, unlike the regular operations in the ALU that can be performed completely combinatorial and therefore deliver a result instantly (i.e. in one cycle after fetching and decoding an instruction), a divider needs to perform a number of shifts and subtracts to calculate the quotient or the remainder.
Calling the divider module
Therefore the divider module needs to be able to signal to the cpu that it is done (that is, that the output reflects the final result) and also needs to be told to start. The code snippet below shows how the main CPU state machine deals with those div_go and div_available signals when the alu operation signifies that the divider module should be used.
DECODE : begin
state <= EXECUTE;
if(alu_op[5]) div_go <= 1; // start the divider module if we have a divider operation
end
EXECUTE : begin
state <= WAIT;
div_go <= 0;
case(cmd)
CMD_MOVEP: begin
if(writable_destination) r[R2] <= sumr1r0;
end
CMD_ALU: begin
if(~alu_op[5]) begin // regular alu operation (single cycle)
if(writable_destination) r[R2] <= alu_c;
r[13][28] <= alu_carry_out;
r[13][29] <= alu_is_zero;
r[13][30] <= alu_is_negative;
end else begin // divider operation (multiple cycles)
if(div_is_available) begin
if(writable_destination) r[R2] <= div_c;
r[13][29] <= div_is_zero;
r[13][30] <= div_is_negative;
end else
state <= EXECUTE;
end
end
Divider module implementation
The divider module is fairly large (and therefore resource heavy) because among other things it needs to be able to deal with the signs of the operands so there are multiple negations that take exclusive ors and additions over the full register width when implemented in hardware. I have annotated the source code below so it should be fairly straight forward to read. Note that the actual division part is a slightly adapted form of long division, sometimes referred to as "Kenyan division".
module divider(
input clk,
input reset,
input [31:0] a,
input [31:0] b,
input go,
input divs,
input remainder,
output [31:0] c,
output is_zero,
output is_negative,
output reg available
);
localparam DIV_SHIFTL = 2'd0;
localparam DIV_SUBTRACT = 2'd1;
localparam DIV_AVAILABLE = 2'd2;
localparam DIV_DONE = 2'd3;
reg [1:0] step;
reg [32:0] dividend;
reg [32:0] divisor;
reg [32:0] quotient, quotient_part;
wire overshoot = divisor > dividend;
wire division_by_zero = (b == 0);
// for signed division the sign of the remainder is always equal
// to the sign of the dividend (a) while the sign of the quotient
// is equal to the product of the sign of dividend and divisor
// this to keep the following realation true
// quotient * divisor + remainder == dividend
wire signq = a[31] ^ b[31];
wire sign = remainder ? a[31] : signq ;
reg [31:0] result;
wire [31:0] abs_a = a[31] ? -a : a;
wire [31:0] abs_b = b[31] ? -b : b;
always @(posedge clk) begin
if(go) begin
// on receiving the go signal we initializer all registers
// we take care of taking the absolute values for
// dividend and divisor. We skip any calculations of a
// quotient if the divisor is zero.
step <= division_by_zero ? DIV_AVAILABLE : DIV_SHIFTL;
available <= 0;
dividend <= divs ? {1'b0, abs_a} : {1'b0, a};
divisor <= divs ? {1'b0, abs_b} : {1'b0, b};
quotient <= 0;
quotient_part <= 1;
end else
case(step)
// as long as the divisor is smaller than the dividend
// we multiply the divisor and the quotient_part by 2
// If no longer true, we correct by shifting everything
// back. This means registers should by 33 bit instead
// of 32 to accommodate the shifts.
DIV_SHIFTL : begin
if(~overshoot) begin
divisor <= divisor << 1;
quotient_part <= quotient_part << 1;
end else begin
divisor <= divisor >> 1;
quotient_part <= quotient_part >> 1;
step <= DIV_SUBTRACT;
end
end
// the next state is all about subtracting the divisor
// if it is smaller than the dividend. If it is, we
// perform the subtraction and or in the quotient_part
// into the quotient. Then divisor and quotient_part
// are halved again until the quotient_part is zero, in
// which case we are done.
DIV_SUBTRACT: begin
if(quotient_part == 0)
step <= DIV_AVAILABLE;
else begin
if(~overshoot) begin
dividend <= dividend - divisor;
quotient <= quotient | quotient_part;
end
divisor <= divisor >> 1;
quotient_part <= quotient_part >> 1;
end
end
// we signal availability of the result (for one clock)
// to the cpu and set the result to the chosen option.
DIV_AVAILABLE: begin
step <= DIV_DONE;
available <= 1;
result <= remainder ? dividend[31:0] : quotient[31:0];
end
default : available <= 0;
endcase
end
// these wires make sure that the correct sign correction is applied
// and the relevant flags are returned.
assign c = divs ? (sign ? -result : result) : result;
assign is_zero = (c == 0);
assign is_negative = c[31];
endmodule
Performance test
Because the Robin CPU provides a mark instruction to get the current clock counter, it is pretty easy to compare the number of clock cycles it takes to calculate a signed division and remainder in software versus a hardware instruction. The software implementation could probably be optimized a bit, although it already returns both quotient and remainder in one go, whereas this needs two instructions in hardware, but the difference is enormous:
It is interesting to note that less cycles are needed for bigger divisors. This is mainly due to needing less shifts of the divisor to match it up with the dividend. The hardware implementation could probably be made even faster if we would explicitly add shortcuts for small divisors (less than 256 perhaps), something extra worthwhile because dividing by small numbers is pretty common.
Code availability
The divider is part of the GitHub repository for the Robin SoC, the file is named divider.v