Small FPGA things: verilog

Showing posts with label verilog. Show all posts

Sunday, 15 March 2020

Count leading zeros in verilog

Added a new article on the website dedicated to documenting the Robin SoC/CPU.

It documents (with code) the implementation of a new module to count leading zeros.

Saturday, 14 March 2020

Verilog test benches

Because of my very limited experience with Verilog all the test cases until now have been 'real' tests, i.e. they automate the execution of programs running on the actual hardware and verify expected results.

In a way these tests are better than just simulations because they verify that the final product actually runs on the hardware as intended but they are also cumbersome: until you have a fair bit of infrastructure in place (like a UART and a monitor program) you cannot test at all and also these additional components contain modules that should better be tested themselves first.

I had a bit of luck so it wasn't too difficult to get something working and then proceed from there but i still wanted to have some proper simulations in place to test individual modules like the ALU for example.

So I finally got around this and managed to create my first test bench. This test bench is for a new module I am developing to count the number of leading zeros (probably more on that in a future article). I run the test bench with the Icarus Verilog compiler (iverilog).

While running the test bench I noted a couple of oddities:

Unlike yosys, iverilog does not like postfix operators (like i--), so the following generate block gave an error


for(i=7; i>=0 ; i--) begin

    ...

end

The simple solution of course was to replace i-- with i=i-1, but it is still not completely clear to me what the exact differences are in Verilog versions supported by Yosys and Icarus.

Also, even though the developers are aware of this, iverilog has no option to return a non-zero return code: errors and fatal conditions only write messages to stdout. This means we have to check for specific strings to appear in the output in order to stop a Makefile. This isn't difficult and easily done with awk:


clz_tb: ../clz.v clz_tb.v
 $(VERILOG) -o $@ $^ ; vvp $@ | awk "/FATAL/{exit(1)}"

Doubts about Verilog

I can't tell at this point if VHDL or other hardware definition languages are any better but the longer i work with Verilog the more doubts i have: It does not clearly separate simulation from synthesis, its syntax (especially scoping rules for variables) is illogical, you can't define functions with more than one statement (at least not in verilog-2001) and every implementation is allowed to diverge from the standard by chosing to implement some features or not. I am not sure why people in the hardware world accept this, couldn't imagine this happening to Python implementations for example.

Anyway, it works sort of, so we'll see where it gets us; maybe with a bit more experience it will be less awkward to work with.

Saturday, 7 March 2020

Optimizing the fetch decode execute cycle II

In the last article we identified a couple of opportunities to decrease the cycles needed to execute the branch, pop and load immediate instructions. The key issue hear was that we weren't reading the bytes until after we started setting the mem_raddr register in the DECODE cycle.

Because we know the opcode for any instruction already in the FETCH3 cycle we can set the mem_raddr register with the contents of the stackpointer if we are dealing with a pop instruction or keep on incrementing the mem_raddr for those instructions that are followed by some bytes after the instructions itself, like the two byte offset fir the branch instruction and the four bytes of the load immediate instruction. And if we set the mem_raddr register two cycles earlier that means that we can actually read those bytes two cycles earlier as well.

This newly implemented scenario is summed up in the table below (click to enlarge)

The highlighted areas show where the changes are. From the second column we can see that we are setting or updating the mem_raddr register for every cycle from FETCH1 to EXEC3, and reading a byte in every cycle from FETCH3 to EXEC5.

This means that for the load immediate and pop instructions we're done in EXEC3 and for the branch instruction even one cycle earlier (not two cycles because although we read two bytes less, we also need to add the offset to the program counter and that takes a cycle).

Some more opportunities

There are still a few opportunities left for optimization for the mover, load byte and the push instruction and i'll probably discuss that in a future article.

Saturday, 29 February 2020

The Robin SoC has a dedicated website now

I started documenting the design of the Robin SoC (and in particular the CPU) in a more structured manner than just a Wiki.

It is implemented as a GitHub site, check it out from time to time as articles get added.

Sunday, 23 February 2020

The Robin SoC on the iCEbreaker: current status

The main focus in the last couple of weeks has been on the simplification of the CPU and ALU.

Simplification

The main decoding loop in the CPU was rather convoluted so both were redesigned a bit to improve readability of the Verilog code as well as reduce resource consumption. (The ALU code was updated in place, the CPU code got a new file)

Because by now I also have some experience with the code that is being generated by the compiler, I was able to remove unused instructions and ALU operations. Previously the pop, push and setxxx instructions were considered sub-instructions within one special opcode, now they are individual instructions (in case of pop and push) or rolled into a single set-and-branch instruction. The new instruction set architecture was highlighted in a separate article.

Less resources

All in all this redesign shrunk the number of LUTs consumed from 5279 (yes, just one LUT removed from 100%) to 4879 (92%), which is pretty neat because it leaves some room for additional functionality or tweaks. The biggest challenge by the way is Yosys: even slight changes in the design, like assigning different values to labels of a case statement that is not full, may result in a different number of LUTs consumed. This is something that needs some more research, maybe Yosys offers additional optimization options that let me get the lowest resource count in a more predictable manner.

Better testing

A significant amount of effort was spent on designing more and better regression tests. Both for the SoC and the support programs (assembler, simulator, ...) regression tests and syntax checkers were added. Most of these were also added to GitHub push actions, with the exception of the actual hardware tests because I cannot run those on GitHub. And of course this mainly done to show a few green banners on the repository home page 😀

Bug fixes

With a better testing framework in place it is far easier to check whether changes don't inadvertently break something. This was put to work in fixing one of the more annoying bugs left in the ALU design: previously shift left by more than 31 and shift right by 0 did not give a proper result. This is now fixed.

Frustrations

The up5k on the iCEbreaker board has 8 dsp cores. We currently use 4 of them to implement 32x32 bit multiplication. The SB_MAC16 primitives we use for this are inferred by Yosys from some multiplication statements we use in the ALU (i.e. we do not instantiate them directly) and work fine.
However, when I want to instantiate some of them directly and configure them to be used as 32 bit adders these instantiations will still multiply instead of add! No matter what I do, the result stays teh same. I have to admit I have no idea how Yosys infers stuff so it might very well be that my direct instantiation gets rewritten by some Yosys stage, so I will have to do some more research here.

What next?

I think next on the agenda is performance: I think I use too many read states for the fetch/decode/execute cycle. The Lattice technical documentation seems to imply we can read and write new data every clock cycle, at least for block ram. Unfortunately the docs for the SPRAM are less clear. Anyway, this area for sure needs some attention.

Saturday, 8 February 2020

A simulator for the Robin cpu

When I started playing around with a SoC design that I wanted to implement on the iCEbreaker, I quickly realized that without proper testing tools even a moderately ambiteous design would quickly become too complex to change and improve.

There exist of course tools to simulate verilog designs and even perform formal verification but my skill level is not quite up to that yet. On top of that I am convinced that many changes that I want to try out in the cpu would benefit from regression tests that are based on real code, i.e. code generated by a compiler instead of artificial tiny bits of code: code that you do not directly implement yourself tends to expose issues in the instruction set or bugs in it hardware implementation quicker than when you deliberately try to construct tiny test cases.

For these more realistic tests an assembler and a C compiler were created and they were used to implement small string and floating point libraries mimicking some of the functions in the C standard library. And they proved their worth as they uncovered among other things bugs in the handling of conditional branches for example.

However, as we will use the assembler and compiler to perform regression tests on the cpu it is important that these tools themselves are as bug free as possible, even when we add new functionality or change implementation details. Ideally some contineous integration would be implemented using GitHub actions that would be triggered on every push.

There is one catch here though: we cannot perform the final test in our chain of dependencies simply because the GitHub machines do not have an iCEbreaker board attached ☺️

We can deal with this challenge by creating a program the will simulate the cpu we have implemented on our fpga. This way we should be able to perform the tests for the compiler/assembler toolchain against this simulator with the added benefit of having more debugging options available (because they are much easier to implement in a bit of Python that in our resource constrained hardware.

The first version of this simulator is now commited and i hope to create some contineous integration actions in the near future.

Tuesday, 21 January 2020

Turning things around: Implementing shift instructions using multiplications

Often when cpu instruction sets lack a direct multiplication operation, people resort to implementing multiplication by using combinations of shift and add instructions. Even when a multiplication instruction is available, multiplication by simple powers of two might be faster when performed in a single shift operation that is executed in a single clock cycle than with a multiplication instruction that may take many cycles.

Implementing a shift instruction that can shift a 32 bit register by an arbitrary number of bits can consume a lot of resources though. On a the Lattice up5k i found that it could easily use hundreds of LUTs. (The exact number depends on various things other than register size because placement by next-pnr has some randomness and some additional LUTs might be consumed to meet fan-out and timing requirements, so a design size might change considerably even when changing just a few bits. The multiplexers alone already will consume 160 LUTs)

I didn't have that many resources left for my design so i either had to economize or devise a cunning plan 🙂

Turning things around

The up5k on the iCEbreaker board does have something the iCEstick hx1k didn't have: dsp cores, i.e. fast multipliers (at 12Mhz they operate in less than a clock cycle). In fact the up5k has eight dsp cores so i already implemented 32 x 32 bit multiplication using 4 of those, but I still want to have variable shift instructions because they might be needed in all sorts of bit twiddling operations used when implementing a soft floating point library for example.

The fun bit is that we can reuse the multiplication units here if we convert the variable shift amount into a power of two. Because calculating the power of two is simply setting a single bit in an otherwise empty register, this takes far less resources.

The verilog code for this part of the ALU is shown below (ALU code ob GitHub)


// first part: calculate a power of two
wire shiftq    = op[4:0] == 12;     // true if operaration is shift left
wire shiftlo   = shiftq & ~b[4];    // true if shifting < 16 bits
wire shifthi   = shiftq &  b[4];    // true if shifting >= 16 bits

// determine power of two
wire shiftla0  = b[3:0]  == 4'd0;   // 2^0 = 1
wire shiftla1  = b[3:0]  == 4'd1;   // 2^1 = 2
wire shiftla2  = b[3:0]  == 4'd2;   // 2^2 = 3
wire shiftla3  = b[3:0]  == 4'd3;   // ... etc 
...
wire shiftla15 = b[3:0]  == 4'd15;

// combine into 16 bit word
wire [15:0] shiftla16 = {shiftla15,shiftla14,shiftla13,shiftla12,
                         shiftla11,shiftla10,shiftla9 ,shiftla8 ,
                         shiftla7 ,shiftla6 ,shiftla5 ,shiftla4 ,
                         shiftla3 ,shiftla2 ,shiftla1 ,shiftla0};

// second part: reusing the multiplication code
// 4 16x16 bit partial multiplications
// the multiplier is either the b operand or a power of two for a shift
// note that b[31:16] for shift operations [31-0] is always zero
// so when shiftlo is true al_bh and ah_bh still result in zero
// the same is not true the other way around hence the extra shiftq check
// note that the behavior is undefined for shifts > 31

wire [31:0] mult_al_bl = a[15: 0] * (shiftlo ? shiftla16 : shiftq ? 16'b0 : b[15: 0]);
wire [31:0] mult_al_bh = a[15: 0] * (shifthi ? shiftla16 : b[31:16]);
wire [31:0] mult_ah_bl = a[31:16] * (shiftlo ? shiftla16 : shiftq ? 16'b0 : b[15: 0]);
wire [31:0] mult_ah_bh = a[31:16] * (shifthi ? shiftla16 : b[31:16]);

// combine the intermediate results into a 64 bit result
wire [63:0] mult64 = {32'b0,mult_al_bl} + {16'b0,mult_al_bh,16'b0}
                   + {16'b0,mult_ah_bl,16'b0} + {mult_ah_bh,32'b0};

// final part: compute the result of the whole ALU
wire [32:0] result;

assign result = 
            op[4:0] == 0 ? add :
            op[4:0] == 1 ? adc :
            ...
            shiftq ? {1'b0, mult64[31:0]} :
            ...
            ;

The first half constructs rather than computes the power of two by creating a single 16 bit word with just a single bit set.

The second half selects the proper multiplier parts based on the instruction (regular multiplication or shift left)

The final part is about returning the result: it will be in the lower 32 bits of the combined results. Note that shifting by 32 bits should return zero but selecting for this explicit situation will add more LUTs to my design than I have currently available (using 5181 out of 5280). So for this implementation the behavior for shifts outside the range [0-31] is not defined.

Implementation notes

The code is simple because we do not need all multiplication and addition steps of a full 32 x 32 bit multiplication because if a number is a power of two, only one of the two 16 bits of the multiplier will be non zero (for shift amounts < 32).

Multiplying two 32 bit numbers involves four 16 bit multiplications (of each combination of the 16 bit halves of the multiplier and multiplicand). The four intermediate 32 bit results are then added to a 64 bit result.

If one of the halves of the multiplier is zero then two multiplication steps are no longer necessary as their result will be zero and the corresponding addition steps will be redundant too.

LUT Usage

Just to give some idea about the resources used by a barrel shifter vs. this multiplication based implementation I have created bare bone implementations (shiftleft.v and shiftleft2.v) and checked those with yosys/next-pnr.

	shiftleft.v (barrel)	shiftleft2.v (multiplier)
ICESTORM_LC	199	67
ICESTORM_DSP	0	3

(side note: the stand alone multiplier implementation only uses 3 DSPs compared to the 4 used by the full ALU but that is because yosys optimizes away the multiplication of both upper halves of the words as they can only end up in the upper 32 bits of the result which we do not use for the left shift)

Thursday, 16 January 2020

Additional instructions

The Robin cpu/soc is coming along nicely but when i started playing around with implementing a compiler it became quickly clear that code generation was hindered by not having relative branch instructions that could reach destinations beyond -128 or +127 bytes (a 8-bit signed integer).

Long branch

So I expanded the instruction set to take a full 32-bit signed offset. If the 8bit offset is zero, the next 4 bytes will be used as a the offset. The complete instruction now looks like this:

[15:12] opcode (13)
[11: 8] condition
[ 7: 0] offset
Optional: 4 bytes offset (if offset == 0)

The condition is used to check against the flags register. The highest bit of the condition determines if a flag should be set or unset and because bit 31 of the flags register is always 1 we even have an option for an unconditional branch (or even to never take the branch, which is rather useless)

if cond[2:0] & R13[31:29] == cond[3]  then PC += offset  ? offset : (PC)

Bit 30 and 29 of the flags register are the negative (sign) and zero bit respectively.

Stack instructions

I also added pop and push instructions to reduce code size, even though it is a bit at odds with the RISC philosophy. These always use R14 as the stack pointer and the opcode looks like this:

[15:12] opcode (15)
[11: 8] register
[ 7: 0] 1 = pop, 2 = push

Verilog observations

I have a few other instructions I wish to implement, for example to sign extend a byte to a long, but already i am using almost all available LUTs on the iCEbreaker.
There are a few options though: until now i have been using next-pnr's heap placer which is quite fast (just a few seconds on my machine). The sa placer however is much slower (more than 60 seconds) but also generates a result that saves me about 250 LUTs!
The second option is to play around with the numerical values of the state labels. This may sound weird but the current implementation of the cpu has 29 states, i.e. a 5 bit state register. If i number them consecutively from 0 - 28 yosys uses more LUTs than when I assign the last state the number 31. Apparently the huge multiplexer generated for this state machine benefits from gaps in the list of possible states.
In the end I intend to simplify and optimise this design but for now I stick with the sa placer.

Friday, 3 January 2020

Divider module

Because software division is rather slow a hardware division implementation might be nice to have, even though it can eat lots of resources on your fpga (think hundreds of LUTs for a 32 bit implementation).

Also, unlike the regular operations in the ALU that can be performed completely combinatorial and therefore deliver a result instantly (i.e. in one cycle after fetching and decoding an instruction), a divider needs to perform a number of shifts and subtracts to calculate the quotient or the remainder.

Calling the divider module

Therefore the divider module needs to be able to signal to the cpu that it is done (that is, that the output reflects the final result) and also needs to be told to start. The code snippet below shows how the main CPU state machine deals with those div_go and div_available signals when the alu operation signifies that the divider module should be used.


DECODE  : begin
        state <= EXECUTE;
        if(alu_op[5]) div_go <= 1; // start the divider module if we have a divider operation
      end
EXECUTE : begin
        state <= WAIT;
        div_go <= 0;
        case(cmd)
          CMD_MOVEP:  begin
                  if(writable_destination) r[R2] <= sumr1r0;
                end
          CMD_ALU:  begin
                  if(~alu_op[5]) begin // regular alu operation (single cycle)
                    if(writable_destination) r[R2] <= alu_c;
                    r[13][28] <= alu_carry_out;
                    r[13][29] <= alu_is_zero;
                    r[13][30] <= alu_is_negative;
                  end else begin // divider operation (multiple cycles)
                    if(div_is_available) begin
                      if(writable_destination) r[R2] <= div_c;
                      r[13][29] <= div_is_zero;
                      r[13][30] <= div_is_negative;
                    end else
                      state <= EXECUTE; 
                  end
                end

Divider module implementation

The divider module is fairly large (and therefore resource heavy) because among other things it needs to be able to deal with the signs of the operands so there are multiple negations that take exclusive ors and additions over the full register width when implemented in hardware. I have annotated the source code below so it should be fairly straight forward to read. Note that the actual division part is a slightly adapted form of long division, sometimes referred to as "Kenyan division".


 module divider(
    input clk,
    input reset,
  input [31:0] a,
  input [31:0] b,
  input go,
  input divs,
  input remainder,
  output [31:0] c,
  output is_zero,
  output is_negative,
  output reg available
  );

  localparam DIV_SHIFTL    = 2'd0;
  localparam DIV_SUBTRACT  = 2'd1;
  localparam DIV_AVAILABLE = 2'd2;
  localparam DIV_DONE      = 2'd3;
  reg [1:0] step;

  reg [32:0] dividend;
  reg [32:0] divisor;
  reg [32:0] quotient, quotient_part;
  wire overshoot = divisor > dividend;
  wire division_by_zero = (b == 0);
  // for signed division the sign of the remainder is always equal 
  // to the sign of the dividend (a) while the sign of the quotient
  // is equal to the product of the sign of dividend and divisor
  // this to keep the following realation true
  // quotient * divisor + remainder == dividend
  wire signq = a[31] ^ b[31];
  wire sign = remainder ? a[31] : signq ;
  reg [31:0] result;
  wire [31:0] abs_a = a[31] ? -a : a;
  wire [31:0] abs_b = b[31] ? -b : b;

  always @(posedge clk) begin
    if(go) begin
      // on receiving the go signal we initializer all registers
      // we take care of taking the absolute values for
      // dividend and divisor. We skip any calculations of a
      // quotient if the divisor is zero.
      step <= division_by_zero ? DIV_AVAILABLE : DIV_SHIFTL;
      available <= 0;
      dividend  <= divs ? {1'b0, abs_a} : {1'b0, a};
      divisor   <= divs ? {1'b0, abs_b} : {1'b0, b};
      quotient  <= 0;
      quotient_part <= 1;
    end else
      case(step)
        // as long as the divisor is smaller than the dividend
        // we multiply the divisor and the quotient_part by 2
        // If no longer true, we correct by shifting everything
        // back. This means registers should by 33 bit instead
        // of 32 to accommodate the shifts.
        DIV_SHIFTL  :   begin
                  if(~overshoot) begin
                    divisor <= divisor << 1;
                    quotient_part <= quotient_part << 1;
                  end else begin
                    divisor <= divisor >> 1;
                    quotient_part <= quotient_part >> 1;
                    step <= DIV_SUBTRACT;
                  end
                end
        // the next state is all about subtracting the divisor
        // if it is smaller than the dividend. If it is, we
        // perform the subtraction and or in the quotient_part
        // into the quotient. Then divisor and quotient_part
        // are halved again until the quotient_part is zero, in
        // which case we are done.
        DIV_SUBTRACT: begin
                  if(quotient_part == 0)
                    step <= DIV_AVAILABLE;
                  else begin
                    if(~overshoot) begin
                      dividend <= dividend - divisor;
                      quotient <= quotient | quotient_part;
                    end 
                    divisor <= divisor >> 1;
                    quotient_part <= quotient_part >> 1;
                  end
                end
        // we signal availability of the result (for one clock)
        // to the cpu and set the result to the chosen option.
        DIV_AVAILABLE:  begin
                  step <= DIV_DONE;
                  available <= 1;
                  result <= remainder ? dividend[31:0] : quotient[31:0];
                end
        default   :   available <= 0;
      endcase
  end

  // these wires make sure that the correct sign correction is applied
  // and the relevant flags are returned.
  assign c = divs ? (sign ? -result : result) : result;
  assign is_zero = (c == 0);
  assign is_negative = c[31];

endmodule

Performance test

Because the Robin CPU provides a mark instruction to get the current clock counter, it is pretty easy to compare the number of clock cycles it takes to calculate a signed division and remainder in software versus a hardware instruction. The software implementation could probably be optimized a bit, although it already returns both quotient and remainder in one go, whereas this needs two instructions in hardware, but the difference is enormous:

It is interesting to note that less cycles are needed for bigger divisors. This is mainly due to needing less shifts of the divisor to match it up with the dividend. The hardware implementation could probably be made even faster if we would explicitly add shortcuts for small divisors (less than 256 perhaps), something extra worthwhile because dividing by small numbers is pretty common.

Code availability

The divider is part of the GitHub repository for the Robin SoC, the file is named divider.v

Wednesday, 1 January 2020

ALU

Currently the ALU is a pretty straight forward pure combinatorial design. That isn't something we can keep up forever because the Lattice up5k on the iCEbreaker has dsp cores that provide fast multiplication, we will have to implement division ourselves.

Nevertheless i present the current implementation as is (mainly to test the verilog syntax highlighting capabilities of highlight.js :-) )


module alu(
 input [31:0] a,
 input [31:0] b,
 input carry_in,
 input [7:0] op,
 output [31:0] c,
 output carry_out,
 output is_zero,
 output is_negative
 );

 wire [32:0] add = {0, a} + {0, b};
 wire [32:0] adc = add + { 32'd0, carry_in};
 wire [32:0] sub = {0, a} - {0, b};
 wire [32:0] sbc = sub - { 32'd0, carry_in};
 wire [32:0] b_and = {0, a & b};
 wire [32:0] b_or  = {0, a | b};
 wire [32:0] b_xor = {0, a ^ b};
 wire [32:0] b_not = {0,~a    };
 wire [32:0] extend = {a[31],a};
 wire [32:0] min_a = -extend;
 wire [32:0] cmp = sub[32] ? 33'h1ffff_ffff : sub == 0 ? 0 : 1;
 wire [32:0] shiftl = {a[31:0],1'b0};
 wire [32:0] shiftr = {a[0],1'b0,a[31:1]};
 wire [31:0] mult_al_bl = a[15: 0] * b[15: 0];
 wire [31:0] mult_al_bh = a[15: 0] * b[31:16];
 wire [31:0] mult_ah_bl = a[31:16] * b[15: 0];
 wire [31:0] mult_ah_bh = a[31:16] * b[31:16];
 wire [63:0] mult64 = {32'b0,mult_al_bl} + {16'b0,mult_al_bh,16'b0} 
                    + {16'b0,mult_ah_bl,16'b0} + {mult_ah_bh,32'b0};

 wire [32:0] result;

 always @(*) begin
  result= op == 0 ? add :
    op == 1 ? adc :
    op == 2 ? sub :
    op == 3 ? sbc :

    op == 4 ? b_or :
    op == 5 ? b_and :
    op == 6 ? b_not :
    op == 7 ? b_xor :

    op == 8 ? cmp :
    op == 9 ? {1'b0, a} :

    op == 12 ? shiftl :
    op == 13 ? shiftr :

    op == 16 ? {17'b0, mult_al_bl} :
    op == 17 ? {1'b0, mult64[31:0]} :
    op == 18 ? {1'b0, mult64[63:32]} :
    33'b0;
 end

 assign c = result[31:0];
 assign carry_out = result[32];
 assign is_zero = (c == 0);
 assign is_negative = c[31];

endmodule

Rotating blinkenlights

As everbody knows, no fpga design is worth anything unless can blink your on board LEDs :-)

Now I am a long way still from documenting fully what i have implemented but the current implementation of the cpu is fully functional and is capable of running a small program that lights the leds on the iCEbreaker board in a rotating manner until a key is pressed.

The actual code can be found in blinkenlights.S and when writing the program I noticed that when working with bytes getting a byte into the low order bits of a register almost always requires two instruction: one to clear the register and a second one to actually load an immediate byte value.

Now this is convenient when loading the alu operation into the lower byte of the flags register without clearing the flags but in most other situations I am starting to doubt this implementation decision. That is one thing I want to think about.

Tuesday, 31 December 2019

CPU design

The CPU design as currently implemented largely follows the diagram shown below. It features a 16 x 32bit register file and 16 bit instructions. It has an ALU that performs actions on any two input registers and can write it back. The actual alu operation is encoded in the low byte of R13 (the flags register). This means choosing an ALU operation and performing it are two instructions. This does keep the instruction size down and allows for apply the same operation to different combinations of registers without an extra instruction. (How useful this is, is somethign we will have to see when we start writing real code).
(The opcodes and alu operations implemented are documented in this sheet)
Address operations (basically adding any two registers) are done by a separate adder. The verilog implementation of the current cpu can be found in the GitHub repo (cpu.v, alu.v).