Long branch
So I expanded the instruction set to take a full 32-bit signed offset. If the 8-bit offset is zero, the next 4 bytes are used as the offset. The complete instruction now looks like this:
[15:12] opcode (13)
[11: 8] condition
[ 7: 0] offset
Optional: 4-byte offset (if offset == 0)
The condition is checked against the flags register. The highest bit of the condition determines whether the selected flag should be set or unset, and because bit 31 of the flags register is always 1, we even have an option for an unconditional branch (or even to never take the branch, which is rather useless).
if cond[2:0] & R13[31:29] == cond[3] then PC += offset ? offset : (PC)
Bits 30 and 29 of the flags register are the negative (sign) and zero bits, respectively.
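To make the condition check a bit more concrete, here is a small Python model of it. This is only a sketch and not the Verilog from the actual CPU. It assumes that the masked flag bits are OR-reduced before being compared with cond[3], and that the short 8-bit offset is sign-extended; neither is spelled out above. All function and constant names are made up for this illustration, and the long offset is passed in as an already assembled 32-bit value so that no byte order has to be assumed.

```python
# Top three bits of the flags register (R13), as described in the text.
FLAG_ALWAYS = 1 << 31  # always reads as 1
FLAG_NEG = 1 << 30     # negative (sign)
FLAG_ZERO = 1 << 29    # zero


def decode_branch(instr):
    """Split a 16-bit branch instruction into (opcode, cond, offset8)."""
    return (instr >> 12) & 0xF, (instr >> 8) & 0xF, instr & 0xFF


def branch_taken(cond, flags):
    """Evaluate the 4-bit condition against flags[31:29]."""
    mask = cond & 0b111           # cond[2:0] selects which flags to look at
    top = (flags >> 29) & 0b111   # flags[31:29]
    # Assumption: "any selected flag is set" (OR-reduction of the masked bits).
    any_selected = (mask & top) != 0
    want_set = bool((cond >> 3) & 1)  # cond[3]: branch on set or on unset
    return any_selected == want_set


def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def branch_offset(offset8, long_offset):
    """Pick the short offset, or the 32-bit one when the short offset is 0."""
    if offset8 == 0:
        return sign_extend(long_offset, 32)
    # Assumption: the short offset is signed as well.
    return sign_extend(offset8, 8)
```

With this model, cond = 0b1100 selects only the always-1 bit and asks for "set", which gives the unconditional branch, while cond = 0b0100 gives the never-taken variant. cond = 0b1001 would be "branch if zero" and cond = 0b0001 "branch if not zero".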
Stack instructions
[15:12] opcode (15)
[11: 8] register
[ 7: 0] 1 = pop, 2 = push
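As a quick illustration of this encoding, a decoder for it could look like the Python sketch below. The function name is hypothetical, and treating any low-byte value other than 1 or 2 as an error is my assumption here; the layout above only defines those two values.

```python
def decode_stack(instr):
    """Decode a 16-bit stack instruction into (operation, register)."""
    opcode = (instr >> 12) & 0xF
    if opcode != 15:
        raise ValueError("not a stack instruction")
    register = (instr >> 8) & 0xF
    # Low byte selects the operation: 1 = pop, 2 = push.
    operation = {1: "pop", 2: "push"}.get(instr & 0xFF)
    if operation is None:
        # Assumption: other values are invalid; the layout does not define them.
        raise ValueError("unknown stack operation")
    return operation, register
```

For example, decode_stack(0xF302) yields a push of register 3.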
Verilog observations
There are a few options though: until now I have been using nextpnr's heap placer, which is quite fast (just a few seconds on my machine). The sa (simulated annealing) placer is much slower (more than 60 seconds) but also generates a result that saves me about 250 LUTs!
The second option is to play around with the numerical values of the state labels. This may sound weird, but the current implementation of the CPU has 29 states, i.e. a 5-bit state register. If I number them consecutively from 0 to 28, yosys uses more LUTs than when I assign the last state the number 31. Apparently the huge multiplexer generated for this state machine benefits from gaps in the list of possible states.
In the end I intend to simplify and optimise this design, but for now I will stick with the sa placer.