Homebrew CPU in an FPGA

Note

I wrote this page in 2003, and a few people over the years have taken the source and made their own CPUs. The text has been reformatted, some typos fixed, but apart from that it is just as it appeared all those years ago. This was an early attempt: The J1 Forth CPU is a cleaner design and a faster core.

I’ve been playing around with small computers - PICs, Scenix, Rabbits - for a few years now, and they never seemed to do quite what I wanted. So I decided to design my own small CPU, and implement it in an FPGA. I aimed low, so it’s pretty simple. Some highlights:

  • Stack-based 16-bit CPU
  • 1024 words of general purpose RAM
  • Fits in 140 slices in Xilinx Spartan II FPGA (the smallest device in the family is the $8 XC2S15, 196 slices)
  • 22 MIPS with a 66MHz basic clock

Some of the many things it doesn’t have:

  • registers
  • fancy instructions like multiply, divide, left-shift or subtract
  • interrupts

It actually can run real applications - a couple of things I’ve done with it so far:

  • controller for an in-car MP3 player (~900 instructions)
  • TCP/IP node (~700 instructions)

All the source code is in this directory.

  • VHDL source for the CPU is in two parts: xpu_header.inc and xpu_behavioral.inc.
  • The source for the assembler - it’s a single lex input grammar - is asm.l.
  • An example VHDL file that uses the CPU is c2a.vhd, and source code for the CPU is in xpumain.sl.

Programming Model

There are two seperate stacks: one for data and one for call-return. Both stacks are 16 levels deep.

I followed a FORTH naming convention for the instructions:

  • Stack manipulation:

    dup                                     # a -- a a
    drop                                    # a --
    over                                    # a b -- a b a
    swap                                    # a b -- b a
    nip                                     # a b -- b
    
  • Arithmetic:

    +                                       # a b -- (a+b)
    not                                     # a -- ~a
    and                                     # a b -- (a&b)
    or                                      # a b -- (a|b)
    xor                                     # a b -- (a^b)
    rshift1                                 # a -- (a>>1)
    
  • Memory load and store:

    @                                       # a -- mem[a]
    !                                       # a b -- (a written to mem[b])
    
  • I/O:

    in                                      # -- a
    out                                     # a b --
    
  • Push immediate:

    N                                       # -- N
    

These are the control flow instructions:

  • jump
  • call
  • return

Both jump and call can be predicated (that is, conditionally executed) depending on the value on top of the stack. Also, a jump or call can optionally pop the stack, i.e. do a ‘drop’ instruction.

Assembly Language

The assembler uses a FORTH-like syntax, so labels are preceded by a colon (:). Calls are just the label. FORTH doesn’t have an explicit jump, so I used greater-than (>) followed by the label. Return is a semicolon (;). For example:

: foo ;                          # declares an empty subroutine "foo"
: bar foo foo ;                  # subroutine "bar" calls "foo" twice
: bar2 foo >foo                  # more efficient version of "bar"

The predication on jumps and calls is pretty fancy. It allows you to do these tests:

modifier condition
?fa never
?tr always
?lt ST0 < 0
?gt ST0 > 0
?eq ST0 == 0
?ne ST0 != 0
?le ST0 <= 0
?ge ST0 >= 0

A (-) suffix means also pop the stack. So ?ne- x looks at ST0, if it’s not zero decides to make the call, pops the stack, then either calls x or does a NOP.

For example, a loop that calls “foo” 666 times might look like:

        666                       # loop counter on top of stack, ST0
: loop
        foo                       # call foo
        -1 +                      # decrement loop counter
        ?ne >loop                 # ST0 not zero yet, do it again
        drop                      # Remove loop counter from stack

Saving Space

The machine can only hold 1K instructions, so code size is important. Here are things that I do to save space. Some of these are old hat, some are specific to the CPU.

  • If subroutine FOO’s last act is to call function BAR, then this code:

    : foo
            ...
            bar
            ;
    

can be rewritten:

: foo
        ...
        >bar
  • Use fall-through whenever possible. In the above example, if FOO immediately precedes BAR, then the code can just be:

    : foo
            ...
    : bar
    
  • The machine allows the RET bit to be set in any ALU-type instruction. (In fact, a return instruction (;) is just a NOP with the return bit set). This means that machine will do the subroutine return in parallel with the operation. The assembler syntax for this is just to append ‘;’ to the instruction. So:

    : bar
            drop
            ;
    

    can become:

    : bar

    drop;

  • Use subroutines whenever possible. Because a call is one instruction and the return is free, the break-even point for subroutines is low. In this code with two repeated instructions in it:

    . . .
    X
    Y
    . . .
    X
    Y
    

    factoring out the common code takes the same space - 4 instructions:

            . . .
            common
            . . .
            common
            . . .
    
    : common
            X
            Y;
    
If the common code is longer than 2 instructions or is called more than twice, then the transformation saves room.
  • On the other hand, subroutines that are called only once can be expanded inline, saving an instruction.

Some examples

Some frequently-used operations:

: 2dup                                  # a b -- a b a b
        over                            # a b a
        over;                           # a b a b

: 2drop                                 # a b --
        drop                            # a
        drop;                           #

: 1-    -1 +;                           # a -- a-1

: neg                                   # a -- -a
        not                             # ~a
                                        # and fall-through into...
: 1+    1 +;                            # a -- a+1

: -                                     # a b -- a-b
        neg                             # a -b
        +;                              # a-b

A subroutine that takes parameters a and b, and returns 2a+b:

: 2a+b                                  # a b -- 2a+b
        over                            # a b a
        +                               # a b+a
        +;                              # a+b+a

Maximum of two numbers:

: max                                   # a b -- max(a,b)
        2dup                            # a b a b
        -                               # a b a-b
        ?gt- >_drop                     # a
        nip;                            # b

: _drop drop;

A subroutine to calculate the greatest common divisor (GCD) using Euclid’s algorithm:

# GCD(m, n)
#     if (m == n)
#        return m
#     if (m > n)
#        return GCD(m - n, n)
#     else
#        return GCD(m, n - m)

With care, this can fit in 8 instructions:

: gcd                   # m n -- gcd(m,n)
        2dup            # m n m n
        -               # m n m-n
        ?eq >2drop      # if (m==n), return m
        ?gt- _swap      # if (m>n), swap them, so (m&lt;n)
        over            # m n m
        -               # m n-m
        >gcd            # repeat

: _swap swap;           # subroutine to do a swap

VHDL Implementation in an FPGA

The program counter is 10 bits wide, as are all memory address paths. Memory, busses and ALU are all 16-bits wide.

The call-return stack lives in a dedicated 16x10 RAM, and the topmost item on that stack is the program counter. The logic for doing a call or return is trivial: just increment or decrement the CSP.

The data stack is more complicated, because many instructions reference the top two items on the stack. If the whole stack was in main memory, this would require two reads. To avoid this, the VHDL implementation keeps top-of-stack (ST0) in a register. The data stack pointer SP actually points to the second item on the stack (ST1).

This makes stack operations a little bit fiddly. Pushing X on the stack means that the CPU has to

  • increment SP
  • write ST0 to the new SP (i.e. ST1)
  • copy X to ST0

and to pop an item off the stack:

  • read ST1
  • decrement SP
  • copy the old ST1 into ST0

A binary operation - like addition - requires these stages:

  • read ST1
  • add ST0 to ST1, result into ST0
  • decrement SP.

Execution is a 3-cycle process:

  • Fetch: current instruction from memory. Write back the incremented PC.
  • Read: depending on the instruction, this is either a read of ST1 from the data stack or from memory. Adjust SP and CSP according to the current instruction.
  • Write: depending on the instruction this can be a write to the data stack or to main memory.

There are four instruction formats, distinguished by their two high bits

00: ALU

15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 LOGIC IOW ALU STORE FETCH RET WRITE STACK
STACK
Change stack pointer at the end of the read cycle. 00 means no change, 01 means increment, 11 means decrement.
WRITE
If 1 then do a write-cycle write. The destination of the write is either ST1 or [ST1], depending on the STORE field. The data written is always ST0.
RET
add 1 to the Call Stack Pointer (CSP): do a subroutine return.
FETCH
If 0 then read-cycle read is from ST1. If 1 then the read is from memory address [ST0].
STORE
If 0 then the write-cycle write (if any) is to ST1. If 1 then the last cycle write is to memory address [ST1].
ALU

The three bit code controls the ALU function. If ST0 is top-of-stack and ST1 is the element under it, then the possible functions are:

000 ST1 + ST0
001 Result from LOGIC unit, see below
010 ST0 >> 1
011 Status word read
101 I/O read
110 I/O read
111 I/O read
IOW
Perform an I/O write on the last cycle.
LOGIC

The four bit code controls the result of the logic unit. The possible bitwise functions of the 16 bit values ST0 and ST1 are:

0000 CLEAR 0
0001 AND ST1 & ST0
0010 AND_REVERSE ST0 & !ST1
0011 COPY ST0
0100 AND_INVERTED !ST0 & ST1
0101 NOOP ST1
0110 XOR ST1 ^ ST0
0111 OR ST1 | ST0
1000 NOR !(ST1 | ST0)
1001 EQUIV !(ST1 ^ ST0)
1010 INVERT !ST1
1011 OR_REVERSE ST0 | !ST1
1100 COPY_INVERTED !ST0
1101 OR_INVERTED !ST0 | ST1
1110 LOGIC_NAND !(ST1 & ST0)
1111 SET 1

01: Push literal

15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 1 LITERAL

The 14-bit literal is sign extended to 16 bits and pushed on the stack.

10: Jump

15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 1 PRED DROP DEST

The 3 bit PRED field specifies the predicate: if it is true the CPU takes the jump.

000 false
001 ST0 < 0
010 ST0 == 0
011 ST0 <= 0
100 ST0 > 0
101 ST0 !- 0
110 ST0 >= 0
111 true

The DROP field is 1 if the CPU should pop the stack. Note that this pop happens after predicate evaluation, and before the jump.

11: call

15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
1 1 PRED DROP DEST

PRED and DROP fields are the same as for “jump” above.

How the assembler maps instructions to micro-ops

It’s pretty clear how the assembler translates call, jump, and load-immediate instructions to bit patterns. The remaining instructions all use the ALU; here’s a table that shows the mapping from mnemonic to microcode fields.

Opcode 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 LOGIC IOW ALU STORE FETCH RET WRITE STACK
dup 0 0 0011 (ST0) 0 001 (LOGIC) 0 0 0 1 01
drop 0 0 0101 (ST1) 0 001 (LOGIC) 0 0 0 0 11
over 0 0 0101 (ST1) 0 001 (LOGIC) 0 0 0 1 01
swap 0 0 0101 (ST1) 0 001 (LOGIC) 0 0 0 1 00
nip 0 0 0101 (ST0) 0 001 (LOGIC) 0 0 0 0 11
+ 0 0 0 000 (ADD) 0 0 0 0 01
not 0 0 1100 (!ST0) 0 001 (LOGIC) 0 0 0 0 00
and 0 0 0001 (ST0 & ST1) 0 001 (LOGIC) 0 0 0 0 01
or 0 0 0111 (ST0 | ST1) 0 001 (LOGIC) 0 0 0 0 01
xor 0 0 0110 (ST0 ^ ST1) 0 001 (LOGIC) 0 0 0 0 01
rshift 0 0 0 010 (RSHIFT1) 0 0 0 0 01
@ 0 0 0101 (ST1) 0 001 (LOGIC) 0 1 0 0 00
! 0 0 0101 (ST1) 0 001 (LOGIC) 1 0 0 1 11
in 0 0 0 101,110,111 (I/O) 0 0 0 1 01
out 0 0 0101 (ST1) 1 001 (LOGIC) 1 0 0 0 11
; 0 0 0011 (ST0) 0 001 (LOGIC) 0 0 1 0 00

More VHDL, Verilog and FPGA notes.