Note
I wrote this page in 2003, and a few people over the years have taken the source and made their own CPUs. The text has been reformatted, some typos fixed, but apart from that it is just as it appeared all those years ago. This was an early attempt: The J1 Forth CPU is a cleaner design and a faster core.
I’ve been playing around with small computers - PICs, Scenix, Rabbits - for a few years now, and they never seemed to do quite what I wanted. So I decided to design my own small CPU, and implement it in an FPGA. I aimed low, so it’s pretty simple. Some highlights:
- Stack-based 16-bit CPU
- 1024 words of general purpose RAM
- Fits in 140 slices in Xilinx Spartan II FPGA (the smallest device in the family is the $8 XC2S15, 196 slices)
- 22 MIPS with a 66MHz basic clock
Some of the many things it doesn’t have:
- registers
- fancy instructions like multiply, divide, left-shift or subtract
- interrupts
It actually can run real applications - a couple of things I’ve done with it so far:
- controller for an in-car MP3 player (~900 instructions)
- TCP/IP node (~700 instructions)
All the source code is in this directory.
- VHDL source for the CPU is in two parts: xpu_header.inc and xpu_behavioral.inc.
- The source for the assembler - it’s a single lex input grammar - is asm.l.
- An example VHDL file that uses the CPU is c2a.vhd, and source code for the CPU is in xpumain.sl.
There are two seperate stacks: one for data and one for call-return. Both stacks are 16 levels deep.
I followed a FORTH naming convention for the instructions:
Stack manipulation:
dup # a -- a a drop # a -- over # a b -- a b a swap # a b -- b a nip # a b -- bArithmetic:
+ # a b -- (a+b) not # a -- ~a and # a b -- (a&b) or # a b -- (a|b) xor # a b -- (a^b) rshift1 # a -- (a>>1)Memory load and store:
@ # a -- mem[a] ! # a b -- (a written to mem[b])I/O:
in # -- a out # a b --Push immediate:
N # -- N
These are the control flow instructions:
- jump
- call
- return
Both jump and call can be predicated (that is, conditionally executed) depending on the value on top of the stack. Also, a jump or call can optionally pop the stack, i.e. do a ‘drop’ instruction.
The assembler uses a FORTH-like syntax, so labels are preceded by a colon (:). Calls are just the label. FORTH doesn’t have an explicit jump, so I used greater-than (>) followed by the label. Return is a semicolon (;). For example:
: foo ; # declares an empty subroutine "foo"
: bar foo foo ; # subroutine "bar" calls "foo" twice
: bar2 foo >foo # more efficient version of "bar"
The predication on jumps and calls is pretty fancy. It allows you to do these tests:
modifier | condition |
---|---|
?fa | never |
?tr | always |
?lt | ST0 < 0 |
?gt | ST0 > 0 |
?eq | ST0 == 0 |
?ne | ST0 != 0 |
?le | ST0 <= 0 |
?ge | ST0 >= 0 |
A (-) suffix means also pop the stack. So ?ne- x looks at ST0, if it’s not zero decides to make the call, pops the stack, then either calls x or does a NOP.
For example, a loop that calls “foo” 666 times might look like:
666 # loop counter on top of stack, ST0
: loop
foo # call foo
-1 + # decrement loop counter
?ne >loop # ST0 not zero yet, do it again
drop # Remove loop counter from stack
The machine can only hold 1K instructions, so code size is important. Here are things that I do to save space. Some of these are old hat, some are specific to the CPU.
If subroutine FOO’s last act is to call function BAR, then this code:
: foo ... bar ;can be rewritten:
: foo ... >bar
Use fall-through whenever possible. In the above example, if FOO immediately precedes BAR, then the code can just be:
: foo ... : barThe machine allows the RET bit to be set in any ALU-type instruction. (In fact, a return instruction (;) is just a NOP with the return bit set). This means that machine will do the subroutine return in parallel with the operation. The assembler syntax for this is just to append ‘;’ to the instruction. So:
: bar drop ;can become:
- : bar
drop;
Use subroutines whenever possible. Because a call is one instruction and the return is free, the break-even point for subroutines is low. In this code with two repeated instructions in it:
. . . X Y . . . X Yfactoring out the common code takes the same space - 4 instructions:
. . . common . . . common . . . : common X Y;If the common code is longer than 2 instructions or is called more than twice, then the transformation saves room.
- On the other hand, subroutines that are called only once can be expanded inline, saving an instruction.
Some frequently-used operations:
: 2dup # a b -- a b a b
over # a b a
over; # a b a b
: 2drop # a b --
drop # a
drop; #
: 1- -1 +; # a -- a-1
: neg # a -- -a
not # ~a
# and fall-through into...
: 1+ 1 +; # a -- a+1
: - # a b -- a-b
neg # a -b
+; # a-b
A subroutine that takes parameters a and b, and returns 2a+b:
: 2a+b # a b -- 2a+b
over # a b a
+ # a b+a
+; # a+b+a
Maximum of two numbers:
: max # a b -- max(a,b)
2dup # a b a b
- # a b a-b
?gt- >_drop # a
nip; # b
: _drop drop;
A subroutine to calculate the greatest common divisor (GCD) using Euclid’s algorithm:
# GCD(m, n)
# if (m == n)
# return m
# if (m > n)
# return GCD(m - n, n)
# else
# return GCD(m, n - m)
With care, this can fit in 8 instructions:
: gcd # m n -- gcd(m,n)
2dup # m n m n
- # m n m-n
?eq >2drop # if (m==n), return m
?gt- _swap # if (m>n), swap them, so (m<n)
over # m n m
- # m n-m
>gcd # repeat
: _swap swap; # subroutine to do a swap
The program counter is 10 bits wide, as are all memory address paths. Memory, busses and ALU are all 16-bits wide.
The call-return stack lives in a dedicated 16x10 RAM, and the topmost item on that stack is the program counter. The logic for doing a call or return is trivial: just increment or decrement the CSP.
The data stack is more complicated, because many instructions reference the top two items on the stack. If the whole stack was in main memory, this would require two reads. To avoid this, the VHDL implementation keeps top-of-stack (ST0) in a register. The data stack pointer SP actually points to the second item on the stack (ST1).
This makes stack operations a little bit fiddly. Pushing X on the stack means that the CPU has to
- increment SP
- write ST0 to the new SP (i.e. ST1)
- copy X to ST0
and to pop an item off the stack:
- read ST1
- decrement SP
- copy the old ST1 into ST0
A binary operation - like addition - requires these stages:
- read ST1
- add ST0 to ST1, result into ST0
- decrement SP.
Execution is a 3-cycle process:
- Fetch: current instruction from memory. Write back the incremented PC.
- Read: depending on the instruction, this is either a read of ST1 from the data stack or from memory. Adjust SP and CSP according to the current instruction.
- Write: depending on the instruction this can be a write to the data stack or to main memory.
There are four instruction formats, distinguished by their two high bits
15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | LOGIC | IOW | ALU | STORE | FETCH | RET | WRITE | STACK |
The three bit code controls the ALU function. If ST0 is top-of-stack and ST1 is the element under it, then the possible functions are:
000 | ST1 + ST0 |
001 | Result from LOGIC unit, see below |
010 | ST0 >> 1 |
011 | Status word read |
101 | I/O read |
110 | I/O read |
111 | I/O read |
The four bit code controls the result of the logic unit. The possible bitwise functions of the 16 bit values ST0 and ST1 are:
0000 | CLEAR | 0 |
0001 | AND | ST1 & ST0 |
0010 | AND_REVERSE | ST0 & !ST1 |
0011 | COPY | ST0 |
0100 | AND_INVERTED | !ST0 & ST1 |
0101 | NOOP | ST1 |
0110 | XOR | ST1 ^ ST0 |
0111 | OR | ST1 | ST0 |
1000 | NOR | !(ST1 | ST0) |
1001 | EQUIV | !(ST1 ^ ST0) |
1010 | INVERT | !ST1 |
1011 | OR_REVERSE | ST0 | !ST1 |
1100 | COPY_INVERTED | !ST0 |
1101 | OR_INVERTED | !ST0 | ST1 |
1110 | LOGIC_NAND | !(ST1 & ST0) |
1111 | SET | 1 |
15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | LITERAL |
The 14-bit literal is sign extended to 16 bits and pushed on the stack.
15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | PRED | DROP | DEST |
The 3 bit PRED field specifies the predicate: if it is true the CPU takes the jump.
000 | false |
001 | ST0 < 0 |
010 | ST0 == 0 |
011 | ST0 <= 0 |
100 | ST0 > 0 |
101 | ST0 !- 0 |
110 | ST0 >= 0 |
111 | true |
The DROP field is 1 if the CPU should pop the stack. Note that this pop happens after predicate evaluation, and before the jump.
15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | PRED | DROP | DEST |
PRED and DROP fields are the same as for “jump” above.
It’s pretty clear how the assembler translates call, jump, and load-immediate instructions to bit patterns. The remaining instructions all use the ALU; here’s a table that shows the mapping from mnemonic to microcode fields.
Opcode | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | LOGIC | IOW | ALU | STORE | FETCH | RET | WRITE | STACK | |||||||
dup | 0 | 0 | 0011 (ST0) | 0 | 001 (LOGIC) | 0 | 0 | 0 | 1 | 01 | ||||||
drop | 0 | 0 | 0101 (ST1) | 0 | 001 (LOGIC) | 0 | 0 | 0 | 0 | 11 | ||||||
over | 0 | 0 | 0101 (ST1) | 0 | 001 (LOGIC) | 0 | 0 | 0 | 1 | 01 | ||||||
swap | 0 | 0 | 0101 (ST1) | 0 | 001 (LOGIC) | 0 | 0 | 0 | 1 | 00 | ||||||
nip | 0 | 0 | 0101 (ST0) | 0 | 001 (LOGIC) | 0 | 0 | 0 | 0 | 11 | ||||||
+ | 0 | 0 | 0 | 000 (ADD) | 0 | 0 | 0 | 0 | 01 | |||||||
not | 0 | 0 | 1100 (!ST0) | 0 | 001 (LOGIC) | 0 | 0 | 0 | 0 | 00 | ||||||
and | 0 | 0 | 0001 (ST0 & ST1) | 0 | 001 (LOGIC) | 0 | 0 | 0 | 0 | 01 | ||||||
or | 0 | 0 | 0111 (ST0 | ST1) | 0 | 001 (LOGIC) | 0 | 0 | 0 | 0 | 01 | ||||||
xor | 0 | 0 | 0110 (ST0 ^ ST1) | 0 | 001 (LOGIC) | 0 | 0 | 0 | 0 | 01 | ||||||
rshift | 0 | 0 | 0 | 010 (RSHIFT1) | 0 | 0 | 0 | 0 | 01 | |||||||
@ | 0 | 0 | 0101 (ST1) | 0 | 001 (LOGIC) | 0 | 1 | 0 | 0 | 00 | ||||||
! | 0 | 0 | 0101 (ST1) | 0 | 001 (LOGIC) | 1 | 0 | 0 | 1 | 11 | ||||||
in | 0 | 0 | 0 | 101,110,111 (I/O) | 0 | 0 | 0 | 1 | 01 | |||||||
out | 0 | 0 | 0101 (ST1) | 1 | 001 (LOGIC) | 1 | 0 | 0 | 0 | 11 | ||||||
; | 0 | 0 | 0011 (ST0) | 0 | 001 (LOGIC) | 0 | 0 | 1 | 0 | 00 |