Note

I wrote this page in 2003, and a few people over the years have taken the source and made their own CPUs. The text has been reformatted, some typos fixed, but apart from that it is just as it appeared all those years ago. This was an early attempt: *The J1 Forth CPU* is a cleaner design and a faster core.

I’ve been playing around with small computers - PICs, Scenix, Rabbits - for a few years now, and they never seemed to do quite what I wanted. So I decided to design my own small CPU, and implement it in an FPGA. I aimed low, so it’s pretty simple. Some highlights:

- Stack-based 16-bit CPU
- 1024 words of general purpose RAM
- Fits in 140 slices in Xilinx Spartan II FPGA (the smallest device in the family is the $8 XC2S15, 196 slices)
- 22 MIPS with a 66MHz basic clock

Some of the many things it doesn’t have:

- registers
- fancy instructions like multiply, divide, left-shift or subtract
- interrupts

It actually can run real applications - a couple of things I’ve done with it so far:

- controller for an in-car MP3 player (~900 instructions)
- TCP/IP node (~700 instructions)

All the source code is in this directory.

- VHDL source for the CPU is in two parts:
xpu_header.incandxpu_behavioral.inc.- The source for the assembler - it’s a single lex input grammar - is
asm.l.- An example VHDL file that uses the CPU is
c2a.vhd, and source code for the CPU is inxpumain.sl.

There are two seperate stacks: one for data and one for call-return. Both stacks are 16 levels deep.

I followed a FORTH naming convention for the instructions:

Stack manipulation:

dup # a -- a a drop # a -- over # a b -- a b a swap # a b -- b a nip # a b -- bArithmetic:

+ # a b -- (a+b) not # a -- ~a and # a b -- (a&b) or # a b -- (a|b) xor # a b -- (a^b) rshift1 # a -- (a>>1)Memory load and store:

@ # a -- mem[a] ! # a b -- (a written to mem[b])I/O:

in # -- a out # a b --Push immediate:

N # -- N

These are the control flow instructions:

- jump
- call
- return

Both `jump` and `call` can be predicated (that is, conditionally
executed) depending on the value on top of the stack. Also, a jump or
call can optionally pop the stack, i.e. do a ‘drop’ instruction.

The assembler uses a FORTH-like syntax, so labels are preceded by a colon (:). Calls are just the label. FORTH doesn’t have an explicit jump, so I used greater-than (>) followed by the label. Return is a semicolon (;). For example:

```
: foo ; # declares an empty subroutine "foo"
: bar foo foo ; # subroutine "bar" calls "foo" twice
: bar2 foo >foo # more efficient version of "bar"
```

The predication on jumps and calls is pretty fancy. It allows you to do these tests:

modifier | condition |
---|---|

?fa | never |

?tr | always |

?lt | ST0 < 0 |

?gt | ST0 > 0 |

?eq | ST0 == 0 |

?ne | ST0 != 0 |

?le | ST0 <= 0 |

?ge | ST0 >= 0 |

A (-) suffix means also pop the stack. So `?ne- x` looks at
ST0, if it’s not zero decides to make the call, pops the
stack, then either calls `x` or does a NOP.

For example, a loop that calls “foo” 666 times might look like:

```
666 # loop counter on top of stack, ST0
: loop
foo # call foo
-1 + # decrement loop counter
?ne >loop # ST0 not zero yet, do it again
drop # Remove loop counter from stack
```

The machine can only hold 1K instructions, so code size is important. Here are things that I do to save space. Some of these are old hat, some are specific to the CPU.

If subroutine FOO’s last act is to call function BAR, then this code:

: foo ... bar ;can be rewritten:

: foo ... >bar

Use fall-through whenever possible. In the above example, if FOO immediately precedes BAR, then the code can just be:

: foo ... : barThe machine allows the RET bit to be set in any ALU-type instruction. (In fact, a return instruction (;) is just a NOP with the return bit set). This means that machine will do the subroutine return in parallel with the operation. The assembler syntax for this is just to append ‘;’ to the instruction. So:

: bar drop ;can become:

- : bar
drop;

Use subroutines whenever possible. Because a call is one instruction and the return is free, the break-even point for subroutines is low. In this code with two repeated instructions in it:

. . . X Y . . . X Yfactoring out the common code takes the same space - 4 instructions:

. . . common . . . common . . . : common X Y;If the common code is longer than 2 instructions or is called more than twice, then the transformation saves room.

- On the other hand, subroutines that are called only once can be expanded inline, saving an instruction.

Some frequently-used operations:

```
: 2dup # a b -- a b a b
over # a b a
over; # a b a b
: 2drop # a b --
drop # a
drop; #
: 1- -1 +; # a -- a-1
: neg # a -- -a
not # ~a
# and fall-through into...
: 1+ 1 +; # a -- a+1
: - # a b -- a-b
neg # a -b
+; # a-b
```

A subroutine that takes parameters a and b, and returns 2a+b:

```
: 2a+b # a b -- 2a+b
over # a b a
+ # a b+a
+; # a+b+a
```

Maximum of two numbers:

```
: max # a b -- max(a,b)
2dup # a b a b
- # a b a-b
?gt- >_drop # a
nip; # b
: _drop drop;
```

A subroutine to calculate the greatest common divisor (GCD) using Euclid’s algorithm:

```
# GCD(m, n)
# if (m == n)
# return m
# if (m > n)
# return GCD(m - n, n)
# else
# return GCD(m, n - m)
```

With care, this can fit in 8 instructions:

```
: gcd # m n -- gcd(m,n)
2dup # m n m n
- # m n m-n
?eq >2drop # if (m==n), return m
?gt- _swap # if (m>n), swap them, so (m<n)
over # m n m
- # m n-m
>gcd # repeat
: _swap swap; # subroutine to do a swap
```

The program counter is 10 bits wide, as are all memory address paths. Memory, busses and ALU are all 16-bits wide.

The call-return stack lives in a dedicated 16x10 RAM, and the topmost item on that stack is the program counter. The logic for doing a call or return is trivial: just increment or decrement the CSP.

The data stack is more complicated, because many instructions reference the top two items on the stack. If the whole stack was in main memory, this would require two reads. To avoid this, the VHDL implementation keeps top-of-stack (ST0) in a register. The data stack pointer SP actually points to the second item on the stack (ST1).

This makes stack operations a little bit fiddly. Pushing X on the stack means that the CPU has to

- increment SP
- write ST0 to the new SP (i.e. ST1)
- copy X to ST0

and to pop an item off the stack:

- read ST1
- decrement SP
- copy the old ST1 into ST0

A binary operation - like addition - requires these stages:

- read ST1
- add ST0 to ST1, result into ST0
- decrement SP.

Execution is a 3-cycle process:

- Fetch: current instruction from memory. Write back the incremented PC.
- Read: depending on the instruction, this is either a read of ST1 from the data stack or from memory. Adjust SP and CSP according to the current instruction.
- Write: depending on the instruction this can be a write to the data stack or to main memory.

There are four instruction formats, distinguished by their two high bits

15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

0 | 0 | LOGIC | IOW | ALU | STORE | FETCH | RET | WRITE | STACK |

- STACK
- Change stack pointer at the end of the read cycle. 00 means no change, 01 means increment, 11 means decrement.
- WRITE
- If 1 then do a write-cycle write. The destination of the write is either ST1 or [ST1], depending on the STORE field. The data written is always ST0.
- RET
- add 1 to the Call Stack Pointer (CSP): do a subroutine return.
- FETCH
- If 0 then read-cycle read is from ST1. If 1 then the read is from memory address [ST0].
- STORE
- If 0 then the write-cycle write (if any) is to ST1. If 1 then the last cycle write is to memory address [ST1].
- ALU
The three bit code controls the ALU function. If ST0 is top-of-stack and ST1 is the element under it, then the possible functions are:

000 ST1 + ST0 001 Result from LOGIC unit, see below 010 ST0 >> 1 011 Status word read 101 I/O read 110 I/O read 111 I/O read - IOW
- Perform an I/O write on the last cycle.
- LOGIC
The four bit code controls the result of the logic unit. The possible bitwise functions of the 16 bit values ST0 and ST1 are:

0000 CLEAR 0 0001 AND ST1 & ST0 0010 AND_REVERSE ST0 & !ST1 0011 COPY ST0 0100 AND_INVERTED !ST0 & ST1 0101 NOOP ST1 0110 XOR ST1 ^ ST0 0111 OR ST1 | ST0 1000 NOR !(ST1 | ST0) 1001 EQUIV !(ST1 ^ ST0) 1010 INVERT !ST1 1011 OR_REVERSE ST0 | !ST1 1100 COPY_INVERTED !ST0 1101 OR_INVERTED !ST0 | ST1 1110 LOGIC_NAND !(ST1 & ST0) 1111 SET 1

15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

0 | 1 | LITERAL |

The 14-bit literal is sign extended to 16 bits and pushed on the stack.

15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

0 | 1 | PRED | DROP | DEST |

The 3 bit PRED field specifies the predicate: if it is true the CPU takes the jump.

000 | false |

001 | ST0 < 0 |

010 | ST0 == 0 |

011 | ST0 <= 0 |

100 | ST0 > 0 |

101 | ST0 !- 0 |

110 | ST0 >= 0 |

111 | true |

The DROP field is 1 if the CPU should pop the stack. Note that this pop happens after predicate evaluation, and before the jump.

15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | 1 | PRED | DROP | DEST |

PRED and DROP fields are the same as for “jump” above.

It’s pretty clear how the assembler translates call, jump, and load-immediate instructions to bit patterns. The remaining instructions all use the ALU; here’s a table that shows the mapping from mnemonic to microcode fields.

Opcode | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

0 | 0 | LOGIC | IOW | ALU | STORE | FETCH | RET | WRITE | STACK | |||||||

dup | 0 | 0 | 0011 (ST0) | 0 | 001 (LOGIC) | 0 | 0 | 0 | 1 | 01 | ||||||

drop | 0 | 0 | 0101 (ST1) | 0 | 001 (LOGIC) | 0 | 0 | 0 | 0 | 11 | ||||||

over | 0 | 0 | 0101 (ST1) | 0 | 001 (LOGIC) | 0 | 0 | 0 | 1 | 01 | ||||||

swap | 0 | 0 | 0101 (ST1) | 0 | 001 (LOGIC) | 0 | 0 | 0 | 1 | 00 | ||||||

nip | 0 | 0 | 0101 (ST0) | 0 | 001 (LOGIC) | 0 | 0 | 0 | 0 | 11 | ||||||

+ | 0 | 0 | 0 | 000 (ADD) | 0 | 0 | 0 | 0 | 01 | |||||||

not | 0 | 0 | 1100 (!ST0) | 0 | 001 (LOGIC) | 0 | 0 | 0 | 0 | 00 | ||||||

and | 0 | 0 | 0001 (ST0 & ST1) | 0 | 001 (LOGIC) | 0 | 0 | 0 | 0 | 01 | ||||||

or | 0 | 0 | 0111 (ST0 | ST1) | 0 | 001 (LOGIC) | 0 | 0 | 0 | 0 | 01 | ||||||

xor | 0 | 0 | 0110 (ST0 ^ ST1) | 0 | 001 (LOGIC) | 0 | 0 | 0 | 0 | 01 | ||||||

rshift | 0 | 0 | 0 | 010 (RSHIFT1) | 0 | 0 | 0 | 0 | 01 | |||||||

@ | 0 | 0 | 0101 (ST1) | 0 | 001 (LOGIC) | 0 | 1 | 0 | 0 | 00 | ||||||

! | 0 | 0 | 0101 (ST1) | 0 | 001 (LOGIC) | 1 | 0 | 0 | 1 | 11 | ||||||

in | 0 | 0 | 0 | 101,110,111 (I/O) | 0 | 0 | 0 | 1 | 01 | |||||||

out | 0 | 0 | 0101 (ST1) | 1 | 001 (LOGIC) | 1 | 0 | 0 | 0 | 11 | ||||||

; | 0 | 0 | 0011 (ST0) | 0 | 001 (LOGIC) | 0 | 0 | 1 | 0 | 00 |