An output-stationary systolic-array TPU running on a Digilent Arty A7-35T FPGA. The NxN array multiplies signed int8 matrices on a 2N-1-cycle diagonal wavefront. A Python host tiles arbitrary-size matmuls over the fixed hardware tile via UART.
pip install -r requirements.txt # numpy, cocotb, pyserial
make test # run all three test suites
make test-sa # systolic array unit tests
make test-tpu # tpu_top integration tests
make test-tiling # arbitrary-size tiling tests| Item | Value |
|---|---|
| Board | Digilent Arty A7-35T |
| FPGA | Artix-7 xc7a35ticsg324-1L |
| Toolchain | Vivado 2024.2 |
| Host link | USB-UART 115200 8N1, /dev/ttyUSB1 |
make build-fpga # synth + impl -> build/uart_matmul.bit
make program-fpga # load over JTAG (volatile)
make flash-fpga # write to QSPI flash (persists across power cycles)
make fpga-selftest # compare FPGA matmuls against numpy
make fpga-mnist # MNIST digit inference on the FPGA (ASCII output)
make fpga-draw # draw a digit with the mouse, classified live on FPGAOverride port or Vivado path if needed:
PORT=/dev/ttyUSB0 make fpga-selftest
VIVADO=/opt/Xilinx/Vivado/2024.2/bin/vivado make build-fpga| Direction | Bytes | Layout |
|---|---|---|
| Host -> FPGA | 57 | 0xA5 sync + 7 A-slices x 4 B + 7 B-slices x 4 B (little-endian, lane0 = bits[7:0]) |
| FPGA -> Host | 64 | 16 result words x 4 B, row-major C[0][0]..C[3][3], signed |
arty-tpu/
|-- rtl/
| |-- pe.v Multiply-accumulate processing element
| |-- systolic_array.v NxN PE grid
| |-- bram.v Register file (sync write, async read)
| |-- tpu_controller.v FSM: IDLE->CLEAR->STREAM->FLUSH->WRITE->DONE
| |-- tpu_top.v Core: BRAM + array + controller
| |-- uart_rx.v 8N1 UART receiver
| |-- uart_tx.v 8N1 UART transmitter
| `-- uart_matmul.v FPGA top - UART shell wrapping tpu_top
|-- tiling/
| |-- skew.py Skew/pack helpers (shared by sim and hardware)
| |-- tpu_driver.py Cocotb async driver
| `-- tiling_engine.py Software tiler for arbitrary-shape matmuls
|-- host/
| |-- tpu_uart.py Serial driver + tiling + numpy self-test
| |-- mnist_fpga.py MNIST inference on the FPGA
| `-- mnist_draw.py Draw-a-digit GUI, classified live on the FPGA
|-- constr/
| `-- arty_a35t.xdc Pin constraints
|-- scripts/
| |-- build_arty.tcl Vivado build
| |-- program_arty.tcl JTAG program
| `-- flash_arty.tcl QSPI flash
|-- tests/
| |-- test_systolic_array.py
| |-- test_tpu.py
| `-- test_tiling.py
`-- docs/
`-- architecture.md
| Parameter | Default | Description |
|---|---|---|
N |
4 | Array dimension (NxN PEs) |
DATA_WIDTH |
8 | Input width in bits (signed) |
ACC_WIDTH |
32 | Accumulator width in bits |
The controller (tpu_controller.v) runs one tile through five FSM phases:
| Phase | Cycles | What happens |
|---|---|---|
CLEAR |
1 |
Zero the PE accumulators |
STREAM |
2N-1 |
Feed the diagonal-skewed input slices (the wavefront) |
FLUSH |
N |
Drain the pipeline so the last input reaches the far corner PE |
WRITE |
N^2 |
Copy the N² results out to BRAM_C, one word per cycle |
DONE |
1 |
Single-cycle done pulse |
| Total | N^2 + 3N + 1 |
29 cycles at N=4 |
The WRITE phase (N^2) dominates: the array finishes computing well before the serial result drain completes.