arty-tpu

An output-stationary systolic-array TPU for the Digilent Arty A7-35T FPGA. The NxN array multiplies signed int8 matrices by streaming diagonally skewed operands through the PEs as a 2N-1-cycle wavefront. A Python host splits arbitrary-size matmuls into fixed NxN hardware tiles and sends them over UART.
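As a rough illustration of the output-stationary dataflow, the sketch below models the array cycle by cycle in numpy. It is an idealized model, not a transcription of `systolic_array.v`: the function name `systolic_matmul`, the skew direction (row i of A and column i of B each delayed by i cycles), and the register timing are assumptions, so cycle counts need not match the RTL exactly.

```python
import numpy as np

def systolic_matmul(A, B):
    """Idealized cycle-level model of an N x N output-stationary array.

    Each PE (i, j) keeps its own accumulator for C[i][j]. A operands move
    one PE to the right per cycle, B operands move one PE down per cycle.
    """
    N = A.shape[0]
    acc = np.zeros((N, N), dtype=np.int32)    # one accumulator per PE
    a_reg = np.zeros((N, N), dtype=np.int32)  # A operand currently in each PE
    b_reg = np.zeros((N, N), dtype=np.int32)  # B operand currently in each PE

    # 2N-1 cycles of skewed input, then zeros until the far-corner PE is done.
    for t in range(3 * N - 2):
        # Shift operands one PE onward (copy avoids aliasing during the shift).
        a_reg[:, 1:] = a_reg[:, :-1].copy()
        b_reg[1:, :] = b_reg[:-1, :].copy()
        # Inject the skewed edge inputs: lane i lags lane 0 by i cycles.
        for i in range(N):
            k = t - i
            a_reg[i, 0] = A[i, k] if 0 <= k < N else 0
            b_reg[0, i] = B[k, i] if 0 <= k < N else 0
        # Every PE multiplies its two operands and accumulates.
        acc += a_reg * b_reg
    return acc
```

In this model PE (i, j) sees `A[i][k]` and `B[k][j]` together at cycle `t = i + j + k`, which is why the edge inputs are non-zero for exactly 2N-1 cycles.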

Simulation

pip install -r requirements.txt   # numpy, cocotb, pyserial
make test                         # run all three test suites
make test-sa                      # systolic array unit tests
make test-tpu                     # tpu_top integration tests
make test-tiling                  # arbitrary-size tiling tests

FPGA (Arty A7-35T)

Item	Value
Board	Digilent Arty A7-35T
FPGA	Artix-7 `xc7a35ticsg324-1L`
Toolchain	Vivado 2024.2
Host link	USB-UART 115200 8N1, `/dev/ttyUSB1`

make build-fpga      # synth + impl -> build/uart_matmul.bit
make program-fpga    # load over JTAG (volatile)
make flash-fpga      # write to QSPI flash (persists across power cycles)
make fpga-selftest   # compare FPGA matmuls against numpy
make fpga-mnist      # MNIST digit inference on the FPGA (ASCII output)
make fpga-draw       # draw a digit with the mouse, classified live on FPGA

Override port or Vivado path if needed:

PORT=/dev/ttyUSB0 make fpga-selftest
VIVADO=/opt/Xilinx/Vivado/2024.2/bin/vivado make build-fpga

UART protocol (N=4)

Direction	Bytes	Layout
Host -> FPGA	57	`0xA5` sync byte + 7 A-slices x 4 bytes + 7 B-slices x 4 bytes (each slice little-endian, lane0 = bits[7:0])
FPGA -> Host	64	16 result words x 4 bytes, row-major `C[0][0]..C[3][3]`, signed
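The frame layout in the table can be sketched in Python as below. The helper names (`skew_slices`, `pack_frame`, `unpack_result`) are hypothetical and are not taken from `tiling/skew.py` or `host/tpu_uart.py`. Two details are assumptions: the skew orientation (lane i carries row i of A, or column i of B, delayed by i slices) and little-endian byte order for the 32-bit result words, which the table does not state.

```python
import struct
import numpy as np

N = 4
SYNC = 0xA5

def skew_slices(M):
    """Return 2N-1 slices of N lanes; lane i carries M[i] delayed by i slices."""
    out = np.zeros((2 * N - 1, N), dtype=np.int8)
    for i in range(N):
        out[i:i + N, i] = M[i]
    return out

def pack_frame(A, B):
    """Build the 57-byte host -> FPGA frame for one 4x4 tile."""
    a_slices = skew_slices(A)      # rows of A, one per lane
    b_slices = skew_slices(B.T)    # columns of B, one per lane
    # Row-major tobytes() emits lane 0 first, i.e. lane0 in bits[7:0]
    # of each little-endian 4-byte slice.
    return bytes([SYNC]) + a_slices.tobytes() + b_slices.tobytes()

def unpack_result(payload):
    """Decode the 64-byte reply into a 4x4 int32 matrix (row-major)."""
    # Assumption: result words are little-endian signed 32-bit.
    words = struct.unpack("<16i", payload)
    return np.array(words, dtype=np.int32).reshape(N, N)
```

With pyserial, a round trip would then be roughly `ser.write(pack_frame(A, B))` followed by `unpack_result(ser.read(64))` on a port opened at 115200 baud.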

Repository layout

arty-tpu/
|-- rtl/
|   |-- pe.v               Multiply-accumulate processing element
|   |-- systolic_array.v   NxN PE grid
|   |-- bram.v             Register file (sync write, async read)
|   |-- tpu_controller.v   FSM: IDLE->CLEAR->STREAM->FLUSH->WRITE->DONE
|   |-- tpu_top.v          Core: BRAM + array + controller
|   |-- uart_rx.v          8N1 UART receiver
|   |-- uart_tx.v          8N1 UART transmitter
|   `-- uart_matmul.v      FPGA top - UART shell wrapping tpu_top
|-- tiling/
|   |-- skew.py            Skew/pack helpers (shared by sim and hardware)
|   |-- tpu_driver.py      Cocotb async driver
|   `-- tiling_engine.py   Software tiler for arbitrary-shape matmuls
|-- host/
|   |-- tpu_uart.py        Serial driver + tiling + numpy self-test
|   |-- mnist_fpga.py      MNIST inference on the FPGA
|   `-- mnist_draw.py      Draw-a-digit GUI, classified live on the FPGA
|-- constr/
|   `-- arty_a35t.xdc      Pin constraints
|-- scripts/
|   |-- build_arty.tcl     Vivado build
|   |-- program_arty.tcl   JTAG program
|   `-- flash_arty.tcl     QSPI flash
|-- tests/
|   |-- test_systolic_array.py
|   |-- test_tpu.py
|   `-- test_tiling.py
`-- docs/
    `-- architecture.md
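For orientation, software tiling of an arbitrary-shape matmul over a fixed NxN tile can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `tiling/tiling_engine.py`: `tiled_matmul` and `reference_tile` are hypothetical names, and `reference_tile` stands in for a hardware tile call by computing the tile product in numpy.

```python
import numpy as np

N = 4  # hardware tile dimension

def reference_tile(a, b):
    """Stand-in for one hardware tile: int8 N x N inputs, int32 N x N result."""
    return a.astype(np.int32) @ b.astype(np.int32)

def tiled_matmul(A, B, tile_fn=reference_tile):
    """Multiply int8 matrices of any compatible shape using N x N tiles."""
    M, K = A.shape
    K2, P = B.shape
    assert K == K2, "inner dimensions must match"

    # Round each dimension up to a multiple of N and zero-pad.
    Mp, Kp, Pp = (-(-d // N) * N for d in (M, K, P))
    Ap = np.zeros((Mp, Kp), dtype=np.int8)
    Bp = np.zeros((Kp, Pp), dtype=np.int8)
    Ap[:M, :K] = A
    Bp[:K, :P] = B

    # Each output tile is the sum of partial products along the K dimension;
    # the partial sums are accumulated on the host.
    C = np.zeros((Mp, Pp), dtype=np.int32)
    for i in range(0, Mp, N):
        for j in range(0, Pp, N):
            for k in range(0, Kp, N):
                C[i:i + N, j:j + N] += tile_fn(Ap[i:i + N, k:k + N],
                                               Bp[k:k + N, j:j + N])
    return C[:M, :P]  # drop the padding
```

Zero padding leaves the result unchanged because the padded rows and columns contribute only zero products.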

Parameters

Parameter	Default	Description
`N`	4	Array dimension (NxN PEs)
`DATA_WIDTH`	8	Input width in bits (signed)
`ACC_WIDTH`	32	Accumulator width in bits
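A quick headroom check for the default widths, as a sketch (the function name `max_safe_accumulations` is made up for illustration). It counts how many worst-case signed products fit in the accumulator before overflow becomes possible.

```python
def max_safe_accumulations(data_width=8, acc_width=32):
    """Number of worst-case products a signed accumulator can always hold."""
    # Largest product magnitude: (-2^(w-1))^2, e.g. (-128) * (-128) = 16384.
    max_product = (1 << (data_width - 1)) ** 2
    # Largest positive value of a signed accumulator, e.g. 2^31 - 1.
    acc_max = (1 << (acc_width - 1)) - 1
    return acc_max // max_product

print(max_safe_accumulations())  # 131071 for DATA_WIDTH=8, ACC_WIDTH=32
```

Since the accumulators are cleared at the start of each tile and a tile sums only N products per output, the default 32-bit accumulator has ample margin at N=4.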

Latency (one NxN tile)

The controller (`tpu_controller.v`) runs one tile through five FSM phases:

Phase	Cycles	What happens
`CLEAR`	`1`	Zero the PE accumulators
`STREAM`	`2N-1`	Feed the diagonal-skewed input slices (the wavefront)
`FLUSH`	`N`	Drain the pipeline so the last input reaches the far corner PE
`WRITE`	`N^2`	Copy the N^2 results out to BRAM_C, one word per cycle
`DONE`	`1`	Single-cycle `done` pulse
Total	`N^2 + 3N + 1`	29 cycles at N=4

The WRITE phase (N^2 cycles) dominates: at N=4 it takes 16 of the 29 cycles, so the array finishes computing well before the serial result drain completes.
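The phase table can be checked with a few lines of Python. The function name `tile_latency` is illustrative; the per-phase cycle counts are copied from the table above.

```python
def tile_latency(n):
    """Return (per-phase cycle counts, total cycles) for one n x n tile."""
    phases = {
        "CLEAR": 1,
        "STREAM": 2 * n - 1,
        "FLUSH": n,
        "WRITE": n * n,
        "DONE": 1,
    }
    return phases, sum(phases.values())

phases, total = tile_latency(4)
print(total)                    # 29
print(phases["WRITE"] / total)  # fraction of the tile spent in WRITE
```

Summing the phases gives 1 + (2N-1) + N + N^2 + 1 = N^2 + 3N + 1, and the WRITE share grows with N because it is the only quadratic term.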
