arty-tpu

An output-stationary systolic-array TPU running on a Digilent Arty A7-35T FPGA. The NxN array multiplies signed int8 matrices on a 2N-1-cycle diagonal wavefront. A Python host tiles arbitrary-size matmuls over the fixed hardware tile via UART.
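
The dataflow described above can be illustrated with a small behavioral model. This is a minimal sketch in plain Python, not the repository's RTL or its skew.py: the function names `skew` and `systolic_matmul`, the register placement, and the choice of which lane is delayed by how many cycles are all assumptions made for illustration. Each PE(i, j) keeps its own accumulator (output-stationary), rows of A enter from the left, and columns of B enter from the top.

```python
def skew(lanes):
    """Return 2n-1 input slices; lane i is delayed by i cycles (zeros elsewhere)."""
    n = len(lanes)
    return [[lanes[i][t - i] if 0 <= t - i < n else 0 for i in range(n)]
            for t in range(2 * n - 1)]


def systolic_matmul(A, B):
    """Behavioral model of an n x n output-stationary array computing A @ B."""
    n = len(A)
    a_slices = skew(A)                           # lane i = row i of A (from the left)
    b_slices = skew([list(c) for c in zip(*B)])  # lane j = column j of B (from the top)
    zeros = [0] * n
    a_reg = [[0] * n for _ in range(n)]          # A operand held in each PE
    b_reg = [[0] * n for _ in range(n)]          # B operand held in each PE
    acc = [[0] * n for _ in range(n)]            # stationary accumulators, one per PE
    # 2n-1 cycles of input, then n-1 more so the last operands reach PE(n-1, n-1).
    for t in range(3 * n - 2):
        a_in = a_slices[t] if t < len(a_slices) else zeros
        b_in = b_slices[t] if t < len(b_slices) else zeros
        a_reg = [[a_in[i]] + a_reg[i][:-1] for i in range(n)]  # shift right
        b_reg = [list(b_in)] + b_reg[:-1]                      # shift down
        for i in range(n):
            for j in range(n):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc
```

In this model PE(i, j) sees A[i][k] and B[k][j] together at cycle k + i + j, which is the diagonal wavefront. The model needs 3N-2 cycles in total, which fits inside the STREAM plus FLUSH budget given in the latency table.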


Simulation

pip install -r requirements.txt   # numpy, cocotb, pyserial
make test                         # run all three test suites
make test-sa                      # systolic array unit tests
make test-tpu                     # tpu_top integration tests
make test-tiling                  # arbitrary-size tiling tests

FPGA (Arty A7-35T)

| Item | Value |
| --- | --- |
| Board | Digilent Arty A7-35T |
| FPGA | Artix-7 xc7a35ticsg324-1L |
| Toolchain | Vivado 2024.2 |
| Host link | USB-UART 115200 8N1, /dev/ttyUSB1 |

make build-fpga      # synth + impl -> build/uart_matmul.bit
make program-fpga    # load over JTAG (volatile)
make flash-fpga      # write to QSPI flash (persists across power cycles)
make fpga-selftest   # compare FPGA matmuls against numpy
make fpga-mnist      # MNIST digit inference on the FPGA (ASCII output)
make fpga-draw       # draw a digit with the mouse, classified live on FPGA

Override port or Vivado path if needed:

PORT=/dev/ttyUSB0 make fpga-selftest
VIVADO=/opt/Xilinx/Vivado/2024.2/bin/vivado make build-fpga

UART protocol (N=4)

| Direction | Bytes | Layout |
| --- | --- | --- |
| Host -> FPGA | 57 | 0xA5 sync + 7 A-slices x 4 B + 7 B-slices x 4 B (little-endian, lane0 = bits[7:0]) |
| FPGA -> Host | 64 | 16 result words x 4 B, row-major C[0][0]..C[3][3], signed |
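
The frame layout above can be sketched with the standard `struct` module. This is an illustrative sketch, not the code in host/tpu_uart.py: the helper names `skew`, `pack_request`, and `unpack_response` are hypothetical, the skew order (lane i delayed by i cycles) is an assumption, and the response words are assumed to be little-endian like the request, which the table does not state explicitly.

```python
import struct

N = 4          # array dimension the protocol table is written for
SYNC = 0xA5    # sync byte from the protocol table


def skew(lanes):
    """2N-1 slices of N int8 lanes; lane i delayed by i cycles (assumed order)."""
    return [[lanes[i][t - i] if 0 <= t - i < N else 0 for i in range(N)]
            for t in range(2 * N - 1)]


def pack_request(A, B):
    """Build the 57-byte host -> FPGA frame: sync, then A-slices, then B-slices."""
    a_slices = skew(A)                           # lanes are rows of A
    b_slices = skew([list(c) for c in zip(*B)])  # lanes are columns of B
    frame = bytearray([SYNC])
    for lanes in a_slices + b_slices:
        # Four signed bytes, lane0 first, i.e. lane0 = bits[7:0] of a
        # little-endian 32-bit word.
        frame += struct.pack(f"<{N}b", *lanes)
    return bytes(frame)


def unpack_response(data):
    """Decode 16 signed 32-bit words (assumed little-endian) into row-major C."""
    words = struct.unpack(f"<{N * N}i", data)
    return [list(words[r * N:(r + 1) * N]) for r in range(N)]
```

With N=4 the request is 1 + 7*4 + 7*4 = 57 bytes and the response is 16*4 = 64 bytes, matching the table. Sending the frame over a serial port is left out so that the sketch runs without hardware.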

Repository layout

arty-tpu/
|-- rtl/
|   |-- pe.v               Multiply-accumulate processing element
|   |-- systolic_array.v   NxN PE grid
|   |-- bram.v             Register file (sync write, async read)
|   |-- tpu_controller.v   FSM: IDLE->CLEAR->STREAM->FLUSH->WRITE->DONE
|   |-- tpu_top.v          Core: BRAM + array + controller
|   |-- uart_rx.v          8N1 UART receiver
|   |-- uart_tx.v          8N1 UART transmitter
|   `-- uart_matmul.v      FPGA top - UART shell wrapping tpu_top
|-- tiling/
|   |-- skew.py            Skew/pack helpers (shared by sim and hardware)
|   |-- tpu_driver.py      Cocotb async driver
|   `-- tiling_engine.py   Software tiler for arbitrary-shape matmuls
|-- host/
|   |-- tpu_uart.py        Serial driver + tiling + numpy self-test
|   |-- mnist_fpga.py      MNIST inference on the FPGA
|   `-- mnist_draw.py      Draw-a-digit GUI, classified live on the FPGA
|-- constr/
|   `-- arty_a35t.xdc      Pin constraints
|-- scripts/
|   |-- build_arty.tcl     Vivado build
|   |-- program_arty.tcl   JTAG program
|   `-- flash_arty.tcl     QSPI flash
|-- tests/
|   |-- test_systolic_array.py
|   |-- test_tpu.py
|   `-- test_tiling.py
`-- docs/
    `-- architecture.md
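
The software tiling that tiling_engine.py and tpu_uart.py are described as providing can be sketched as follows. This is a hedged illustration of the general technique, not the repository's implementation: `get_tile`, `tiled_matmul`, and the stand-in `tile_matmul_ref` are hypothetical names, and accumulating partial products on the host across the inner dimension is an assumption (it is consistent with the controller clearing the PE accumulators at the start of every tile).

```python
N = 4  # fixed hardware tile size


def tile_matmul_ref(a_tile, b_tile):
    """Stand-in for one hardware tile operation: N x N times N x N."""
    return [[sum(a_tile[i][k] * b_tile[k][j] for k in range(N))
             for j in range(N)] for i in range(N)]


def get_tile(M, r0, c0):
    """N x N block of M starting at (r0, c0), zero-padded past the edges."""
    rows, cols = len(M), len(M[0])
    return [[M[r][c] if r < rows and c < cols else 0
             for c in range(c0, c0 + N)] for r in range(r0, r0 + N)]


def tiled_matmul(A, B, tile_op=tile_matmul_ref):
    """Multiply (m x k) by (k x n) using only N x N tile products."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0] * n for _ in range(m)]
    for i0 in range(0, m, N):
        for j0 in range(0, n, N):
            for k0 in range(0, k, N):
                part = tile_op(get_tile(A, i0, k0), get_tile(B, k0, j0))
                # Accumulate on the host, discarding the padded rows and columns.
                for i in range(i0, min(i0 + N, m)):
                    for j in range(j0, min(j0 + N, n)):
                        C[i][j] += part[i - i0][j - j0]
    return C
```

Zero padding leaves the valid part of each product unchanged, so shapes that are not multiples of N need no special handling. Passing a function that talks to the FPGA as `tile_op` would reuse the same loop nest for hardware runs.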

Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| N | 4 | Array dimension (NxN PEs) |
| DATA_WIDTH | 8 | Input width in bits (signed) |
| ACC_WIDTH | 32 | Accumulator width in bits |

Latency (one NxN tile)

The controller (tpu_controller.v) runs one tile through five FSM phases:

| Phase | Cycles | What happens |
| --- | --- | --- |
| CLEAR | 1 | Zero the PE accumulators |
| STREAM | 2N-1 | Feed the diagonal-skewed input slices (the wavefront) |
| FLUSH | N | Drain the pipeline so the last input reaches the far corner PE |
| WRITE | N^2 | Copy the N^2 results out to BRAM_C, one word per cycle |
| DONE | 1 | Single-cycle done pulse |
| Total | N^2 + 3N + 1 | 29 cycles at N=4 |

The WRITE phase (N^2 cycles) dominates: at N=4 it accounts for 16 of the 29 cycles, so the array finishes computing well before the serial result drain completes.
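
The cycle budget can be checked with a few lines of Python, and compared against the time the UART link needs per tile. This is a back-of-the-envelope sketch: `tile_cycles` and `uart_seconds` are hypothetical helpers, and the UART estimate assumes bytes are sent back to back at 10 bits per byte (1 start, 8 data, 1 stop for 8N1) with no idle gaps.

```python
def tile_cycles(n=4):
    """Per-phase and total cycle counts for one n x n tile, per the phase table."""
    phases = {
        "CLEAR": 1,
        "STREAM": 2 * n - 1,
        "FLUSH": n,
        "WRITE": n * n,
        "DONE": 1,
    }
    return phases, sum(phases.values())


def uart_seconds(n_bytes, baud=115200, bits_per_byte=10):
    """Minimum wire time for n_bytes at 8N1, assuming no gaps between bytes."""
    return n_bytes * bits_per_byte / baud


phases, total = tile_cycles(4)
write_share = phases["WRITE"] / total    # 16 of 29 cycles at N=4
round_trip = uart_seconds(57 + 64)       # request plus response frame per tile
```

At N=4 the total is 29 cycles, while the 121 bytes of one request and response take about 10.5 ms on the wire at 115200 baud. Converting the 29 cycles to time needs the design's clock frequency, which is not given here.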
