Quantize YOLOv8 to INT8 with TensorRT (Python), run inference in C++.
Tested on GTX 1650 4 GB, YOLOv8n, 640×640, 100 COCO val images.
| Runtime | Precision | Mean latency | P99 latency | Throughput |
|---|---|---|---|---|
| PyTorch (Python) | FP32 | 39.45 ms | 78.78 ms | 25.3 FPS |
| TensorRT (Python) | INT8 | 14.44 ms | 16.28 ms | 69.2 FPS |
| TensorRT (C++) | INT8 | 8.85 ms | 12.57 ms | 113.0 FPS |
All three measure preprocess → forward → postprocess, excluding disk I/O, with 10 warmup passes.
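The measurement protocol described above (untimed warmup passes, then per-image end-to-end latency) can be sketched as follows. This is a minimal illustration, not the contents of `python/benchmark.py`: the `benchmark` function, its parameters, and the nearest-rank P99 definition are assumptions made for the example.

```python
import math
import statistics
import time


def benchmark(fn, inputs, warmup=10):
    """Time fn(x) once per input after `warmup` untimed passes.

    `fn` is assumed to run the full preprocess -> forward -> postprocess
    pipeline and to block until the result is ready. For GPU work this
    usually means synchronizing inside `fn` (e.g. torch.cuda.synchronize()).
    """
    # Warmup passes are executed but not timed.
    for i in range(warmup):
        fn(inputs[i % len(inputs)])

    latencies_ms = []
    for x in inputs:
        start = time.perf_counter()
        fn(x)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

    latencies_ms.sort()
    mean_ms = statistics.fmean(latencies_ms)
    # Nearest-rank percentile (an assumption; other P99 definitions exist).
    rank = max(1, math.ceil(0.99 * len(latencies_ms)))
    p99_ms = latencies_ms[rank - 1]
    return {
        "mean_ms": mean_ms,
        "p99_ms": p99_ms,
        "fps": 1000.0 / mean_ms,  # throughput derived from mean latency
    }
```

Images would be loaded into memory before calling `benchmark`, so that disk I/O stays outside the timed region.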
- INT8 TensorRT (Python) is 2.7× faster than FP32 PyTorch
- INT8 TensorRT (C++) is 4.5× faster than FP32 PyTorch
- C++ eliminates ~5.6 ms of Python/ultralytics pipeline overhead vs the Python TRT path
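The speedup and overhead figures above follow directly from the mean latencies in the table. A short check, using only the table values (the variable names are illustrative):

```python
# Mean latencies in milliseconds, copied from the results table.
mean_ms = {
    "PyTorch FP32 (Python)": 39.45,
    "TensorRT INT8 (Python)": 14.44,
    "TensorRT INT8 (C++)": 8.85,
}

baseline = mean_ms["PyTorch FP32 (Python)"]

# Speedup relative to the FP32 PyTorch baseline.
speedup = {name: baseline / ms for name, ms in mean_ms.items()}

# Throughput implied by mean latency (frames per second).
fps = {name: 1000.0 / ms for name, ms in mean_ms.items()}

# Latency removed by moving the TensorRT pipeline from Python to C++.
overhead_ms = mean_ms["TensorRT INT8 (Python)"] - mean_ms["TensorRT INT8 (C++)"]

for name in mean_ms:
    print(f"{name}: {speedup[name]:.1f}x, {fps[name]:.1f} FPS")
print(f"Python pipeline overhead: {overhead_ms:.1f} ms")
```

The implied throughput agrees with the table's Throughput column to within rounding of the reported latencies.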
├── python/
│   ├── quantize.py      # export .pt → .engine (INT8)
│   └── benchmark.py     # compare FP32 vs INT8 latency
├── cpp/
│   ├── CMakeLists.txt
│   ├── include/
│   │   ├── engine.hpp   # TensorRT engine wrapper
│   │   └── preprocess.hpp
│   └── src/
│       ├── main.cpp     # CLI inference app
│       ├── engine.cpp
│       └── preprocess.cpp
├── models/              # put .engine files here
├── data/
│   ├── calibration/     # ~100 representative images for INT8 calibration
│   └── test/            # images to benchmark on
└── pyproject.toml
- CUDA ≥ 12.4
- TensorRT ≥ 10.0
- OpenCV ≥ 4.6
- CMake ≥ 3.18
- uv (`pip install uv` or `brew install uv`)
uv sync
# put ~100 representative images in data/calibration/
uv run python/quantize.py --model yolov8n.pt --data data/calibration/dataset.yaml
cp yolov8n.engine models/

mkdir cpp/build && cd cpp/build
cmake .. -DTRT_ROOT=/usr/local/tensorrt
make -j$(nproc)

./cpp/build/infer models/yolov8n.engine data/test output/

uv run python/benchmark.py --fp32 yolov8n.pt --int8 models/yolov8n.engine --images data/test
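For orientation, the INT8 export step run through `python/quantize.py` above could look roughly like the sketch below. This is an assumption about the approach, not the script's actual contents: it uses the ultralytics `YOLO.export` call with `format="engine"` and `int8=True`, and the helper names `build_export_args` and `quantize` are hypothetical.

```python
def build_export_args(data_yaml, imgsz=640):
    """Keyword arguments for an INT8 TensorRT export (hypothetical helper).

    `data_yaml` points at the dataset description used for calibration,
    e.g. data/calibration/dataset.yaml.
    """
    return {
        "format": "engine",  # TensorRT engine output
        "int8": True,        # enable INT8 quantization with calibration
        "data": data_yaml,   # calibration dataset
        "imgsz": imgsz,      # input resolution used in the benchmark
    }


def quantize(model_path, data_yaml):
    """Export a .pt checkpoint to an INT8 .engine file (sketch)."""
    # Imported lazily so the helper above stays usable without a GPU stack.
    from ultralytics import YOLO

    model = YOLO(model_path)
    return model.export(**build_export_args(data_yaml))
```

Calling `quantize("yolov8n.pt", "data/calibration/dataset.yaml")` would mirror the command-line invocation shown earlier. The export requires a working CUDA and TensorRT installation on the build machine, and a TensorRT engine is generally tied to the GPU and TensorRT version it was built with.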