This sample application demonstrates single-view 3D tracking with the DeepStream SDK. Given a static camera's projection matrix and a human model, the application estimates and keeps track of object states in the 3D physical world. It can precisely recover complete bounding boxes, foot locations, and body convex hulls even under partial occlusion. For algorithm and setup details, please refer to the DeepStream Single View 3D Tracking Documentation.
The current SV3DT configuration in this sample uses a 2D pose estimator, which estimates human keypoints on the 2D image plane. The SV3DT algorithm uses the head and waist keypoints as anchors to precisely estimate each person's 3D height. If `poseEstimatorType: 0` is set in `configs/config_tracker_NvDCF_accuracy_3D.yml`, the pose estimator is disabled, and the algorithm instead uses the 2D detection bounding boxes and a human model with fixed height, estimating the 3D location by matching the head with the 2D bounding box's top edge.
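The toggle looks like the following fragment (an excerpt sketch; the surrounding keys of the shipped config file are omitted here):

```yaml
# configs/config_tracker_NvDCF_accuracy_3D.yml (excerpt sketch):
# 0 disables the 2D pose estimator, falling back to 2D detection bboxes
# plus a fixed-height human model; the sample ships with it enabled.
poseEstimatorType: 0
```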
This sample application can be run on both x86 and Jetson platforms inside a DeepStream container. Check here for DeepStream container setup.
- Download the latest DeepStream container image from NGC (e.g., DS 9.0 in the example below).

  ```bash
  export DS_IMG_NAME="nvcr.io/nvidia/deepstream:9.0-triton-multiarch"
  docker pull $DS_IMG_NAME
  ```
- Git clone the `deepstream_reference_apps` repository to the host machine, and enter the single-view 3D tracking directory inside the repository.

  ```bash
  git clone https://github.com/NVIDIA-AI-IOT/deepstream_reference_apps.git
  cd deepstream_reference_apps/deepstream-tracker-3d
  ```
- Download the NVIDIA pretrained `PeopleNet` model for 2D detection.

  ```bash
  # current directory: deepstream_reference_apps/deepstream-tracker-3d
  mkdir -p models/PeopleNet
  cd models/PeopleNet
  wget --no-check-certificate --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/tao/peoplenet/versions/deployable_quantized_onnx_v2.6.3/zip -O peoplenet_deployable_quantized_onnx_v2.6.3.zip
  unzip peoplenet_deployable_quantized_onnx_v2.6.3.zip
  ```
  The model files are now stored in the `PeopleNet` directory as

  ```
  deepstream-tracker-3d
  ├── configs
  ├── streams
  └── models
      └── PeopleNet
          ├── labels.txt
          ├── resnet34_peoplenet.onnx
          └── resnet34_peoplenet_int8.txt
  ```
- Launch the container from the current directory, and execute the 3D tracking pipeline inside the container. The current config requires a display because it uses an EGL sink to visualize the overlay results. To run over ssh without a display, change `type=2` to `type=1` in the `[sink0]` group of `deepstream_app_source1_3d_tracking.txt`. Users can check the DeepStream sink group documentation for the usage of each sink type.
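For a headless run, the sink change might look like the following excerpt (a sketch; the other keys of the shipped `[sink0]` group are omitted):

```ini
; deepstream_app_source1_3d_tracking.txt (excerpt sketch):
; type=1 selects FakeSink (no display, works over ssh);
; type=2 is the EGL-based display sink used by default in this sample.
[sink0]
enable=1
type=1
```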
```bash
cd ../..
sudo xhost + # give container access to display
# current directory: deepstream_reference_apps/deepstream-tracker-3d
docker run --runtime=nvidia -it --rm --net=host --privileged -v /tmp/.X11-unix:/tmp/.X11-unix -v $(pwd):/opt/nvidia/deepstream/deepstream/sources/apps/sample_apps/deepstream-tracker-3d -e DISPLAY=$DISPLAY $DS_IMG_NAME
```

Inside the container, run the following commands. Please note that when `deepstream-app` is launched for the first time, it tries to create the model engine files, which may take a couple of minutes depending on the hardware platform.
```bash
# Install prerequisites
cd /opt/nvidia/deepstream/deepstream/
bash user_additional_install.sh

# Download ReID and BodyPose3DNet models
export MODEL_DIR="/opt/nvidia/deepstream/deepstream/samples/models/Tracker"
mkdir -p $MODEL_DIR
wget 'https://api.ngc.nvidia.com/v2/models/nvidia/tao/reidentificationnet/versions/deployable_v1.0/files/resnet50_market1501.etlt' -P $MODEL_DIR
wget 'https://api.ngc.nvidia.com/v2/models/nvidia/tao/bodypose3dnet/versions/deployable_accuracy_onnx_1.0/files/bodypose3dnet_accuracy.onnx' -P $MODEL_DIR

# Run 3D tracking pipeline
cd /opt/nvidia/deepstream/deepstream/sources/apps/sample_apps/deepstream-tracker-3d/configs
mkdir -p track_results
deepstream-app -c deepstream_app_source1_3d_tracking.txt
```

When the pipeline is launched, DeepStream shows an output video like below while processing the input video. The 3D bounding boxes of the people are reconstructed and plotted even under partial occlusion. The result video is saved as `out.mp4`.
The extracted metadata (e.g., bounding box, frame number, target ID, etc.) is saved in both extended MOT and KITTI formats. For a detailed explanation, please refer to DeepStream Tracker Miscellaneous Data Output.
The MOT results can be found in the `track_dump_0.txt` file, which contains all the object metadata for one stream; the data format is defined below. The foot image position and the convex hull of the projected cylindrical human model are given in video-frame coordinates and can be used to draw the visualization figures below. Users can create such overlay videos or images with their favorite tools, such as OpenCV. The foot world position is defined on the 3D world ground plane corresponding to the 3x4 camera projection matrix.
| frame number (starting from 1) | object unique id | bbox left | bbox top | bbox width | bbox height | confidence | Foot World Position X | Foot World Position Y | blank | class id | tracker state | visibility | Foot Image Position X | Foot Image Position Y | ConvexHull Points Relative to bbox center |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| unsigned int | long unsigned int | int | int | int | int | float | float | float | int | unsigned int | int | float | float | float | int separated by vertical bar |
Sample output is shown below. The green dot in the visualization figure is plotted at (Foot Image Position X, Foot Image Position Y), and the cylinder is drawn by connecting the convex hull points. For example, the bbox in the first line of the output file has center (1433, -50), so its convex hull points are (1433 - 94, -50 - 170), (1433 - 87, -50 - 176), ..., (1433 - 23, -50 + 190). Users can plot the foot location and convex hull on their own as shown in the figure below.
```
1,1,1366,-195,134,290,1.000,-171.080,1058.295,-1,0,2,0.220,1458,80,-94|-170|-87|-176|-71|-183|-49|-191|-23|-196|0|-200|18|-201|29|-198|95|165|95|173|85|183|66|191|42|198|16|202|-4|201|-18|197|-23|190
2,1,1365,-194,135,290,0.989,-170.646,1045.254,-1,0,2,0.230,1458,84,-94|-170|-87|-176|-71|-183|-49|-191|-23|-196|0|-200|18|-201|29|-198|95|165|95|173|85|183|66|191|42|198|16|202|-4|201|-18|197|-23|190
3,1,1366,-196,134,290,0.860,-170.679,1054.089,-1,0,2,0.229,1458,82,-94|-170|-87|-176|-71|-183|-49|-191|-23|-196|0|-200|18|-201|29|-198|95|165|95|173|85|183|66|191|42|198|16|202|-4|201|-18|197|-23|190
...
```
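To make the format concrete, the decoding described above can be sketched in Python (a minimal illustration using the first sample line; field indices follow the table above):

```python
# Minimal sketch: decode one extended-MOT line from track_dump_0.txt into a
# foot point and absolute convex-hull coordinates (first sample line above).
line = ("1,1,1366,-195,134,290,1.000,-171.080,1058.295,-1,0,2,0.220,1458,80,"
        "-94|-170|-87|-176|-71|-183|-49|-191|-23|-196|0|-200|18|-201|29|-198|"
        "95|165|95|173|85|183|66|191|42|198|16|202|-4|201|-18|197|-23|190")
fields = line.split(",")
left, top, width, height = map(int, fields[2:6])
cx, cy = left + width / 2, top + height / 2           # bbox center: (1433.0, -50.0)
foot_img = (float(fields[13]), float(fields[14]))     # green foot dot: (1458.0, 80.0)
offsets = [int(v) for v in fields[15].split("|")]     # dx|dy pairs relative to center
hull = [(cx + offsets[i], cy + offsets[i + 1]) for i in range(0, len(offsets), 2)]
print(hull[0])   # first convex-hull vertex: (1433 - 94, -50 - 170) -> (1339.0, -220.0)
```

These absolute hull vertices can then be passed to any drawing tool (e.g., OpenCV's polygon functions) to reproduce the overlay figure.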
The KITTI results can be found in the `track_results` folder. A file is created for each frame of each stream, and the data format is defined below.
| object Label | object Unique Id | blank | blank | blank | bbox left | bbox top | bbox right | bbox bottom | blank | blank | blank | blank | blank | blank | blank | confidence | visibility (optional) | Foot Image Position X (optional) | Foot Image Position Y (optional) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| string | long unsigned | float | int | float | float | float | float | float | float | float | float | float | float | float | float | float | float | float | float |
Each frame is saved as `track_results/00_000_xxxxxx.txt`. Sample output for one frame is shown below. Note that if an object is found in past-frame data, it will not have the visibility and foot position fields in the KITTI dump.

```
person 1 0.0 0 0.0 1365.907227 -196.875290 1500.249146 93.554581 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.890186 0.229870 1458.166992 81.585060
person 6 0.0 0 0.0 1419.655151 72.774818 1647.446167 561.540894 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.841420 0.408513 1575.264160 531.851746
person 0 0.0 0 0.0 1008.387817 36.228714 1202.886353 421.609314 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.632065 0.504738 1148.117188 399.992645
...
```

To run single-view 3D tracking on other videos, the following changes are required.
- In `deepstream_app_source1_3d_tracking.txt`, change `uri=file://../streams/Retail02_short.mp4` to the new video path, and change `width=1920`, `height=1080`, `tracker-width=1920`, `tracker-height=1080` to the new video's resolution.
- Generate the camera projection matrix for the new video, and change `projectionMatrix_3x4` in `camInfo.yml` to the new matrix.
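For reference, the projection-matrix entry might look like the following sketch (the values are illustrative placeholders, not a calibrated matrix; check the shipped `configs/camInfo.yml` for the exact syntax):

```yaml
# Hedged sketch of a camera-info file. projectionMatrix_3x4 holds the
# row-major 3x4 camera projection matrix P = K[R|t] that maps homogeneous
# 3D world points to image pixels. All values below are placeholders.
projectionMatrix_3x4:
  - 1000.0   # P[0][0]
  - 0.0      # P[0][1]
  - 960.0    # P[0][2]
  - 0.0      # P[0][3]
  - 0.0
  - 1000.0
  - 540.0
  - 0.0
  - 0.0
  - 0.0
  - 1.0
  - 0.0
```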
Note: Multiple streams can run at the same time. Add all the sources to deepstream_app_source1_3d_tracking.txt, and all the camera information files to config_tracker_NvDCF_accuracy_3D.yml.
```yaml
cameraModelFilepath: # in order of the source streams
  - 'camInfo-01.yml'
  - 'camInfo-02.yml'
  - ...
```