Improvements to the NonMaxSuppression kernel #4905

Improvements that could be made to the NMS kernels and operator:

Performance:
* Fuse the nms_sort and nms_filter kernels into one. This needs a sorting kernel with no restriction on the number of elements, so that both stages can use the same number of threads in the general case.
  * Fusing would also let us keep the IoU mask in LDS instead of global memory scratch space.
* Gather kernels can be fused into NMS.
* Use LDS for the sort and filter when the data fits.
* Tune kernels for block size on different hardware.
* Take measurements on hardware.
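
For reference, here is a minimal host-side sketch of what a fused sort and filter stage would compute. It is not the existing kernel code: the names `iou` and `nms_fused`, the `(x1, y1, x2, y2)` corner box format, and the strict `>` threshold comparison are assumptions made for illustration. The local `removed` list stands in for an IoU mask that a fused kernel could keep in LDS.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0.0 else 0.0


def nms_fused(boxes, scores, iou_threshold, max_output_boxes):
    """Sort and filter in one pass; returns the indices of the kept boxes."""
    # Sort step: order candidate indices by descending score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Filter step: the mask is local to this function, so no scratch buffer
    # has to be passed between a separate sort and filter stage.
    removed = [False] * len(order)
    selected = []
    for p, i in enumerate(order):
        if removed[p]:
            continue
        if len(selected) == max_output_boxes:
            break
        selected.append(i)
        # Suppress every lower-scored box that overlaps the kept box too much.
        for q in range(p + 1, len(order)):
            if not removed[q] and iou(boxes[i], boxes[order[q]]) > iou_threshold:
                removed[q] = True
    return selected
```

A sketch like this can also serve as a correctness reference when comparing the output of a fused kernel against the current two-kernel path.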

Memory:
* Don't copy boxes or scores for the sort and filter; use indices instead. This could lower performance because accesses to the sorted boxes and scores could be far apart in memory.
* Do the compact kernel in-place in memory.
* Use bits instead of bools for the removed mask.
  * Makes work distribution for the filter more complicated.
* Limit output buffer size using max_output_boxes_per_class if it is known at compile time.
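
Two of the memory items above can be sketched on the host side. The helper names below (`make_mask`, `mark_removed`, `is_removed`, `max_output_rows`) and the 32-bit word size are hypothetical choices for illustration, not part of the existing code. The first three helpers pack the removed mask into bits; the last computes an upper bound on the number of selected boxes when `max_output_boxes_per_class` is known.

```python
WORD_BITS = 32  # assumed word size; one word holds the flags of 32 boxes


def make_mask(num_boxes):
    """Allocate a zeroed bit mask with one bit per box."""
    return [0] * ((num_boxes + WORD_BITS - 1) // WORD_BITS)


def mark_removed(mask, i):
    """Set the removed bit for box i."""
    mask[i // WORD_BITS] |= 1 << (i % WORD_BITS)


def is_removed(mask, i):
    """Return True if box i has been marked as removed."""
    return ((mask[i // WORD_BITS] >> (i % WORD_BITS)) & 1) == 1


def max_output_rows(num_batches, num_classes, num_boxes, max_output_boxes_per_class):
    """Upper bound on selected boxes across all batches and classes."""
    return num_batches * num_classes * min(num_boxes, max_output_boxes_per_class)
```

One likely reason the bit mask complicates work distribution is that 32 boxes now share a word, so threads that update neighbouring boxes would need an atomic OR or a layout where each thread owns whole words.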
