
Improvements to the NonMaxSuppression kernel #4905

@CharlieL7

Improvements that could be made to the NMS kernels and operator:

Performance:

  • Fuse the nms_sort and nms_filter kernels into one. This requires a sorting kernel with no restriction on the number of elements, so that both stages can use the same number of threads in the general case.
    • Fusing also allows us to keep the IoU mask in LDS instead of global memory scratch space.
  • Gather kernels can be fused into NMS.
  • Use LDS memory for the sort and filter when the data can fit.
  • Tune the kernel block sizes for different hardware.
  • Need to do measurements on hardware.
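
For reference, the two stages that the first item proposes fusing can be sketched on the host as below. This is only a sequential Python sketch and not the actual kernel code: the function names `sort_stage`, `filter_stage`, and `nms_reference` are hypothetical, and the `(x1, y1, x2, y2)` box layout is an assumption. It does show where the copied boxes and the IoU mask come from.

```python
def iou(a, b):
    """IoU of two boxes, assumed to be (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0.0 else 0.0


def sort_stage(boxes, scores):
    """Sort by descending score and copy the boxes into sorted order."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return [boxes[i] for i in order], order


def filter_stage(sorted_boxes, iou_threshold):
    """Build the pairwise IoU mask, then greedily mark suppressed boxes."""
    n = len(sorted_boxes)
    # mask[i][j] is True when lower-ranked box j overlaps box i above the threshold.
    # This n x n table is the scratch space that would ideally fit in LDS.
    mask = [[j > i and iou(sorted_boxes[i], sorted_boxes[j]) > iou_threshold
             for j in range(n)] for i in range(n)]
    removed = [False] * n
    for i in range(n):
        if removed[i]:
            continue  # a suppressed box cannot suppress others
        for j in range(i + 1, n):
            if mask[i][j]:
                removed[j] = True
    return removed


def nms_reference(boxes, scores, iou_threshold):
    """Run both stages and return the original indices of the kept boxes."""
    sorted_boxes, order = sort_stage(boxes, scores)
    removed = filter_stage(sorted_boxes, iou_threshold)
    return [order[rank] for rank in range(len(order)) if not removed[rank]]
```

In a fused kernel, `sorted_boxes` and `mask` would never need to leave on-chip memory between the two stages, which is the motivation for the LDS items above.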

Memory:

  • Don't copy boxes or scores for the sort and filter; use indices instead. This could lower performance because box and score accesses in sorted order could be far apart in memory.
  • Run the compact kernel in place.
  • Use bits instead of bools for the removed mask.
    • Makes work distribution for the filter more complicated.
  • Limit output buffer size using max_output_boxes_per_class if it is known at compile time.
