Improvements that could be made to the NMS kernels and operator:
Performance:
- Fuse the nms_sort and nms_filter kernels into one by using a sorting kernel with no limit on the number of elements, so that both stages can use the same number of threads in the general case.
  - This would also allow the IoU mask to live in LDS instead of global memory scratch space.
- Gather kernels can be fused into NMS.
- Use LDS memory for the sort and filter when the data can fit.
- Tune the kernel block size for different hardware.
  - This requires measurements on the target hardware.
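
For the "use LDS when the data can fit" item, the check could look roughly like the host-side sketch below. The per-box layout (4 floats per box, 1 float per score, 1 byte per removed flag) and the names `bytes_per_box`, `fits_in_lds` and `max_boxes_in_lds` are assumptions made for illustration, not the existing kernel layout. The LDS limit is passed in as a parameter because it should come from a device query rather than a hard-coded value.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical per-box footprint: 4 floats for the box corners,
// 1 float for the score and 1 byte for the "removed" flag.
constexpr std::size_t bytes_per_box = 4 * sizeof(float) + sizeof(float) + 1;

// Bytes of LDS one workgroup needs to stage num_boxes boxes for sort and filter.
constexpr std::size_t lds_bytes_needed(std::size_t num_boxes)
{
    return num_boxes * bytes_per_box;
}

// True if the sort and filter data for num_boxes fits in the given LDS budget.
constexpr bool fits_in_lds(std::size_t num_boxes, std::size_t lds_limit_bytes)
{
    return lds_bytes_needed(num_boxes) <= lds_limit_bytes;
}

// Largest number of boxes that fits in the given LDS budget.
inline std::size_t max_boxes_in_lds(std::size_t lds_limit_bytes)
{
    assert(lds_limit_bytes > 0 && "query the device for its LDS size first");
    return lds_limit_bytes / bytes_per_box;
}
```

A dispatcher could call `fits_in_lds` per class or per batch and fall back to the global memory path when it returns false. Whether the LDS path is actually faster still has to be measured.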
Memory:
- Don't copy boxes or scores for the sort and filter; sort and filter indices instead. This could lower performance because accesses to boxes and scores in sorted order may be far apart in memory.
- Run the compact kernel in place.
- Use bits instead of bools for the removed mask.
  - This makes work distribution for the filter more complicated.
- Limit output buffer size using max_output_boxes_per_class if it is known at compile time.
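
Two of the Memory items (sorting indices instead of copying boxes and scores, and a bit-packed removed mask) can be sketched on the CPU as follows. This is a reference sketch only, not the GPU kernels: the `Box` layout, `BitMask` and `nms_by_index` are hypothetical names, and the greedy loop is the textbook NMS formulation rather than the current nms_sort / nms_filter split.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical box layout: corners (x1, y1) and (x2, y2), x1 <= x2, y1 <= y2.
struct Box { float x1, y1, x2, y2; };

inline float iou(const Box& a, const Box& b)
{
    const float iw = std::min(a.x2, b.x2) - std::max(a.x1, b.x1);
    const float ih = std::min(a.y2, b.y2) - std::max(a.y1, b.y1);
    if(iw <= 0.0f || ih <= 0.0f) return 0.0f; // no overlap
    const float inter  = iw * ih;
    const float area_a = (a.x2 - a.x1) * (a.y2 - a.y1);
    const float area_b = (b.x2 - b.x1) * (b.y2 - b.y1);
    return inter / (area_a + area_b - inter);
}

// One bit per box instead of one bool per box: 32 flags per 32-bit word.
struct BitMask
{
    std::vector<std::uint32_t> words;
    explicit BitMask(std::size_t n) : words((n + 31) / 32, 0u) {}
    void set(std::size_t i) { words[i / 32] |= (1u << (i % 32)); }
    bool test(std::size_t i) const { return ((words[i / 32] >> (i % 32)) & 1u) != 0u; }
};

// Greedy NMS that sorts indices rather than copying boxes and scores.
// Returns the original indices of the kept boxes in descending score order.
inline std::vector<std::size_t> nms_by_index(const std::vector<Box>& boxes,
                                             const std::vector<float>& scores,
                                             float iou_threshold)
{
    assert(boxes.size() == scores.size());
    std::vector<std::size_t> order(boxes.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) { return scores[a] > scores[b]; });

    BitMask removed(boxes.size()); // indexed by sorted position
    std::vector<std::size_t> kept;
    for(std::size_t i = 0; i < order.size(); ++i)
    {
        if(removed.test(i)) continue;
        kept.push_back(order[i]);
        // Indirect access boxes[order[j]] is the scattered read noted above.
        for(std::size_t j = i + 1; j < order.size(); ++j)
            if(!removed.test(j) && iou(boxes[order[i]], boxes[order[j]]) > iou_threshold)
                removed.set(j);
    }
    return kept;
}
```

On a GPU, several threads may need to set bits in the same word, so the filter would need atomic OR operations or a one-thread-per-word distribution. That is the added complexity the bit-mask item refers to.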