Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
34 commits
Select commit Hold shift + click to select a range
8e3f22e
Initial AI implementation from prototype
CharlieL7 May 8, 2026
ae350e3
Merge branch 'develop' of github.com:ROCm/AMDMIGraphX into gpu_nms_ke…
CharlieL7 May 11, 2026
4ec2fe1
AI edit with "compact" kernel
CharlieL7 May 12, 2026
18ae57e
AI split into 3 kernels
CharlieL7 May 13, 2026
ced7e69
Change NMS ONNX parsing and ref behavior
CharlieL7 May 13, 2026
84c7d3b
Progress, refactor
CharlieL7 May 14, 2026
43c10be
Cleanup before refactor into 3 JIT instructions
CharlieL7 May 14, 2026
f2734dc
minor progress
CharlieL7 May 14, 2026
6379377
AI refactor to separate instructions
CharlieL7 May 14, 2026
2ac67b0
Progress on cleanup, now segementation fault in kernel
CharlieL7 May 15, 2026
5ca611f
Fix JIT global and local. Single verify_test test_nms works.
CharlieL7 May 18, 2026
a48c909
Fixes
CharlieL7 May 18, 2026
e1e936b
Add ref-like tests for GPU NMS, rename shape's flatten to flatten_shapes
CharlieL7 May 18, 2026
fc728f3
Remove verify NMS tests. They don't make sense for random data.
CharlieL7 May 18, 2026
c2ddb73
Fix kernels and tests
CharlieL7 May 19, 2026
600d9fb
Progress update
CharlieL7 May 20, 2026
d5934c0
Version with iterator nms_data
CharlieL7 May 20, 2026
b5c1e77
Kernel version using block shared memory for nms_data
CharlieL7 May 20, 2026
1011256
Progress on polish
CharlieL7 May 20, 2026
b5a9568
Minor cleanup
CharlieL7 May 20, 2026
32c779d
Move prepare_nonmaxsuppression into lowering
CharlieL7 May 21, 2026
c5fb107
Add env var for retaining current NMS behavior for now
CharlieL7 May 21, 2026
fc7a5cc
Merge branch 'develop' of github.com:ROCm/AMDMIGraphX into gpu_nms_ke…
CharlieL7 May 21, 2026
289d5ad
Formatting
CharlieL7 May 21, 2026
49e3a2a
Update NMS op to do fixed_shape_error_check only on fixed shapes
CharlieL7 May 21, 2026
94c3744
Update tests and fixes
CharlieL7 May 21, 2026
22d8beb
Add ref fallback for dynamic input NMS and cleanup kernel types
CharlieL7 May 21, 2026
8fc4844
Get rid of nms_data in kernel to use global memory only for now
CharlieL7 May 21, 2026
229cf90
doc comments cleanup
CharlieL7 May 21, 2026
8bb7865
Formatting
CharlieL7 May 21, 2026
4c27d5f
Licensing
CharlieL7 May 21, 2026
b3765f6
Formatting continued
CharlieL7 May 21, 2026
0bd8d04
Add changelog
CharlieL7 May 26, 2026
59b95b7
Tidy and formatting
CharlieL7 May 27, 2026
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,8 @@ Full documentation for MIGraphX is available at
* Added N-D scale and zero-point support for `QLinearMatMul` operator.
* Added test cases for `QLinearConv` per-channel scale and `QLinearMatMul` N-D per-channel quantization.
* Added find_concat_same_input matcher to convert concat(N*x) into multibroadcast(x) to reduce hipCopy() (#4981)
* Added GPU kernel for ONNX `NonMaxSuppression` operation and redesigned the `nonmaxsuppression` operation to better represent the data-dependent output shape in the MIGraphX IR (#4893).

### Changed

* Converted `nonzero` operator from device implementation to JIT compilation (#4720).
Expand Down Expand Up @@ -68,6 +70,7 @@ Full documentation for MIGraphX is available at

### Removed
* Removed legacy device implementations for `argmin` and `argmax` in favor of the JIT implementations recently added (#4658).
* Removed `onnx_options::use_dyn_output` after redesign of `NonMaxSuppression` operator (#4893).

## MIGraphX 2.15 for ROCm 7.2.0

Expand Down
18 changes: 9 additions & 9 deletions docs/dev/onnx_operators.rst
Original file line number Diff line number Diff line change
Expand Up @@ -511,15 +511,15 @@ Operator Support Matrix
+--------------------------+-----------+-----------------+------------------------------+
| NegativeLogLikelihoodLoss| ❌ | | |
+--------------------------+-----------+-----------------+------------------------------+
| NonMaxSuppression | ✅ | FP8, FP16, | fixed output |
| | | FP32, FP64 | size unless |
| | | | ``use_dyn_output`` |
| | | | set |
+--------------------------+-----------+-----------------+------------------------------+
| NonZero | ✅ | FP8, FP16, | fixed output |
| | | FP32, FP64 | size unless |
| | | | ``use_dyn_output`` |
| | | | set |
| NonMaxSuppression | ✅ | FP8, FP16, | |
| | | FP32, FP64 | |
| | | | |
| | | | |
Comment thread
CharlieL7 marked this conversation as resolved.
+--------------------------+-----------+-----------------+------------------------------+
| NonZero | ✅ | FP8, FP16, | fixed output size |
| | | FP32, FP64 | |
| | | | |
| | | | |
+--------------------------+-----------+-----------------+------------------------------+
| Not | ✅ | BOOL | |
+--------------------------+-----------+-----------------+------------------------------+
Expand Down
8 changes: 8 additions & 0 deletions docs/reference/MIGraphX-dev-env-vars.rst
Original file line number Diff line number Diff line change
Expand Up @@ -298,6 +298,14 @@ Model performance tunable variables change the compilation behavior of a model.

| Default: Full dynamic shape support is disabled.

* - | ``MIGRAPHX_USE_DYNAMIC_NMS``
| When set, the ``NonMaxSuppression`` ONNX parser performs a dynamic slice on the raw indices tensor to trim it to the number of selected boxes, producing an output with a dynamic shape.

- | ``1``: A dynamic slice is applied to the raw indices tensor, producing a dynamic-shaped output.
| ``0``: Returns to default behavior.

| Default: The whole raw indices tensor is returned without slicing.

Matching
**********

Expand Down
2 changes: 0 additions & 2 deletions src/include/migraphx/onnx.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -56,8 +56,6 @@ struct onnx_options
/// Since loop will become a tensor of max iter size a huge number can cause overflow during
/// shape computations.
int64_t limit_max_iterations = std::numeric_limits<uint16_t>::max();
Comment thread
CharlieL7 marked this conversation as resolved.
/// Use dynamic output for operators when available
bool use_dyn_output = false;
/// Parse in ONNX node names as debug symbols
bool use_debug_symbols = false;
/// Path to use for the external data if it is stored at different location compared to onnx
Expand Down
63 changes: 30 additions & 33 deletions src/include/migraphx/op/nonmaxsuppression.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -36,27 +36,32 @@
#include <migraphx/tensor_view.hpp>
#include <migraphx/shape_for_each.hpp>
#include <migraphx/check_shapes.hpp>
#include <migraphx/shape.hpp>
#include <migraphx/output_iterator.hpp>
#include <migraphx/argument.hpp>
#include <migraphx/par.hpp>

/*
https://github.com/onnx/onnx/blob/main/docs/Operators.md#NonMaxSuppression
*/
/**
* nonmaxsuppression(boxes,
* scores,
* optional(max_output_boxes_per_class),
* optional(iou_threshold),
* optional(score_threshold));
* Outputs tuple of {tensor with dims[max_num_boxes, 3]: selected_box_indices, scalar int64_t:
* num_selected_indices}
*/
namespace migraphx {
inline namespace MIGRAPHX_INLINE_NS {
namespace op {

struct nonmaxsuppression
{
bool center_point_box = false;
bool use_dyn_output = false;

template <class Self, class F>
static auto reflect(Self& self, F f)
{
return pack(f(self.center_point_box, "center_point_box"),
f(self.use_dyn_output, "use_dyn_output"));
return pack(f(self.center_point_box, "center_point_box"));
}

std::string name() const { return "nonmaxsuppression"; }
Expand All @@ -69,8 +74,9 @@ struct nonmaxsuppression
auto max_classes = inputs.at(1).max_lens().at(1);
auto max_spatial_dimension = inputs.at(0).max_lens().at(1);
// Per ONNX spec, output is [num_selected_indices, 3] where each row is
// [batch_index, class_index, box_index]. The maximum possible
// [batch_index, class_index, box_index]. The maximum possible
// num_selected_indices = num_batches * num_classes * spatial_dimension.
// TODO: can also be limited by max_output_boxes_per_class
const auto max_num_boxes = max_batches * max_classes * max_spatial_dimension;

auto fixed_shape_error_check = [&]() {
Expand All @@ -87,21 +93,14 @@ struct nonmaxsuppression
}
};

bool needs_dyn_output = use_dyn_output or inputs.at(0).dynamic() or inputs.at(1).dynamic();

if(needs_dyn_output)
{
std::vector<shape::dynamic_dimension> out_lens = {};
out_lens.push_back({0, max_num_boxes});
out_lens.push_back({3, 3});
return {shape::int64_type, out_lens};
}
else
if(not(inputs.at(0).dynamic() or inputs.at(1).dynamic()))
{
fixed_shape_error_check();
std::vector<std::size_t> out_lens = {max_num_boxes, 3};
return {shape::int64_type, out_lens};
}
std::vector<std::size_t> out_lens = {max_num_boxes, 3};
shape s_ind{shape::int64_type, out_lens};
shape s_num_selected{shape::int64_type, {1}};
return shape({s_ind, s_num_selected});
}

struct box
Expand Down Expand Up @@ -190,7 +189,8 @@ struct nonmaxsuppression
return intersection_over_union > iou_threshold;
}

// filter boxes below score_threshold
// Filter boxes below score_threshold.
// Don't filter for score if score_threshold == 0.f
template <class T>
std::vector<std::pair<double, int64_t>>
filter_boxes_by_score(T scores_start, std::size_t num_boxes, double score_threshold) const
Expand Down Expand Up @@ -232,10 +232,11 @@ struct nonmaxsuppression
std::size_t compute_nms(Output output,
const Boxes& boxes,
const Scores& scores,
std::size_t max_output_boxes_per_class,
int64_t max_output_boxes_per_class,
double iou_threshold,
double score_threshold) const
{
// NOTE: should not need to fill with 0
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we just remove this then?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's to preserve the previous behavior for now. Technically the operator after NMS should never be reading the values after num_selected_indices.

std::fill(output.begin(), output.end(), 0);
const auto& lens = scores.get_shape().lens();
const auto num_batches = lens[0];
Expand Down Expand Up @@ -302,14 +303,16 @@ struct nonmaxsuppression
argument compute(const shape& output_shape, std::vector<argument> args) const
{
// make buffer of maximum size
shape max_output_shape = {output_shape.type(), output_shape.max_lens()};
auto output_shapes = flatten_shapes({output_shape});
shape max_output_shape = {output_shapes.at(0).type(), output_shapes.at(0).max_lens()};
argument result{max_output_shape};
argument num_selected_result{output_shapes.at(1)};

std::size_t max_output_boxes_per_class =
(args.size() > 2) ? (args.at(2).at<std::size_t>()) : 0;
int64_t max_output_boxes_per_class = (args.size() > 2) ? (args.at(2).at<std::size_t>()) : 0;
if(max_output_boxes_per_class == 0)
{
return result;
num_selected_result.visit([&](auto output) { output[0] = 0; });
return {{result, num_selected_result}};
}
double iou_threshold = (args.size() > 3) ? (args.at(3).at<double>()) : 0.0f;
double score_threshold = (args.size() > 4) ? (args.at(4).at<double>()) : 0.0f;
Expand All @@ -325,14 +328,8 @@ struct nonmaxsuppression
score_threshold);
});
});
if(output_shape.dynamic())
{
return result.reshape({output_shape.type(), {num_selected, 3}});
}
else
{
return result;
}
num_selected_result.visit([&](auto output) { output[0] = num_selected; });
return {{result, num_selected_result}};
}
};

Expand Down
2 changes: 1 addition & 1 deletion src/include/migraphx/shape.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -621,7 +621,7 @@ struct MIGRAPHX_EXPORT shape
};

/// Flatten subshapes to a single vector of non-tuple type of shapes
MIGRAPHX_EXPORT std::vector<shape> flatten(const std::vector<shape>& shapes);
MIGRAPHX_EXPORT std::vector<shape> flatten_shapes(const std::vector<shape>& shapes);

Comment thread
CharlieL7 marked this conversation as resolved.
MIGRAPHX_EXPORT void migraphx_to_value(value& v, const shape& s);
MIGRAPHX_EXPORT void migraphx_from_value(const value& v, shape& s);
Expand Down
1 change: 0 additions & 1 deletion src/onnx/include/migraphx/onnx/onnx_parser.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -102,7 +102,6 @@ struct onnx_parser
std::unordered_map<std::string, std::vector<std::size_t>> map_input_dims;
std::unordered_map<std::string, shape::dynamic_dimension> dim_params;
std::unordered_map<std::string, std::vector<shape::dynamic_dimension>> map_dyn_input_dims;
bool use_dyn_output = false;
bool skip_unknown_operators = false;
bool use_debug_symbols = false;
int64_t max_loop_iterations = 10;
Expand Down
1 change: 0 additions & 1 deletion src/onnx/onnx.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -72,7 +72,6 @@ static program parse_onnx_from(const onnx_options& options, Ts&&... xs)
parser.skip_unknown_operators = options.skip_unknown_operators;
parser.max_loop_iterations = options.max_loop_iterations;
parser.limit_max_iterations = options.limit_max_iterations;
parser.use_dyn_output = options.use_dyn_output;

if(options.print_program_on_error)
{
Expand Down
25 changes: 21 additions & 4 deletions src/onnx/parse_nonmaxsuppression.cpp
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
/*
* The MIT License (MIT)
*
* Copyright (c) 2015-2022 Advanced Micro Devices, Inc. All rights reserved.
* Copyright (c) 2015-2026 Advanced Micro Devices, Inc. All rights reserved.
*
* Permission is hereby granted, free of charge, to any person obtaining a copy
* of this software and associated documentation files (the "Software"), to deal
Expand All @@ -24,6 +24,9 @@
#include <migraphx/onnx/op_parser.hpp>
#include <migraphx/ranges.hpp>
#include <migraphx/make_op.hpp>
#include <migraphx/env.hpp>

MIGRAPHX_DECLARE_ENV_VAR(MIGRAPHX_USE_DYNAMIC_NMS)

namespace migraphx {
inline namespace MIGRAPHX_INLINE_NS {
Expand All @@ -38,9 +41,23 @@
const onnx_parser::node_info& info,
const std::vector<instruction_ref>& args) const
{
auto op = parser.load(opd.op_name, info);
op.from_value({{"use_dyn_output", parser.use_dyn_output}});
return info.add_instruction(op, args);
auto op = parser.load(opd.op_name, info);
auto nms_ins = info.add_instruction(op, args);
// slice with variable ends to handle dynamic shape output.
auto indices = info.add_instruction(make_op("get_tuple_elem", {{"index", 0}}), nms_ins);
if(enabled(MIGRAPHX_USE_DYNAMIC_NMS{}))
{
// TODO: planning to make this the default behavior and removing the env var.
auto num_selected =
info.add_instruction(make_op("get_tuple_elem", {{"index", 1}}), nms_ins);
auto slice_ins = info.add_instruction(
make_op("slice", {{"axes", {0}}, {"starts", {0}}}), indices, num_selected);
return slice_ins;

Check warning on line 55 in src/onnx/parse_nonmaxsuppression.cpp

View workflow job for this annotation

GitHub Actions / cppcheck

style: Variable is returned immediately after its declaration, can be simplified to just return expression. [migraphx-RedundantLocalVariable]
}
else
{
return indices;
}
}
};

Expand Down
4 changes: 2 additions & 2 deletions src/shape.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1380,14 +1380,14 @@ const std::vector<shape>& shape::sub_shapes() const { return impl->m_shapes; }

void shape::debug_print() const { std::cout << *this << std::endl; }

std::vector<shape> flatten(const std::vector<shape>& shapes)
std::vector<shape> flatten_shapes(const std::vector<shape>& shapes)
{
std::vector<shape> result;
for(const auto& s : shapes)
{
if(s.type() == shape::tuple_type)
{
auto subs = flatten(s.sub_shapes());
auto subs = flatten_shapes(s.sub_shapes());
result.insert(result.end(), subs.begin(), subs.end());
}
else
Expand Down
1 change: 1 addition & 0 deletions src/targets/gpu/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -179,6 +179,7 @@ add_library(migraphx_gpu
loop.cpp
lrn.cpp
mlir.cpp
nms_ops.cpp
no_device.cpp
pack_args.cpp
prefuse_ops.cpp
Expand Down
2 changes: 1 addition & 1 deletion src/targets/gpu/code_object_op.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -41,7 +41,7 @@ shape code_object_op::compute_shape(std::vector<shape> inputs) const
std::transform(einputs.begin(), einputs.end(), einputs.begin(), [](const shape& s) {
return s.normalize_standard();
});
if(not migraphx::equal(flatten(einputs), flatten(inputs), &shape::is_compatible))
if(not migraphx::equal(flatten_shapes(einputs), flatten_shapes(inputs), &shape::is_compatible))
MIGRAPHX_THROW("Input shapes have changed: [" + to_string_range(einputs) + "] -> [" +
to_string_range(inputs) + "]");
auto output_buffer_shape = inputs.at(get_output_arg(inputs.size()));
Expand Down
2 changes: 2 additions & 0 deletions src/targets/gpu/compile_hip_code_object.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -192,6 +192,8 @@ compute_global_for(const context& ctx, std::size_t n, std::size_t over)
};
}

// `n`: The amount of parallel work within a block.
// `max_block_size`: Upper limit on block size.
std::size_t compute_block_size(const context& ctx, std::size_t n, std::size_t max_block_size)
{
const std::size_t min_block_size = ctx.get_current_device().get_wavefront_size();
Expand Down
11 changes: 10 additions & 1 deletion src/targets/gpu/device/include/migraphx/gpu/device/scan.hpp
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
/*
* The MIT License (MIT)
*
* Copyright (c) 2015-2024 Advanced Micro Devices, Inc. All rights reserved.
* Copyright (c) 2015-2026 Advanced Micro Devices, Inc. All rights reserved.
*
* Permission is hereby granted, free of charge, to any person obtaining a copy
* of this software and associated documentation files (the "Software"), to deal
Expand Down Expand Up @@ -33,6 +33,14 @@ inline namespace MIGRAPHX_INLINE_NS {
namespace gpu {
namespace device {

// Inclusive prefix sum within a kernel block.
// Hillis-Steele scan with double-buffered (ping-pong) shared array.
// `N`: upper bound on blockDim.x, sizes the shared buffer.
// `op`: associative binary reduce function ex. sum or max.
// `init`: initializer
// `fs`: striding function for thread work distribution.
// `input`: input with input(index_int).
// `output`: output with output(index_int, inclusive_scan_value_at_index_int).
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Appreciate the added comments here.

template <index_int N,
class Op,
class T,
Expand Down Expand Up @@ -72,6 +80,7 @@ __device__ void block_scan(index idx, Op op, T init, ForStride fs, Input input,
});
}

// Overload of block_scan with default local_stride up to `n`.
template <index_int N, class Op, class T, class Input, class Output>
__device__ void block_scan(index idx, Op op, T init, index_int n, Input input, Output output)
{
Expand Down
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
/*
* The MIT License (MIT)
*
* Copyright (c) 2015-2025 Advanced Micro Devices, Inc. All rights reserved.
* Copyright (c) 2015-2026 Advanced Micro Devices, Inc. All rights reserved.
*
* Permission is hereby granted, free of charge, to any person obtaining a copy
* of this software and associated documentation files (the "Software"), to deal
Expand Down Expand Up @@ -37,7 +37,9 @@ struct context;

struct hip_compile_options
{
// Total number of threads
std::size_t global;
// Threads per block
std::size_t local;
std::vector<shape> inputs;
shape output;
Expand Down
Loading
Loading