Fixing nonzero accumulator overflow and scatternd stride aware by urpetkov-amd · Pull Request #4919 · ROCm/AMDMIGraphX

urpetkov-amd · 2026-05-28T15:44:22Z

Motivation

Two latent correctness bugs in the GPU kernels for NonZero and ScatterND
combined to break any model whose runtime path runs

NonZero -> unsqueeze -> transpose -> slice -> concat -> ScatterND

on more than ~255 selected elements. The vision-encoder model used by
SmolVLM (vision_encoder.onnx with pixel_attention_mask all-true and
pixel_values from a real image) is the simplest reproducer.

Running

migraphx-driver perf vision_encoder.onnx \
    --input-dim @pixel_values 1 1 3 512 512 \
                @pixel_attention_mask 1 1 512 512 \
    --fill1 pixel_attention_mask

on Windows produced one of two failure modes depending on inputs:

With every mask element true (1024 set bits feeding NonZero):
nonzero_kernel returned hipErrorLaunchFailure from an
out-of-bounds store and the process exited.
With non-trivial mask values (≤ 1024 set bits but a non-packed
indices layout downstream): the whole run completed but
migraphx-driver verify showed RMS ≈ 0.58, Max diff ≈ 1023,
with the first divergence at output[0] = 0 vs reference 1.
verify --bisect localized the first divergent instruction to a
scatternd_kernel whose indices tensor had logical shape {1024, 2}
but strides {1, 1024}.

Both failure modes share one root cause: a GPU kernel making an
assumption about its inputs that the surrounding pipeline does not
guarantee. The goals of this PR are:

Make NonZero correct for inputs with more than 255 set elements.
Make ScatterND correct for indices tensors with arbitrary strides
(in particular, the non-packed layouts produced by
transpose/slice/squeeze/concat chains downstream of NonZero).
Keep both kernels behaviorally identical for every previously-passing
case.

Technical Details

Bug 1 — `nonzero_kernel`: accumulator type too narrow

src/targets/gpu/kernels/include/migraphx/kernels/nonzero.hpp

The kernel uses a single-block prefix sum (block_scan) to assign each
non-zero input position a contiguous output slot. The element-wise
predicate was returning uint8_t:

// before
block_scan(
    idx,
    op::sum{},
    0,
    elem_num,
    [&](auto j) -> uint8_t { return float_equal(input[j], 0) ? 0 : 1; },
    ...
);

block_scan deduces the running-sum accumulator type from the predicate
return type, so the accumulator was also uint8_t. Once more than 255
input elements were non-zero the accumulator wrapped, value - 1 became
a very large index_int, and the kernel wrote into

output[make_array<index_int>(k, out_loc)] = multi_idx[k];

at an out-of-bounds offset, which on AMD GPU surfaces as
hipErrorLaunchFailure.

Fix: widen the accumulator by widening the predicate return type to
index_int (== uint32_t), which is the same type already used
everywhere else for flat-index arithmetic in the kernels:

// after
// input (elem_num) uint32_t covers any input we realistically see;
// a narrower type uint8_t wraps once the prefix sum exceeds the
// type's range, producing negative out_loc values and OOB stores
block_scan(
    idx,
    op::sum{},
    index_int{0},
    elem_num,
    [&](auto j) -> index_int { return float_equal(input[j], 0) ? 0 : 1; },
    [&](auto j, auto value) {
        MIGRAPHX_ASSERT(j < elem_num);
        if(float_equal(input[j], 0))
            return;
        MIGRAPHX_ASSERT(value > 0);
        const index_int out_loc = value - 1;
        const auto multi_idx    = in_shape.multi(j);
        for(auto k = 0; k < multi_idx.size(); ++k)
        {
            output[make_array<index_int>(k, out_loc)] = multi_idx[k];
        }
    });

The added MIGRAPHX_ASSERT(value > 0) catches the broken-invariant case
in debug builds: after the float_equal(input[j], 0) early-out, any
thread still in flight must have contributed 1 to the prefix sum, so
the running sum is strictly positive at that point. With the old
uint8_t accumulator the assertion would have fired immediately on
overflow.

Why this scope is correct:

uint8_t was not chosen for memory bandwidth — block_scan operates
on shared memory and the cost is identical with uint32_t.
out_loc is used as a tensor coordinate, which is already
index_int everywhere else, so widening at the source removes the
only narrowing in the data path.
No behavior change for any input with ≤ 255 set bits — the prior
path produced identical results.

Bug 2 — `scatternd_kernel`: packed-layout assumption

src/targets/gpu/kernels/include/migraphx/kernels/scatternd.hpp

For each update element the kernel needs to read the k scatter
components for the current outer position from the indices tensor.
The original code did:

// before
auto index_start = indices_t.begin() + indices_shape.index(indices_idx);
auto index_end   = index_start + k;
auto out_idx     = output_shape.multi(0);
copy(index_start, index_end, out_idx.begin());

Reading carefully:

indices_shape.index(indices_idx) correctly computes the memory
offset of the first of the k components via the strides.
index_start + k then advances k memory slots forward, and
copy(...) reads k contiguous int64 values from memory.

That is a packed-layout assumption: it requires
strides_indices.back() == 1 and no later dim to interleave the k
components in memory. MIGraphX does not guarantee that — the chain

unsqueeze[axes={1}] -> transpose[perm={0,2,1}] -> slice -> squeeze -> concat

(which is what every NonZero -> Gather/Scatter model produces) hands
the scatter operator an indices tensor with logical shape {N, k} but
in-memory layout that of {k, N}. For {N, k} = {1024, 2} the actual
strides are {1, 1024} and the memory layout is

addr:  [   0,    1, …, 1023, 1024, 1025, …, 2047]
value: [ r_0,  r_1, …, r_1023, c_0,  c_1, …, c_1023]
       └──── all rows ────┘  └──── all columns ────┘

For thread t, the kernel was reading memory[t] and memory[t+1],
i.e. (indices[t, 0], indices[t+1, 0]) — instead of the intended
(indices[t, 0], indices[t, 1]). In the vision-encoder case every row
component is 0, so every thread ended up scattering to
output[{0, 0}], producing both an assign_none write race and stale
values everywhere else in the output.

Fix: use the tensor view's stride-aware iterator.

// after
// begin_at is stride-aware; raw begin()+offset would only be correct
// for packed indices.
auto index_start = indices_t.begin_at(indices_idx);
auto out_idx     = output_shape.multi(0);
copy(index_start, index_start + k, out_idx.begin());
copy(updates_idx.begin() + q - 1, updates_idx.end(), out_idx.begin() + k);

begin_at(multi_index) returns an iota-style iterator carrying the
logical flat index Shape::single(multi_index). Each dereference
routes through tensor_view::operator[](index_int) →
Shape::index(index_int) → compute_index, which produces the correct
memory offset for any stride pattern. Incrementing the iterator by 1 in
logical flat space corresponds to incrementing the last axis by 1,
which is exactly the walk we want for reading the k scatter
components.

Why this scope is correct:

For packed layouts Shape::index(index_int) takes the
if (this->standard()) return i; fast path, so the generated memory
accesses are identical to the previous code. No hot-path regression.
The slow path only fires when the input actually has non-standard
strides — the case that used to be silently wrong.
No other kernel needs the same surgery. scatter.hpp (Scatter,
non-ND) uses indices_t[i] (single flat index), which already
routes through compute_index. Gather kernels likewise.

Changelog Category

Add a CHANGELOG.md entry for any option other than Not Applicable

- Added: New functionality.
- Changed: Changes to existing functionality.
- Removed: Functionality or support that has been removed. (Compared to a previous release)
- Optimized: Component performance that has been optimized or improved.
- Resolved Issues: Known issues from a previous version that have been resolved.
- Not Applicable: This PR is not to be included in the changelog.

Fixing nonzero accumulator overflow and scatternd stride aware

f9b99c6

urpetkov-amd requested a review from TedThemistokleous May 28, 2026 15:44

Fixing licensing

1532a77

urpetkov-amd requested a review from pfultz2 May 28, 2026 15:52

Merge branch 'develop' into fix_non_zero_scatternd

6937dc6

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixing nonzero accumulator overflow and scatternd stride aware#4919

Fixing nonzero accumulator overflow and scatternd stride aware#4919
urpetkov-amd wants to merge 3 commits into
developfrom
fix_non_zero_scatternd

urpetkov-amd commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

urpetkov-amd commented May 28, 2026

Motivation

Technical Details

Bug 1 — nonzero_kernel: accumulator type too narrow

Bug 2 — scatternd_kernel: packed-layout assumption

Changelog Category

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Bug 1 — `nonzero_kernel`: accumulator type too narrow

Bug 2 — `scatternd_kernel`: packed-layout assumption