Skip to content

Fixing nonzero accumulator overflow and scatternd stride aware#4919

Draft
urpetkov-amd wants to merge 3 commits into
developfrom
fix_non_zero_scatternd
Draft

Fixing nonzero accumulator overflow and scatternd stride aware#4919
urpetkov-amd wants to merge 3 commits into
developfrom
fix_non_zero_scatternd

Conversation

@urpetkov-amd
Copy link
Copy Markdown
Collaborator

Motivation

Two latent correctness bugs in the GPU kernels for NonZero and ScatterND
combined to break any model whose runtime path runs

NonZero -> unsqueeze -> transpose -> slice -> concat -> ScatterND

on more than ~255 selected elements. The vision-encoder model used by
SmolVLM (vision_encoder.onnx with pixel_attention_mask all-true and
pixel_values from a real image) is the simplest reproducer.

Running

migraphx-driver perf vision_encoder.onnx \
    --input-dim @pixel_values 1 1 3 512 512 \
                @pixel_attention_mask 1 1 512 512 \
    --fill1 pixel_attention_mask

on Windows produced one of two failure modes depending on inputs:

  1. With every mask element true (1024 set bits feeding NonZero):
    nonzero_kernel returned hipErrorLaunchFailure from an
    out-of-bounds store and the process exited.
  2. With non-trivial mask values (≤ 1024 set bits but a non-packed
    indices layout downstream): the whole run completed but
    migraphx-driver verify showed RMS ≈ 0.58, Max diff ≈ 1023,
    with the first divergence at output[0] = 0 vs reference 1.
    verify --bisect localized the first divergent instruction to a
    scatternd_kernel whose indices tensor had logical shape {1024, 2}
    but strides {1, 1024}.

Both failure modes share one root cause: a GPU kernel making an
assumption about its inputs that the surrounding pipeline does not
guarantee. The goals of this PR are:

  • Make NonZero correct for inputs with more than 255 set elements.
  • Make ScatterND correct for indices tensors with arbitrary strides
    (in particular, the non-packed layouts produced by
    transpose/slice/squeeze/concat chains downstream of NonZero).
  • Keep both kernels behaviorally identical for every previously-passing
    case.

Technical Details

Bug 1 — nonzero_kernel: accumulator type too narrow

src/targets/gpu/kernels/include/migraphx/kernels/nonzero.hpp

The kernel uses a single-block prefix sum (block_scan) to assign each
non-zero input position a contiguous output slot. The element-wise
predicate was returning uint8_t:

// before
block_scan(
    idx,
    op::sum{},
    0,
    elem_num,
    [&](auto j) -> uint8_t { return float_equal(input[j], 0) ? 0 : 1; },
    ...
);

block_scan deduces the running-sum accumulator type from the predicate
return type, so the accumulator was also uint8_t. Once more than 255
input elements were non-zero the accumulator wrapped, value - 1 became
a very large index_int, and the kernel wrote into

output[make_array<index_int>(k, out_loc)] = multi_idx[k];

at an out-of-bounds offset, which on AMD GPU surfaces as
hipErrorLaunchFailure.

Fix: widen the accumulator by widening the predicate return type to
index_int (== uint32_t), which is the same type already used
everywhere else for flat-index arithmetic in the kernels:

// after
// input (elem_num) uint32_t covers any input we realistically see;
// a narrower type uint8_t wraps once the prefix sum exceeds the
// type's range, producing negative out_loc values and OOB stores
block_scan(
    idx,
    op::sum{},
    index_int{0},
    elem_num,
    [&](auto j) -> index_int { return float_equal(input[j], 0) ? 0 : 1; },
    [&](auto j, auto value) {
        MIGRAPHX_ASSERT(j < elem_num);
        if(float_equal(input[j], 0))
            return;
        MIGRAPHX_ASSERT(value > 0);
        const index_int out_loc = value - 1;
        const auto multi_idx    = in_shape.multi(j);
        for(auto k = 0; k < multi_idx.size(); ++k)
        {
            output[make_array<index_int>(k, out_loc)] = multi_idx[k];
        }
    });

The added MIGRAPHX_ASSERT(value > 0) catches the broken-invariant case
in debug builds: after the float_equal(input[j], 0) early-out, any
thread still in flight must have contributed 1 to the prefix sum, so
the running sum is strictly positive at that point. With the old
uint8_t accumulator the assertion would have fired immediately on
overflow.

Why this scope is correct:

  • uint8_t was not chosen for memory bandwidth — block_scan operates
    on shared memory and the cost is identical with uint32_t.
  • out_loc is used as a tensor coordinate, which is already
    index_int everywhere else, so widening at the source removes the
    only narrowing in the data path.
  • No behavior change for any input with ≤ 255 set bits — the prior
    path produced identical results.

Bug 2 — scatternd_kernel: packed-layout assumption

src/targets/gpu/kernels/include/migraphx/kernels/scatternd.hpp

For each update element the kernel needs to read the k scatter
components for the current outer position from the indices tensor.
The original code did:

// before
auto index_start = indices_t.begin() + indices_shape.index(indices_idx);
auto index_end   = index_start + k;
auto out_idx     = output_shape.multi(0);
copy(index_start, index_end, out_idx.begin());

Reading carefully:

  • indices_shape.index(indices_idx) correctly computes the memory
    offset of the first of the k components via the strides.
  • index_start + k then advances k memory slots forward, and
    copy(...) reads k contiguous int64 values from memory.

That is a packed-layout assumption: it requires
strides_indices.back() == 1 and no later dim to interleave the k
components in memory. MIGraphX does not guarantee that — the chain

unsqueeze[axes={1}] -> transpose[perm={0,2,1}] -> slice -> squeeze -> concat

(which is what every NonZero -> Gather/Scatter model produces) hands
the scatter operator an indices tensor with logical shape {N, k} but
in-memory layout that of {k, N}. For {N, k} = {1024, 2} the actual
strides are {1, 1024} and the memory layout is

addr:  [   0,    1, …, 1023, 1024, 1025, …, 2047]
value: [ r_0,  r_1, …, r_1023, c_0,  c_1, …, c_1023]
       └──── all rows ────┘  └──── all columns ────┘

For thread t, the kernel was reading memory[t] and memory[t+1],
i.e. (indices[t, 0], indices[t+1, 0]) — instead of the intended
(indices[t, 0], indices[t, 1]). In the vision-encoder case every row
component is 0, so every thread ended up scattering to
output[{0, 0}], producing both an assign_none write race and stale
values everywhere else in the output.

Fix: use the tensor view's stride-aware iterator.

// after
// begin_at is stride-aware; raw begin()+offset would only be correct
// for packed indices.
auto index_start = indices_t.begin_at(indices_idx);
auto out_idx     = output_shape.multi(0);
copy(index_start, index_start + k, out_idx.begin());
copy(updates_idx.begin() + q - 1, updates_idx.end(), out_idx.begin() + k);

begin_at(multi_index) returns an iota-style iterator carrying the
logical flat index Shape::single(multi_index). Each dereference
routes through tensor_view::operator[](index_int)
Shape::index(index_int)compute_index, which produces the correct
memory offset for any stride pattern. Incrementing the iterator by 1 in
logical flat space corresponds to incrementing the last axis by 1,
which is exactly the walk we want for reading the k scatter
components.

Why this scope is correct:

  • For packed layouts Shape::index(index_int) takes the
    if (this->standard()) return i; fast path, so the generated memory
    accesses are identical to the previous code. No hot-path regression.
  • The slow path only fires when the input actually has non-standard
    strides — the case that used to be silently wrong.
  • No other kernel needs the same surgery. scatter.hpp (Scatter,
    non-ND) uses indices_t[i] (single flat index), which already
    routes through compute_index. Gather kernels likewise.

Changelog Category

Add a CHANGELOG.md entry for any option other than Not Applicable

    • Added: New functionality.
    • Changed: Changes to existing functionality.
    • Removed: Functionality or support that has been removed. (Compared to a previous release)
    • Optimized: Component performance that has been optimized or improved.
    • Resolved Issues: Known issues from a previous version that have been resolved.
    • Not Applicable: This PR is not to be included in the changelog.

@urpetkov-amd urpetkov-amd requested a review from pfultz2 May 28, 2026 15:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant