Fixing nonzero accumulator overflow and scatternd stride aware#4919
Draft
urpetkov-amd wants to merge 3 commits into
Draft
Fixing nonzero accumulator overflow and scatternd stride aware#4919urpetkov-amd wants to merge 3 commits into
urpetkov-amd wants to merge 3 commits into
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
Two latent correctness bugs in the GPU kernels for
NonZeroandScatterNDcombined to break any model whose runtime path runs
on more than ~255 selected elements. The vision-encoder model used by
SmolVLM (
vision_encoder.onnxwithpixel_attention_maskall-true andpixel_valuesfrom a real image) is the simplest reproducer.Running
on Windows produced one of two failure modes depending on inputs:
NonZero):nonzero_kernelreturnedhipErrorLaunchFailurefrom anout-of-bounds store and the process exited.
indices layout downstream): the whole run completed but
migraphx-driver verifyshowedRMS ≈ 0.58,Max diff ≈ 1023,with the first divergence at
output[0] = 0vs reference1.verify --bisectlocalized the first divergent instruction to ascatternd_kernelwhose indices tensor had logical shape{1024, 2}but strides
{1, 1024}.Both failure modes share one root cause: a GPU kernel making an
assumption about its inputs that the surrounding pipeline does not
guarantee. The goals of this PR are:
NonZerocorrect for inputs with more than 255 set elements.ScatterNDcorrect forindicestensors with arbitrary strides(in particular, the non-packed layouts produced by
transpose/slice/squeeze/concatchains downstream ofNonZero).case.
Technical Details
Bug 1 —
nonzero_kernel: accumulator type too narrowsrc/targets/gpu/kernels/include/migraphx/kernels/nonzero.hppThe kernel uses a single-block prefix sum (
block_scan) to assign eachnon-zero input position a contiguous output slot. The element-wise
predicate was returning
uint8_t:block_scandeduces the running-sum accumulator type from the predicatereturn type, so the accumulator was also
uint8_t. Once more than 255input elements were non-zero the accumulator wrapped,
value - 1becamea very large
index_int, and the kernel wrote intoat an out-of-bounds offset, which on AMD GPU surfaces as
hipErrorLaunchFailure.Fix: widen the accumulator by widening the predicate return type to
index_int(==uint32_t), which is the same type already usedeverywhere else for flat-index arithmetic in the kernels:
The added
MIGRAPHX_ASSERT(value > 0)catches the broken-invariant casein debug builds: after the
float_equal(input[j], 0)early-out, anythread still in flight must have contributed
1to the prefix sum, sothe running sum is strictly positive at that point. With the old
uint8_taccumulator the assertion would have fired immediately onoverflow.
Why this scope is correct:
uint8_twas not chosen for memory bandwidth —block_scanoperateson shared memory and the cost is identical with
uint32_t.out_locis used as a tensor coordinate, which is alreadyindex_inteverywhere else, so widening at the source removes theonly narrowing in the data path.
path produced identical results.
Bug 2 —
scatternd_kernel: packed-layout assumptionsrc/targets/gpu/kernels/include/migraphx/kernels/scatternd.hppFor each update element the kernel needs to read the
kscattercomponents for the current outer position from the
indicestensor.The original code did:
Reading carefully:
indices_shape.index(indices_idx)correctly computes the memoryoffset of the first of the
kcomponents via the strides.index_start + kthen advanceskmemory slots forward, andcopy(...)readskcontiguous int64 values from memory.That is a packed-layout assumption: it requires
strides_indices.back() == 1and no later dim to interleave thekcomponents in memory. MIGraphX does not guarantee that — the chain
(which is what every
NonZero -> Gather/Scattermodel produces) handsthe scatter operator an
indicestensor with logical shape{N, k}butin-memory layout that of
{k, N}. For{N, k} = {1024, 2}the actualstrides are
{1, 1024}and the memory layout isFor thread
t, the kernel was readingmemory[t]andmemory[t+1],i.e.
(indices[t, 0], indices[t+1, 0])— instead of the intended(indices[t, 0], indices[t, 1]). In the vision-encoder case every rowcomponent is
0, so every thread ended up scattering tooutput[{0, 0}], producing both anassign_nonewrite race and stalevalues everywhere else in the output.
Fix: use the tensor view's stride-aware iterator.
begin_at(multi_index)returns an iota-style iterator carrying thelogical flat index
Shape::single(multi_index). Each dereferenceroutes through
tensor_view::operator[](index_int)→Shape::index(index_int)→compute_index, which produces the correctmemory offset for any stride pattern. Incrementing the iterator by 1 in
logical flat space corresponds to incrementing the last axis by 1,
which is exactly the walk we want for reading the
kscattercomponents.
Why this scope is correct:
Shape::index(index_int)takes theif (this->standard()) return i;fast path, so the generated memoryaccesses are identical to the previous code. No hot-path regression.
strides — the case that used to be silently wrong.
scatter.hpp(Scatter,non-ND) uses
indices_t[i](single flat index), which alreadyroutes through
compute_index. Gather kernels likewise.Changelog Category
Add a
CHANGELOG.mdentry for any option other thanNot Applicable