[OpenMP][DeviceRTL] Fix wrong iteration stride in DistributeFor SPMD no-loop fast path by sbryngelson · Pull Request #2602 · ROCm/llvm-project

sbryngelson · 2026-05-19T19:26:52Z

Problem

When compiled with -fopenmp-target-fast (which implies both
-fopenmp-assume-threads-oversubscription and
-fopenmp-assume-teams-oversubscription), target teams distribute parallel do loops that use array expressions (constructors [a,b,c] or
whole-array slice ops A(i,:)) silently produce wrong output: a
deterministic suffix of iterations never executes, leaving those output
elements at zero.

Trigger: all three must be present:

-fopenmp-assume-threads-oversubscription AND -fopenmp-assume-teams-oversubscription
-fopenmp-target-fast
Loop body uses array expressions that pull in the device stdlib (for implicit malloc)

-O0 through -O3 without -fopenmp-target-fast all pass. Scalar
assignments are immune.

Quantitative signature (verified on amdflang 23.1.0–23.2.1, gfx90a):

N_wrong = max(0,  N − (⌈N/32⌉ + 31))

e.g. N=64 → 31 wrong, N=128 → 93 wrong. Always the same (last) iterations,
always zero, perfectly deterministic.

Root cause

The five-step chain is detailed in issue #2601. The short version:

Both oversubscription flags → canPromoteToNoLoop → no_loop kernel mode → one_iteration_per_thread=1 emitted as a literal constant in the call to __kmpc_distribute_for_static_loop_4u.
Array-expression kernels set MayUseNestedParallelism=1 in KernelEnvironmentTy (device stdlib is linked in). This prevents LTO from replacing omp_get_num_threads() with a hardware register read.
Instead, LTO hoists omp_get_num_threads() to kernel entry — before __kmpc_parallel_spmd has set icv::Level=1. At Level=0, omp_get_num_threads() returns 1.

DistributeFor receives NumThreads=1. NormalizedLoopNestNoChunk computes:

KernelIteration = NumBlocks * 1        // should be NumBlocks * blocksize
IV = BId * 1 + TId                     // should be BId * blocksize + TId

With K=⌈N/32⌉ blocks of 32 threads: coverage is flat_ids 0..(K+30) only.
Flat_ids K+31..N-1 are never assigned → those iterations never execute.

Scalar kernels are immune because MayUseNestedParallelism=0 allows LTO to
replace omp_get_num_threads() with mapping::getMaxTeamThreads() (a
hardware register read that is correct at any point in kernel execution).

Fix

In DistributeFor, when OneIterationPerThread=1, override NumThreads
with mapping::getMaxTeamThreads() before it is used for BlockChunk
defaulting and stride computation. This hardware register is always valid
at kernel entry and reflects the actual number of threads per block.

if (OneIterationPerThread)
  NumThreads = static_cast<Ty>(mapping::getMaxTeamThreads());

The fix is placed before BlockChunk = NumThreads so the chunk size is
also set correctly.

Minimal Fortran reproducer

program t
    integer, parameter :: N = 64
    integer :: out(N, 2), i, nerr
    !$omp target teams distribute parallel do map(from:out)
    do i = 1, N
        out(i,:) = [i, i*2]
    end do
    !$omp end target teams distribute parallel do
    nerr = 0
    do i = 1, N
        if (out(i,1) /= i .or. out(i,2) /= i*2) nerr = nerr + 1
    end do
    if (nerr == 0) then
        print *, "PASS"
    else
        print *, "FAIL:", nerr, "of", N   ! without fix: FAIL: 31 of 64
    end if
end program

Compile with:

amdflang -fopenmp --offload-arch=gfx90a -fopenmp-target-fast \
  -fopenmp-assume-threads-oversubscription \
  -fopenmp-assume-teams-oversubscription \
  -o t t.f90 && ./t

A full test suite with additional cases (whole-array ops, varying N, scalar
immunity verification) is at https://github.com/sbryngelson/amd-flang-omp-bugs.

Testing

Verified on amdflang 23.1.0, 23.2.0, 23.2.1 / gfx90a (MI250X / Frontier)
Minimal reproducer above: FAIL: 31 of 64 without fix, PASS with fix
Scalar kernel unaffected (immune path unchanged)
-O0 through -O3 paths unaffected (OneIterationPerThread=0)
The ASSERT(NumBlocks * NumThreads >= NumIters) now checks against the correct blocksize

…st path In the SPMD no-loop fast path (one_iteration_per_thread=1, triggered by -fopenmp-target-fast with both oversubscription flags), the NumThreads argument to DistributeFor derives from omp_get_num_threads() which is evaluated before the parallel region is active. At that point icv::Level=0 and the parallel team state has not been initialized, so omp_get_num_threads() returns 1 instead of the actual hardware block size. NormalizedLoopNestNoChunk uses NumThreads as the per-block stride: IV = BId * NumThreads + TId // = BId * 1 + TId (wrong) KernelIteration = NumBlocks * NumThreads // = NumBlocks * 1 (wrong) With blocksize=32 and K=ceil(N/32) blocks this leaves iterations K+31 .. N-1 permanently unexecuted (output stays at zero). Fix: when OneIterationPerThread=1, override NumThreads with mapping::getMaxTeamThreads() which reads directly from hardware registers and is always valid at kernel entry. This also corrects the BlockChunk default and the ASSERT that follow. Reproducer (fails without fix, passes with fix on gfx90a / MI250X): integer, parameter :: N = 64 integer :: out(N, 2), i, nerr !$omp target teams distribute parallel do map(from:out) do i = 1, N; out(i,:) = [i, i*2]; end do !$omp end target teams distribute parallel do ! Fails: 31 of 64 cells are zero (never written) Verified on amdflang 23.1.0 / 23.2.0 / 23.2.1, gfx90a (MI250X). Fixes: ROCm#2601

sbryngelson · 2026-05-21T22:15:07Z

CI failure (Attempt #2) was unrelated to this PR.

Both Linux and Windows failed in the compiler-runtime stage because spirv-llvm-translator/lib/SPIRV/SPIRVReader.cpp couldn't find llvm/TargetParser/AMDGPUTargetParser.h — a pre-existing breakage in amd-staging around May 19 that has since been fixed (PR #2639, merged May 21, passes cleanly).

The fix was to rebase this branch onto the current amd-staging so the new CI run merges against the fixed tree.

DominikAdamski

LGTM, thanks for the patch.

sbryngelson · 2026-05-22T20:21:20Z

CI failed on what appears to be a flaky test (many runs fail it spuriously). I can look into it further if appropriate.

sbryngelson force-pushed the fix/no-loop-spmd-omp-get-num-threads branch from b8906a1 to d0d7275 Compare May 21, 2026 22:14

DominikAdamski self-requested a review May 21, 2026 22:27

DominikAdamski approved these changes May 21, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[OpenMP][DeviceRTL] Fix wrong iteration stride in DistributeFor SPMD no-loop fast path#2602

[OpenMP][DeviceRTL] Fix wrong iteration stride in DistributeFor SPMD no-loop fast path#2602
sbryngelson wants to merge 1 commit into
ROCm:amd-stagingfrom
sbryngelson:fix/no-loop-spmd-omp-get-num-threads

sbryngelson commented May 19, 2026

Uh oh!

sbryngelson commented May 21, 2026

Uh oh!

DominikAdamski left a comment

Uh oh!

sbryngelson commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

sbryngelson commented May 19, 2026

Problem

Root cause

Fix

Minimal Fortran reproducer

Testing

Uh oh!

sbryngelson commented May 21, 2026

Uh oh!

DominikAdamski left a comment

Choose a reason for hiding this comment

Uh oh!

sbryngelson commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants