Skip to content

[OpenMP][DeviceRTL] Fix wrong iteration stride in DistributeFor SPMD no-loop fast path#2602

Open
sbryngelson wants to merge 1 commit into
ROCm:amd-stagingfrom
sbryngelson:fix/no-loop-spmd-omp-get-num-threads
Open

[OpenMP][DeviceRTL] Fix wrong iteration stride in DistributeFor SPMD no-loop fast path#2602
sbryngelson wants to merge 1 commit into
ROCm:amd-stagingfrom
sbryngelson:fix/no-loop-spmd-omp-get-num-threads

Conversation

@sbryngelson
Copy link
Copy Markdown

Fixes #2601.

Problem

When compiled with -fopenmp-target-fast (which implies both
-fopenmp-assume-threads-oversubscription and
-fopenmp-assume-teams-oversubscription), target teams distribute parallel do loops that use array expressions (constructors [a,b,c] or
whole-array slice ops A(i,:)) silently produce wrong output: a
deterministic suffix of iterations never executes, leaving those output
elements at zero.

Trigger: all three must be present:

  1. -fopenmp-assume-threads-oversubscription AND -fopenmp-assume-teams-oversubscription
  2. -fopenmp-target-fast
  3. Loop body uses array expressions that pull in the device stdlib (for implicit malloc)

-O0 through -O3 without -fopenmp-target-fast all pass. Scalar
assignments are immune.

Quantitative signature (verified on amdflang 23.1.0–23.2.1, gfx90a):

N_wrong = max(0,  N − (⌈N/32⌉ + 31))

e.g. N=64 → 31 wrong, N=128 → 93 wrong. Always the same (last) iterations,
always zero, perfectly deterministic.

Root cause

The five-step chain is detailed in issue #2601. The short version:

  1. Both oversubscription flags → canPromoteToNoLoopno_loop kernel mode → one_iteration_per_thread=1 emitted as a literal constant in the call to __kmpc_distribute_for_static_loop_4u.

  2. Array-expression kernels set MayUseNestedParallelism=1 in KernelEnvironmentTy (device stdlib is linked in). This prevents LTO from replacing omp_get_num_threads() with a hardware register read.

  3. Instead, LTO hoists omp_get_num_threads() to kernel entry — before __kmpc_parallel_spmd has set icv::Level=1. At Level=0, omp_get_num_threads() returns 1.

  4. DistributeFor receives NumThreads=1. NormalizedLoopNestNoChunk computes:

    KernelIteration = NumBlocks * 1        // should be NumBlocks * blocksize
    IV = BId * 1 + TId                     // should be BId * blocksize + TId
    
  5. With K=⌈N/32⌉ blocks of 32 threads: coverage is flat_ids 0..(K+30) only.
    Flat_ids K+31..N-1 are never assigned → those iterations never execute.

Scalar kernels are immune because MayUseNestedParallelism=0 allows LTO to
replace omp_get_num_threads() with mapping::getMaxTeamThreads() (a
hardware register read that is correct at any point in kernel execution).

Fix

In DistributeFor, when OneIterationPerThread=1, override NumThreads
with mapping::getMaxTeamThreads() before it is used for BlockChunk
defaulting and stride computation. This hardware register is always valid
at kernel entry and reflects the actual number of threads per block.

if (OneIterationPerThread)
  NumThreads = static_cast<Ty>(mapping::getMaxTeamThreads());

The fix is placed before BlockChunk = NumThreads so the chunk size is
also set correctly.

Minimal Fortran reproducer

program t
    integer, parameter :: N = 64
    integer :: out(N, 2), i, nerr
    !$omp target teams distribute parallel do map(from:out)
    do i = 1, N
        out(i,:) = [i, i*2]
    end do
    !$omp end target teams distribute parallel do
    nerr = 0
    do i = 1, N
        if (out(i,1) /= i .or. out(i,2) /= i*2) nerr = nerr + 1
    end do
    if (nerr == 0) then
        print *, "PASS"
    else
        print *, "FAIL:", nerr, "of", N   ! without fix: FAIL: 31 of 64
    end if
end program

Compile with:

amdflang -fopenmp --offload-arch=gfx90a -fopenmp-target-fast \
  -fopenmp-assume-threads-oversubscription \
  -fopenmp-assume-teams-oversubscription \
  -o t t.f90 && ./t

A full test suite with additional cases (whole-array ops, varying N, scalar
immunity verification) is at https://github.com/sbryngelson/amd-flang-omp-bugs.

Testing

  • Verified on amdflang 23.1.0, 23.2.0, 23.2.1 / gfx90a (MI250X / Frontier)
  • Minimal reproducer above: FAIL: 31 of 64 without fix, PASS with fix
  • Scalar kernel unaffected (immune path unchanged)
  • -O0 through -O3 paths unaffected (OneIterationPerThread=0)
  • The ASSERT(NumBlocks * NumThreads >= NumIters) now checks against the correct blocksize

…st path

In the SPMD no-loop fast path (one_iteration_per_thread=1, triggered by
-fopenmp-target-fast with both oversubscription flags), the NumThreads
argument to DistributeFor derives from omp_get_num_threads() which is
evaluated before the parallel region is active. At that point icv::Level=0
and the parallel team state has not been initialized, so omp_get_num_threads()
returns 1 instead of the actual hardware block size.

NormalizedLoopNestNoChunk uses NumThreads as the per-block stride:
  IV = BId * NumThreads + TId     // = BId * 1 + TId  (wrong)
  KernelIteration = NumBlocks * NumThreads  // = NumBlocks * 1  (wrong)

With blocksize=32 and K=ceil(N/32) blocks this leaves iterations
K+31 .. N-1 permanently unexecuted (output stays at zero).

Fix: when OneIterationPerThread=1, override NumThreads with
mapping::getMaxTeamThreads() which reads directly from hardware registers
and is always valid at kernel entry. This also corrects the BlockChunk
default and the ASSERT that follow.

Reproducer (fails without fix, passes with fix on gfx90a / MI250X):

  integer, parameter :: N = 64
  integer :: out(N, 2), i, nerr
  !$omp target teams distribute parallel do map(from:out)
  do i = 1, N; out(i,:) = [i, i*2]; end do
  !$omp end target teams distribute parallel do
  ! Fails: 31 of 64 cells are zero (never written)

Verified on amdflang 23.1.0 / 23.2.0 / 23.2.1, gfx90a (MI250X).

Fixes: ROCm#2601
@sbryngelson sbryngelson force-pushed the fix/no-loop-spmd-omp-get-num-threads branch from b8906a1 to d0d7275 Compare May 21, 2026 22:14
@sbryngelson
Copy link
Copy Markdown
Author

CI failure (Attempt #2) was unrelated to this PR.

Both Linux and Windows failed in the compiler-runtime stage because spirv-llvm-translator/lib/SPIRV/SPIRVReader.cpp couldn't find llvm/TargetParser/AMDGPUTargetParser.h — a pre-existing breakage in amd-staging around May 19 that has since been fixed (PR #2639, merged May 21, passes cleanly).

The fix was to rebase this branch onto the current amd-staging so the new CI run merges against the fixed tree.

@DominikAdamski DominikAdamski self-requested a review May 21, 2026 22:27
Copy link
Copy Markdown

@DominikAdamski DominikAdamski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for the patch.

@sbryngelson
Copy link
Copy Markdown
Author

CI failed on what appears to be a flaky test (many runs fail it spuriously). I can look into it further if appropriate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

amdflang: array constructor and whole-array slice ops produce wrong values in target loop when both oversubscription flags are set

2 participants