Skip to content

[AMDGPU] comgr: swap s_barrier_signal_isfirst to s_barrier_signal for A0#2287

Merged
lamb-j merged 2 commits into
ROCm:amd-stagingfrom
xintin:hotswap-barrier-isfirst
Apr 30, 2026
Merged

[AMDGPU] comgr: swap s_barrier_signal_isfirst to s_barrier_signal for A0#2287
lamb-j merged 2 commits into
ROCm:amd-stagingfrom
xintin:hotswap-barrier-isfirst

Conversation

@xintin
Copy link
Copy Markdown

@xintin xintin commented Apr 22, 2026

Summary

A0 silicon has a race condition where s_barrier_signal_isfirst with a user cluster barrier ID may return garbage in SCC before the barrier completes. The non-isfirst variant (s_barrier_signal) has the same encoding size and operand layout but does not write SCC.

This patch adds the s_barrier_signal_isfirst -> s_barrier_signal in-place opcode swap to the existing applyInPlacePatches dispatcher alongside the cluster_load -> global_load and s_clause -> s_nop rules from #2222. The swap is unconditional for all s_barrier_signal_isfirst instructions. Non-isfirst s_barrier_signal sites are left unchanged.

Changes

  • comgr-hotswap-patch-inplace.cpp: Add s_barrier_signal_isfirst -> s_barrier_signal swap using the existing resolveOpcode / swapOpcode helpers. Same pattern as the cluster_load swap -- no new infrastructure.
  • hotswap-barrier-isfirst.s: Positive lit test - kernel with 2x s_barrier_signal_isfirst -1. Verifies both are swapped to s_barrier_signal, waits are unchanged, no s_barrier_signal_isfirst remains anywhere in disassembly. Idempotency check via second rewrite + cmp.
  • hotswap-barrier-no-isfirst.s: Passthrough lit test - kernel with only s_barrier_signal (workgroup -1 and user cluster -3). Verifies layout is structurally unchanged and no s_barrier_signal_isfirst appears. Idempotency check.
  • hotswap-barrier-mixed.s: Selectivity lit test - kernel with both s_barrier_signal_isfirst and s_barrier_signal interleaved. Verifies only the isfirst sites are swapped; the pre-existing non-isfirst site is preserved. Guards against a starts_with / contains regression in the mnemonic match.

Tests

  • 3 lit tests (positive / passthrough / mixed selectivity) with strict CHECK-NEXT chains and CHECK-NOT brackets
  • All existing lit, ctest, and unit tests pass

Follow up (already tracked)

Cache all in-place target opcodes in LLVMState at init time (covers cluster_load, s_barrier_signal, and future in-place swaps). Currently each swap calls resolveOpcode per instruction site;
caching at init eliminates redundant assemble+decode on every match.
This is tracked as part of the hotswap code-refactoring effort alongside the others (#2253)

@xintin xintin added comgr Related to Code Object Manager hotswap Related to the Comgr Hotswap feature ci:skip Skip all CI builds/tests for this PR labels Apr 22, 2026
@z1-cciauto
Copy link
Copy Markdown
Collaborator

@ronlieb
Copy link
Copy Markdown
Collaborator

ronlieb commented Apr 22, 2026

killed the npsdb since it has CI:SKIP set on it

@z1-cciauto
Copy link
Copy Markdown
Collaborator

@xintin xintin force-pushed the hotswap-barrier-isfirst branch from 021af07 to ece47cf Compare April 29, 2026 18:27
@z1-cciauto
Copy link
Copy Markdown
Collaborator

@xintin xintin force-pushed the hotswap-barrier-isfirst branch from ece47cf to 9c951e2 Compare April 29, 2026 18:40
@xintin xintin removed the ci:skip Skip all CI builds/tests for this PR label Apr 29, 2026
@xintin xintin marked this pull request as ready for review April 29, 2026 18:42
@xintin xintin requested review from chinmaydd and lamb-j as code owners April 29, 2026 18:42
@z1-cciauto
Copy link
Copy Markdown
Collaborator

Signed-off-by: xintin <gaurav.verma@amd.com>
@xintin xintin force-pushed the hotswap-barrier-isfirst branch from 9c951e2 to b20a587 Compare April 29, 2026 19:20
@z1-cciauto
Copy link
Copy Markdown
Collaborator

Comment thread amd/comgr/src/comgr-hotswap-patch-inplace.cpp
Comment thread amd/comgr/src/comgr-hotswap-patch-inplace.cpp Outdated
Comment thread amd/comgr/test-lit/hotswap-barrier-isfirst.s
@z1-cciauto
Copy link
Copy Markdown
Collaborator

Copy link
Copy Markdown

@chinmaydd chinmaydd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, narrow transformation

Signed-off-by: xintin <gaurav.verma@amd.com>
@xintin xintin force-pushed the hotswap-barrier-isfirst branch from 1f7b645 to 345a42d Compare April 29, 2026 21:07
@z1-cciauto
Copy link
Copy Markdown
Collaborator

@lamb-j lamb-j merged commit e772624 into ROCm:amd-staging Apr 30, 2026
20 of 24 checks passed
kirthana14m pushed a commit that referenced this pull request Apr 30, 2026
… A0 (#2287)

## Summary

A0 silicon has a race condition where `s_barrier_signal_isfirst` with a
user cluster barrier ID may return garbage in SCC before the barrier
completes. The non-isfirst variant (`s_barrier_signal`) has the same
encoding size and operand layout but does not write SCC.

This patch adds the `s_barrier_signal_isfirst` -> `s_barrier_signal`
in-place opcode swap to the existing `applyInPlacePatches` dispatcher
alongside the `cluster_load -> global_load` and `s_clause -> s_nop`
rules from #2222. The swap is unconditional for all
`s_barrier_signal_isfirst` instructions. Non-isfirst `s_barrier_signal`
sites are left unchanged.

## Changes
- **comgr-hotswap-patch-inplace.cpp**: Add `s_barrier_signal_isfirst` ->
`s_barrier_signal` swap using the existing `resolveOpcode` /
`swapOpcode` helpers. Same pattern as the `cluster_load` swap -- no new
infrastructure.
- **hotswap-barrier-isfirst.s**: Positive lit test - kernel with 2x
`s_barrier_signal_isfirst -1`. Verifies both are swapped to
`s_barrier_signal`, waits are unchanged, no `s_barrier_signal_isfirst`
remains anywhere in disassembly. Idempotency check via second rewrite +
`cmp`.
- **hotswap-barrier-no-isfirst.s**: Passthrough lit test - kernel with
only `s_barrier_signal` (workgroup -1 and user cluster -3). Verifies
layout is structurally unchanged and no `s_barrier_signal_isfirst`
appears. Idempotency check.
- **hotswap-barrier-mixed.s**: Selectivity lit test - kernel with both
`s_barrier_signal_isfirst` and `s_barrier_signal` interleaved. Verifies
only the isfirst sites are swapped; the pre-existing non-isfirst site is
preserved. Guards against a `starts_with` / `contains` regression in the
mnemonic match.

## Tests
- 3 lit tests (positive / passthrough / mixed selectivity) with strict
CHECK-NEXT chains and CHECK-NOT brackets
- All existing lit, ctest, and unit tests pass

## Follow up (already tracked)  
Cache all in-place target opcodes in LLVMState at init time (covers
cluster_load, s_barrier_signal, and future in-place swaps). Currently
each swap calls resolveOpcode per instruction site;
caching at init eliminates redundant assemble+decode on every match.
This is tracked as part of the hotswap code-refactoring effort alongside
the others (#2253)

---------

Signed-off-by: xintin <gaurav.verma@amd.com>
jammm added a commit that referenced this pull request Apr 30, 2026
Resolves two textual conflicts that surfaced after upstream landed
new hotswap patches (#2265 WMMA hazard, #2287 barrier rewrite, etc.):

* amd/comgr/CMakeLists.txt
  Both #2265 (`comgr-hotswap-patch-wmma-hazard.cpp`) and this branch
  (`comgr-hotswap-patch-wmma-split.cpp`) add a source file at the
  same position in the SOURCES list. Resolved by including both in
  alphabetical order (hazard, then split).

* amd/comgr/test-lit/comgr-sources/hotswap-rewrite.c
  This branch added an opt-in `--allow-grow` flag to bypass the
  output-size check on the `--output` path (PR #1 commit eacb7dc).
  In #2265, upstream went the opposite direction and removed the
  size check on the `--output` path entirely. Resolved by taking
  upstream's version verbatim, since their direction is the
  consensus and it makes the `--allow-grow` flag redundant.

  This effectively reverts the eacb7dc commit's diff for this file.
  The driver-side `--allow-grow` flag parsing is gone after the
  merge; my LIT tests are updated below to drop the flag from their
  RUN lines so they invoke the driver as upstream now expects.

Also stripped `--allow-grow` from the two WMMA-split LIT tests
(`hotswap-wmma-split.s`, `hotswap-wmma-split-imm-src2.s`) so they
match the new driver signature.

Verified: all 15 hotswap LIT tests pass after the merge -- the 6
new upstream ones (hotswap-barrier-{,no-}isfirst.s,
hotswap-wmma-hazard.s, etc.) plus the existing 7 plus my 2.
lamb-j pushed a commit that referenced this pull request May 2, 2026
## Summary
VOP3PX2 V_WMMA_SCALE* instructions have an unused scale_src2 field at
bits [58:50] (see VOP3PX2e::Inst{58-50} in VOP3PInstructions.td) that
the SQ incorrectly decodes as an SGPR reference, causing a 3-cycle SALU
stall after WMMA co-execution. This patch sets the field to VGPR0
encoding (0x100) to prevent the false dependency. Applies to both A0 and
B0 steppings.

The fix uses raw byte manipulation because scale_src2 is a hardware
encoding artifact not modeled as an MC operand. The MC layer has no
mechanism to read or set this field.

This only fires on the B0-to-A0 hotswap rewrite path. A0-native binaries
are compiled with an A0-targeted Clang that sets the field correctly at
codegen time.

## Changes
- **comgr-hotswap-patch-vop3px2-src2.cpp**: `applyVop3px2Src2Fix`
whole-kernel pass. Matches the 4 known `v_wmma_scale*` mnemonics via
explicit `StringSwitch` enumeration (not `starts_with`), patches bits
[58:50] in place. Includes diagnostic logging on size/bounds skip paths
and a gated summary log.
- **comgr-hotswap-internal.h**: Declare `patchScaleSrc2` for unit
testing.
- **comgr-hotswap-b0a0.cpp**: Add declaration, weak stub, and dispatcher
call for `applyVop3px2Src2Fix`.
- **CMakeLists.txt**: Add new source to library build.
- **hotswap-vop3px2-src2.s**: Positive lit test -- kernel with
`v_wmma_scale_f32_16x16x128_f8f6f4`. Verifies instruction survives
patching and idempotency (second rewrite produces identical bytes, which
proves the first pass set the field to VGPR0 since the assembler default
is not VGPR0).
- **hotswap-vop3px2-src2-noop.s**: Passthrough lit test -- kernel with
regular (non-scale) WMMA. Verifies no `v_wmma_scale` appears and layout
is preserved.
- **HotswapMCTest.cpp**: 6 new `PatchScaleSrc2` unit tests covering
zeroed field, all-ones field, already-VGPR0 (returns false), idempotency
(patch + re-patch), byte preservation outside scale_src2, and
non-scale-src2 bit preservation.
- **test-unit/CMakeLists.txt**: Link new source into HotswapMCTests.

## Tests
- 2 lit tests (positive + passthrough) with idempotency checks
- 6 unit tests for `patchScaleSrc2` byte-level correctness
- All existing lit, ctest, and unit tests pass

## Follow-up (deferred)
- [ ] `isVop3px2ScaleInst` unit tests (known-variant + near-miss
coverage); requires exposing the function in the header
- [ ] TSFlags-based dispatch instead of mnemonic StringSwitch
- [ ] Cache resolved opcodes in LLVMState at init time (same follow-up
as #2287)

Signed-off-by: xintin <gaurav.verma@amd.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comgr Related to Code Object Manager hotswap Related to the Comgr Hotswap feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants