[AMDGPU] comgr: swap s_barrier_signal_isfirst to s_barrier_signal for A0#2287
Merged
Conversation
Collaborator
Collaborator
|
killed the npsdb since it has CI:SKIP set on it |
This was referenced Apr 22, 2026
5077487 to
021af07
Compare
Collaborator
021af07 to
ece47cf
Compare
Collaborator
ece47cf to
9c951e2
Compare
Collaborator
Signed-off-by: xintin <gaurav.verma@amd.com>
9c951e2 to
b20a587
Compare
Collaborator
chinmaydd
reviewed
Apr 29, 2026
chinmaydd
reviewed
Apr 29, 2026
chinmaydd
reviewed
Apr 29, 2026
3 tasks
Collaborator
1f7b645 to
345a42d
Compare
Collaborator
kirthana14m
pushed a commit
that referenced
this pull request
Apr 30, 2026
… A0 (#2287) ## Summary A0 silicon has a race condition where `s_barrier_signal_isfirst` with a user cluster barrier ID may return garbage in SCC before the barrier completes. The non-isfirst variant (`s_barrier_signal`) has the same encoding size and operand layout but does not write SCC. This patch adds the `s_barrier_signal_isfirst` -> `s_barrier_signal` in-place opcode swap to the existing `applyInPlacePatches` dispatcher alongside the `cluster_load -> global_load` and `s_clause -> s_nop` rules from #2222. The swap is unconditional for all `s_barrier_signal_isfirst` instructions. Non-isfirst `s_barrier_signal` sites are left unchanged. ## Changes - **comgr-hotswap-patch-inplace.cpp**: Add `s_barrier_signal_isfirst` -> `s_barrier_signal` swap using the existing `resolveOpcode` / `swapOpcode` helpers. Same pattern as the `cluster_load` swap -- no new infrastructure. - **hotswap-barrier-isfirst.s**: Positive lit test - kernel with 2x `s_barrier_signal_isfirst -1`. Verifies both are swapped to `s_barrier_signal`, waits are unchanged, no `s_barrier_signal_isfirst` remains anywhere in disassembly. Idempotency check via second rewrite + `cmp`. - **hotswap-barrier-no-isfirst.s**: Passthrough lit test - kernel with only `s_barrier_signal` (workgroup -1 and user cluster -3). Verifies layout is structurally unchanged and no `s_barrier_signal_isfirst` appears. Idempotency check. - **hotswap-barrier-mixed.s**: Selectivity lit test - kernel with both `s_barrier_signal_isfirst` and `s_barrier_signal` interleaved. Verifies only the isfirst sites are swapped; the pre-existing non-isfirst site is preserved. Guards against a `starts_with` / `contains` regression in the mnemonic match. ## Tests - 3 lit tests (positive / passthrough / mixed selectivity) with strict CHECK-NEXT chains and CHECK-NOT brackets - All existing lit, ctest, and unit tests pass ## Follow up (already tracked) Cache all in-place target opcodes in LLVMState at init time (covers cluster_load, s_barrier_signal, and future in-place swaps). Currently each swap calls resolveOpcode per instruction site; caching at init eliminates redundant assemble+decode on every match. This is tracked as part of the hotswap code-refactoring effort alongside the others (#2253) --------- Signed-off-by: xintin <gaurav.verma@amd.com>
jammm
added a commit
that referenced
this pull request
Apr 30, 2026
Resolves two textual conflicts that surfaced after upstream landed new hotswap patches (#2265 WMMA hazard, #2287 barrier rewrite, etc.): * amd/comgr/CMakeLists.txt Both #2265 (`comgr-hotswap-patch-wmma-hazard.cpp`) and this branch (`comgr-hotswap-patch-wmma-split.cpp`) add a source file at the same position in the SOURCES list. Resolved by including both in alphabetical order (hazard, then split). * amd/comgr/test-lit/comgr-sources/hotswap-rewrite.c This branch added an opt-in `--allow-grow` flag to bypass the output-size check on the `--output` path (PR #1 commit eacb7dc). In #2265, upstream went the opposite direction and removed the size check on the `--output` path entirely. Resolved by taking upstream's version verbatim, since their direction is the consensus and it makes the `--allow-grow` flag redundant. This effectively reverts the eacb7dc commit's diff for this file. The driver-side `--allow-grow` flag parsing is gone after the merge; my LIT tests are updated below to drop the flag from their RUN lines so they invoke the driver as upstream now expects. Also stripped `--allow-grow` from the two WMMA-split LIT tests (`hotswap-wmma-split.s`, `hotswap-wmma-split-imm-src2.s`) so they match the new driver signature. Verified: all 15 hotswap LIT tests pass after the merge -- the 6 new upstream ones (hotswap-barrier-{,no-}isfirst.s, hotswap-wmma-hazard.s, etc.) plus the existing 7 plus my 2.
lamb-j
pushed a commit
that referenced
this pull request
May 2, 2026
## Summary
VOP3PX2 V_WMMA_SCALE* instructions have an unused scale_src2 field at
bits [58:50] (see VOP3PX2e::Inst{58-50} in VOP3PInstructions.td) that
the SQ incorrectly decodes as an SGPR reference, causing a 3-cycle SALU
stall after WMMA co-execution. This patch sets the field to VGPR0
encoding (0x100) to prevent the false dependency. Applies to both A0 and
B0 steppings.
The fix uses raw byte manipulation because scale_src2 is a hardware
encoding artifact not modeled as an MC operand. The MC layer has no
mechanism to read or set this field.
This only fires on the B0-to-A0 hotswap rewrite path. A0-native binaries
are compiled with an A0-targeted Clang that sets the field correctly at
codegen time.
## Changes
- **comgr-hotswap-patch-vop3px2-src2.cpp**: `applyVop3px2Src2Fix`
whole-kernel pass. Matches the 4 known `v_wmma_scale*` mnemonics via
explicit `StringSwitch` enumeration (not `starts_with`), patches bits
[58:50] in place. Includes diagnostic logging on size/bounds skip paths
and a gated summary log.
- **comgr-hotswap-internal.h**: Declare `patchScaleSrc2` for unit
testing.
- **comgr-hotswap-b0a0.cpp**: Add declaration, weak stub, and dispatcher
call for `applyVop3px2Src2Fix`.
- **CMakeLists.txt**: Add new source to library build.
- **hotswap-vop3px2-src2.s**: Positive lit test -- kernel with
`v_wmma_scale_f32_16x16x128_f8f6f4`. Verifies instruction survives
patching and idempotency (second rewrite produces identical bytes, which
proves the first pass set the field to VGPR0 since the assembler default
is not VGPR0).
- **hotswap-vop3px2-src2-noop.s**: Passthrough lit test -- kernel with
regular (non-scale) WMMA. Verifies no `v_wmma_scale` appears and layout
is preserved.
- **HotswapMCTest.cpp**: 6 new `PatchScaleSrc2` unit tests covering
zeroed field, all-ones field, already-VGPR0 (returns false), idempotency
(patch + re-patch), byte preservation outside scale_src2, and
non-scale-src2 bit preservation.
- **test-unit/CMakeLists.txt**: Link new source into HotswapMCTests.
## Tests
- 2 lit tests (positive + passthrough) with idempotency checks
- 6 unit tests for `patchScaleSrc2` byte-level correctness
- All existing lit, ctest, and unit tests pass
## Follow-up (deferred)
- [ ] `isVop3px2ScaleInst` unit tests (known-variant + near-miss
coverage); requires exposing the function in the header
- [ ] TSFlags-based dispatch instead of mnemonic StringSwitch
- [ ] Cache resolved opcodes in LLVMState at init time (same follow-up
as #2287)
Signed-off-by: xintin <gaurav.verma@amd.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
A0 silicon has a race condition where
s_barrier_signal_isfirstwith a user cluster barrier ID may return garbage in SCC before the barrier completes. The non-isfirst variant (s_barrier_signal) has the same encoding size and operand layout but does not write SCC.This patch adds the
s_barrier_signal_isfirst->s_barrier_signalin-place opcode swap to the existingapplyInPlacePatchesdispatcher alongside thecluster_load -> global_loadands_clause -> s_noprules from #2222. The swap is unconditional for alls_barrier_signal_isfirstinstructions. Non-isfirsts_barrier_signalsites are left unchanged.Changes
s_barrier_signal_isfirst->s_barrier_signalswap using the existingresolveOpcode/swapOpcodehelpers. Same pattern as thecluster_loadswap -- no new infrastructure.s_barrier_signal_isfirst -1. Verifies both are swapped tos_barrier_signal, waits are unchanged, nos_barrier_signal_isfirstremains anywhere in disassembly. Idempotency check via second rewrite +cmp.s_barrier_signal(workgroup -1 and user cluster -3). Verifies layout is structurally unchanged and nos_barrier_signal_isfirstappears. Idempotency check.s_barrier_signal_isfirstands_barrier_signalinterleaved. Verifies only the isfirst sites are swapped; the pre-existing non-isfirst site is preserved. Guards against astarts_with/containsregression in the mnemonic match.Tests
Follow up (already tracked)
Cache all in-place target opcodes in LLVMState at init time (covers cluster_load, s_barrier_signal, and future in-place swaps). Currently each swap calls resolveOpcode per instruction site;
caching at init eliminates redundant assemble+decode on every match.
This is tracked as part of the hotswap code-refactoring effort alongside the others (#2253)