Skip to content

[AMDGPU] comgr: add GFX1250 A0 WMMA hazard patch#2265

Merged
lamb-j merged 6 commits into
ROCm:amd-stagingfrom
xintin:wmma-hazard-port-v2
Apr 30, 2026
Merged

[AMDGPU] comgr: add GFX1250 A0 WMMA hazard patch#2265
lamb-j merged 6 commits into
ROCm:amd-stagingfrom
xintin:wmma-hazard-port-v2

Conversation

@xintin
Copy link
Copy Markdown

@xintin xintin commented Apr 20, 2026

Stacked on #2203. Merged!
picked from #2226 Closed.

Depends on #2331 (cherry-picked for now). Merged.

Summary

Hotswap patch for the GFX1250 A0 WMMA/SWMMAC co-execution hazard. Detects WMMA/SWMMAC instructions that lack sufficient v_nop separation before the first overlapping co-executable VALU on A0 silicon, and inserts the required v_nop padding via trampoline.
On A0, certain WMMA forms need more v_nop slots than B0 before an overlapping VALU can safely co-execute.
The patch scans the decoded instruction stream, classifies each WMMA by its A0/B0 nop requirement, counts existing safe slots (v_nop, non-overlapping VALU, non-terminating SALU), and emits a trampoline with the deficit v_nops before the hazardous VALU.

Changes

  • comgr-hotswap-patch-wmma-hazard.cpp: Strong-symbol override for applyWmmaHazardPatch. Classifies WMMA/SWMMAC via MCInstrDesc::TSFlags, detects VGPR overlap via MCRegisterInfo::regsOverlap, emits v_nop padding via buildTrampoline. Includes TSFlags sync comment, classifyWmmaNops ordering documentation, printInst round-trip documentation, duplicate VALU guard (DenseSet), buildTrampoline result check, dead variable removal.
  • comgr-hotswap-internal.h: Add WmmaNopReq struct and classifyWmmaNops declaration.
  • comgr-hotswap-elf.cpp: Remove duplicate refuse check from cherry-pick merge of [AMDGPU] comgr: Fix ELF trampoline growth for clang/lld-produced HSACOs #2331.
  • hotswap-rewrite.c: Remove output-size check from --output path (trampoline patches grow the ELF).
  • CMakeLists.txt: Add WMMA patch source to library build.
  • hotswap-wmma-hazard.s: Lit test: v_wmma_i32_16x16x64_iu8 followed by overlapping v_add_f32, verifies v_nop padding and idempotency.
  • HotswapMCTest.cpp: 10 new classifyWmmaNops unit tests covering all WMMA variants and ordering.
  • test-unit/CMakeLists.txt: Link WMMA patch source into HotswapMCTests.

Test plan

  • lit, ctest, and unit test passes
  • ASan lit tests: no memory errors

@xintin xintin added comgr Related to Code Object Manager hotswap Related to the Comgr Hotswap feature ci:skip Skip all CI builds/tests for this PR labels Apr 20, 2026
@z1-cciauto
Copy link
Copy Markdown
Collaborator

Comment thread amd/comgr/src/comgr-hotswap-b0a0.cpp Outdated
@xintin
Copy link
Copy Markdown
Author

xintin commented Apr 24, 2026

@yxsamliu, this is a draft atm.

Comment thread amd/comgr/src/comgr-hotswap-llvm.cpp Outdated
@yxsamliu
Copy link
Copy Markdown

Got it, thanks. I’ll treat this as early feedback for now and can take another pass once it is ready for detailed review.

@xintin xintin force-pushed the wmma-hazard-port-v2 branch from e51aab0 to 50bedd0 Compare April 27, 2026 19:01
@z1-cciauto
Copy link
Copy Markdown
Collaborator

@z1-cciauto
Copy link
Copy Markdown
Collaborator

@z1-cciauto
Copy link
Copy Markdown
Collaborator

@xintin xintin marked this pull request as ready for review April 28, 2026 04:12
@xintin xintin requested review from chinmaydd and lamb-j as code owners April 28, 2026 04:12
@xintin
Copy link
Copy Markdown
Author

xintin commented Apr 28, 2026

Hi @yxsamliu, thanks for your early feedback.
I have made this PR available for your detailed review. Appreciate your time.
Thank you.

@chinmaydd
Copy link
Copy Markdown

Lets rebase this

@xintin xintin requested a review from yxsamliu April 28, 2026 15:20
@xintin xintin force-pushed the wmma-hazard-port-v2 branch from ea85fd9 to be0c579 Compare April 28, 2026 15:36
@z1-cciauto
Copy link
Copy Markdown
Collaborator

@xintin xintin requested review from jmmartinez and yoonseoch April 28, 2026 15:40
Comment thread amd/comgr/src/comgr-hotswap-patch-wmma-hazard.cpp
Comment thread amd/comgr/src/comgr-hotswap-patch-wmma-hazard.cpp Outdated
}

static bool isVNop(const MCInst &Inst, const MCInstrInfo &MCII) {
return MCII.getName(Inst.getOpcode()) == "V_NOP_e32";
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can use cached SNopOpcode?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But looking at the caller site of isVNop, the information in the LHS might be already in InternalDecodeInst type's Mnemonic field. Also, is it safe to use string "V_NOP_e23" directly in RHS? no _gfx12 suffix?
I don't have knowlege about cached SNopOpcode.. don't know if my question is related to it.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SNopOpcode caches s_nop (scalar NOP), but the WMMA co-execution hazard requires v_nop (vector NOP)
they are different instructions on different execution units. The hazard is specifically about VALU co-execution timing with WMMA, so only v_nop provides the required separation.

That said, we could cache a VNopInst in LLVMState at init time (same pattern as SNopOpcode/SNopBytes) to avoid resolving it per patch call.

@chinmaydd , do you want to add that?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, gotcha. Yeah I think thats a good idea.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yxsamliu , you have valid point. we can switch isVNop to use DI.Mnemonic == "v_nop" at the InternalDecodedInst call sites.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @yoonseoch.

the information in the LHS might be already in InternalDecodeInst type's Mnemonic field

Right. I think my point was to avoid the hardcoding of the V_NOP mnemonic string here and doing it like we did it for S_NOP

Also, is it safe to use string "V_NOP_e23" directly in RHS? no _gfx12 suffix?

I believe thats okay, we have only a single V_NOP opcode

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made it Unresolved for now to make the question visible. It is about outdated code and not relevant to the current code. Just want to make sure if we are on the same page about opcodes. The relevant outdated code was

static bool isVNop(const MCInst &Inst, const MCInstrInfo &MCII) {
  return MCII.getName(Inst.getOpcode()) == "V_NOP_e32";

Also, is it safe to use string "V_NOP_e23" directly in RHS? no _gfx12 suffix?

I believe thats okay, we have only a single V_NOP opcode

I asked cursor and it claimed V_NOP_e32 is only pseudo and assembled/dissembled MCInst's opcode doesn't have that string, "V_NOP_e32". For gfx1250, MCII.getName(Inst.getOpcode()) will be V_NOP_e32_gfx12.
I tested with llvm-mc.

yoonchoi@banff-ccs-aus-g01-14:/work1/yoonchoi/llvm-project$ echo "v_nop" | ./build/bin/llvm-mc -arch=amdgcn -mcpu=gfx1250 -show-inst -show-encoding
        v_nop                                   ; encoding: [0x00,0x00,0x00,0x7e]
                                        ; <MCInst #50151 V_NOP_e32_gfx12>
yoonchoi@banff-ccs-aus-g01-14:/work1/yoonchoi/llvm-project$ echo "v_nop" | ./build/bin/llvm-mc -arch=amdgcn  -show-inst -show-encoding
        v_nop                                   ; encoding: [0x00,0x00,0x00,0x7e]
                                        ; <MCInst #50153 V_NOP_e32_gfx6_gfx7>

So, my understanding of the outdated code is that it would have been returning false.

Anyway, the key seems to be avoiding comparison against hardcoded string. It was addressed in the comments for parseASMToMCInsts :
https://github.com/ROCm/llvm-project/pull/2265/changes#diff-becfadabf2911e7ffce0f46469c09abc36bea67dcf6d0fdd02d282cea2a148d4L138-L144

Is my understanding correct? @chinmaydd @xintin @yxsamliu

Copy link
Copy Markdown

@chinmaydd chinmaydd Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch @yoonseoch. Yes, that is correct. I think I misread the original code as .contains().

The == check against the MCII.getName() would not have worked due to the _gfx12 suffix string. And I think the current test doesnt really exercise that path.

The mnemonic check is okay for now but we are still baking some assumptions in regarding the type of v_nop. There is also v_nop_e64 no?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

grep "V_NOP" path/llvm-project/build/lib/Target/AMDGPU/AMDGPUGenInstrInfo.inc | head -10
V_NOP_dpp                                                            = 10392, // VOP1Instructions.td:127
    V_NOP_e32                                                            = 10393, // VOP1Instructions.td:116
    V_NOP_e64                                                            = 10394, // VOP1Instructions.td:120
    V_NOP_sdwa                                                           = 10395, // VOP1Instructions.td:124
    V_NOP_dpp8_gfx10                                                     = 50146, // VOP1Instructions.td:1314
    V_NOP_dpp_gfx10                                                      = 50147, // VOP1Instructions.td:1310
    V_NOP_dpp_vi                                                         = 50148, // VOP1Instructions.td:1558
    V_NOP_e32_gfx10                                                      = 50149, // VOP1Instructions.td:1293
    V_NOP_e32_gfx11                                                      = 50150, // VOP1Instructions.td:1011
    V_NOP_e32_gfx12                                                      = 50151, // VOP1Instructions.td:1011

V_NOP_e64 does exist as an instruction variant.

But, both V_NOP_e32 and V_NOP_e64 have the same assembly mnemonic "v_nop"
The _e32/_e64 are encoding variants (VOP1 32-bit vs 64-bit encoding), not different instructions. So DI.Mnemonic == "v_nop" correctly matches all v_nop encoding variants.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yoonseoch, nice catch.

Comment thread amd/comgr/src/comgr-hotswap-patch-wmma-hazard.cpp Outdated
Comment thread amd/comgr/src/comgr-hotswap-patch-wmma-hazard.cpp Outdated
@z1-cciauto
Copy link
Copy Markdown
Collaborator

@xintin xintin force-pushed the wmma-hazard-port-v2 branch from c286c3c to b5c924e Compare April 28, 2026 21:58
@z1-cciauto
Copy link
Copy Markdown
Collaborator

Comment thread amd/comgr/src/comgr-hotswap-patch-wmma-hazard.cpp
Comment thread amd/comgr/src/comgr-hotswap-internal.h
Comment thread amd/comgr/test-lit/hotswap-wmma-hazard.s Outdated
@xintin xintin force-pushed the wmma-hazard-port-v2 branch from b5c924e to 33680fb Compare April 29, 2026 00:14
@z1-cciauto
Copy link
Copy Markdown
Collaborator

Comment thread amd/comgr/test-lit/hotswap-wmma-hazard-partial.s
@xintin xintin force-pushed the wmma-hazard-port-v2 branch from 33680fb to 2344a54 Compare April 29, 2026 16:44
@xintin xintin requested a review from chinmaydd April 29, 2026 16:45
@z1-cciauto
Copy link
Copy Markdown
Collaborator

panditsa and others added 5 commits April 29, 2026 16:52
Move the WMMA co-exec hazard pass into its own patch translation\nunit so the GFX1250 dispatcher only keeps the weak declaration and\nguarded call site. Rework instruction classification to use MC\ninstruction flags instead of mnemonic prefix checks, and derive VGPR\noverlap from explicit MC defs and uses.\n\nThis keeps the hotswap patch-family structure consistent while\nreusing existing AMDGPU opcode metadata at the MC layer.\n\nTests: ../llvm-build/tools/amd-comgr/unittests/HotswapTests\nTests: ctest --test-dir ../llvm-build/tools/amd-comgr --output-on-failure -j8\nTests: cmake --build ../llvm-build --target test-lit -j8\nAssisted-by: OpenAI Codex
Signed-off-by: xintin <gaurav.verma@amd.com>
Signed-off-by: xintin <gaurav.verma@amd.com>
Signed-off-by: xintin <gaurav.verma@amd.com>
Comment thread amd/comgr/src/comgr-hotswap-llvm.cpp Outdated
Comment thread amd/comgr/test-lit/hotswap-wmma-hazard-safe.s Outdated
Copy link
Copy Markdown

@chinmaydd chinmaydd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall, thanks

Signed-off-by: xintin <gaurav.verma@amd.com>
@xintin xintin force-pushed the wmma-hazard-port-v2 branch from 2344a54 to b89d9bb Compare April 29, 2026 17:20
@z1-cciauto
Copy link
Copy Markdown
Collaborator

@lamb-j lamb-j merged commit eb2dda2 into ROCm:amd-staging Apr 30, 2026
40 of 47 checks passed
@xintin xintin removed the ci:skip Skip all CI builds/tests for this PR label Apr 30, 2026
kirthana14m pushed a commit that referenced this pull request Apr 30, 2026
~Stacked on #2203.~ Merged! 
~picked from #2226 Closed.

~Depends on #2331 (cherry-picked for now).~ Merged.

## Summary
Hotswap patch for the GFX1250 A0 WMMA/SWMMAC co-execution hazard.
Detects WMMA/SWMMAC instructions that lack sufficient v_nop separation
before the first overlapping co-executable VALU on A0 silicon, and
inserts the required v_nop padding via trampoline.
On A0, certain WMMA forms need more v_nop slots than B0 before an
overlapping VALU can safely co-execute.
The patch scans the decoded instruction stream, classifies each WMMA by
its A0/B0 nop requirement, counts existing safe slots (v_nop,
non-overlapping VALU, non-terminating SALU), and emits a trampoline with
the deficit v_nops before the hazardous VALU.

## Changes
- **comgr-hotswap-patch-wmma-hazard.cpp**: Strong-symbol override for
applyWmmaHazardPatch. Classifies WMMA/SWMMAC via MCInstrDesc::TSFlags,
detects VGPR overlap via MCRegisterInfo::regsOverlap, emits v_nop
padding via buildTrampoline. Includes TSFlags sync comment,
classifyWmmaNops ordering documentation, printInst round-trip
documentation, duplicate VALU guard (DenseSet), buildTrampoline result
check, dead variable removal.
- **comgr-hotswap-internal.h**: Add WmmaNopReq struct and
classifyWmmaNops declaration.
- **comgr-hotswap-elf.cpp**: Remove duplicate refuse check from
cherry-pick merge of #2331.
- **hotswap-rewrite.c**: Remove output-size check from --output path
(trampoline patches grow the ELF).
- **CMakeLists.txt**: Add WMMA patch source to library build.
- **hotswap-wmma-hazard.s**: Lit test: v_wmma_i32_16x16x64_iu8 followed
by overlapping v_add_f32, verifies v_nop padding and idempotency.
- **HotswapMCTest.cpp**: 10 new classifyWmmaNops unit tests covering all
WMMA variants and ordering.
- **test-unit/CMakeLists.txt**: Link WMMA patch source into
HotswapMCTests.

### Test plan
- lit, ctest, and unit test passes
- ASan lit tests: no memory errors

---------

Signed-off-by: xintin <gaurav.verma@amd.com>
Co-authored-by: Sanket Pandit <sanket.pandit@amd.com>
jammm added a commit that referenced this pull request Apr 30, 2026
Resolves two textual conflicts that surfaced after upstream landed
new hotswap patches (#2265 WMMA hazard, #2287 barrier rewrite, etc.):

* amd/comgr/CMakeLists.txt
  Both #2265 (`comgr-hotswap-patch-wmma-hazard.cpp`) and this branch
  (`comgr-hotswap-patch-wmma-split.cpp`) add a source file at the
  same position in the SOURCES list. Resolved by including both in
  alphabetical order (hazard, then split).

* amd/comgr/test-lit/comgr-sources/hotswap-rewrite.c
  This branch added an opt-in `--allow-grow` flag to bypass the
  output-size check on the `--output` path (PR #1 commit eacb7dc).
  In #2265, upstream went the opposite direction and removed the
  size check on the `--output` path entirely. Resolved by taking
  upstream's version verbatim, since their direction is the
  consensus and it makes the `--allow-grow` flag redundant.

  This effectively reverts the eacb7dc commit's diff for this file.
  The driver-side `--allow-grow` flag parsing is gone after the
  merge; my LIT tests are updated below to drop the flag from their
  RUN lines so they invoke the driver as upstream now expects.

Also stripped `--allow-grow` from the two WMMA-split LIT tests
(`hotswap-wmma-split.s`, `hotswap-wmma-split-imm-src2.s`) so they
match the new driver signature.

Verified: all 15 hotswap LIT tests pass after the merge -- the 6
new upstream ones (hotswap-barrier-{,no-}isfirst.s,
hotswap-wmma-hazard.s, etc.) plus the existing 7 plus my 2.
jammm added a commit that referenced this pull request May 4, 2026
Addresses all six inline review comments from chinmaydd on PR #2379:

* line 138 (Avoid auto throughout): replaced the one `auto It =
  Cache.find(...)` with the explicit
  DenseMap<unsigned, const MCRegisterClass *>::iterator type. No other
  `auto` remains in the file.

* line 182 (use AMDGPU named operand metadata for src2,
  src2_modifiers, neg_lo, neg_hi -- duplicating the layout is fragile):
  replaced the "walk the first four register operands" pattern with a
  per-SplitKind VOP3PWmmaLayout table that names each MCInst slot
  (vdst, src0, src1, src2_modifiers, src2, plus any trailing modifier
  slots present in the profile). The K=128 fp8/bf8 family and the f4
  form have empirically-different operand counts (7 vs 5 ops at MCInst
  level), so a single shared positional table would have been wrong;
  splitting the table by SplitKind keeps each layout self-documenting
  and the structural validation in extractWmmaOps() (operand count +
  per-slot reg/imm kind check) catches any TableGen drift loudly
  rather than silently miscompiling. The AMDGPU backend's
  getNamedOperandIdx() and OpName enum live in
  llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.h, which is a
  backend-private header (not installed in the LLVM dist), so we
  follow the same mirror-and-document pattern that
  comgr-hotswap-patch-wmma-hazard.cpp (#2265) uses for SIInstrFlags.

* line 318 (Does this handle fp operands [for src2 inline imm]):
  no -- and now refuses explicitly. validateSplitOperands() rejects
  any non-zero src2 inline immediate with a log message. Only integer
  0 (the compiler-folded zero accumulator -- the common shape for
  `C = wmma(A, B, 0)`-style code) is accepted. Non-zero / FP inline
  constants would need MCInstPrinter-mediated formatting to emit
  canonically; MCInstPrinter has no public single-operand entry point,
  so emitting via itostr would mis-encode FP inline constants like 1.0
  (which encode at HW slot 242, not as a 32-bit literal).

* line 342 (how are modifiers like neg_lo handled for src2):
  not preserved -- and now refuses explicitly. The K=128 fp8/bf8
  family and the f4 form do NOT accept any modifier asm syntax
  (op_sel, neg_lo, neg_hi, clamp, matrix_*_reuse) per the AMDGPU asm
  parser (verified empirically by attempting to assemble each
  modifier; all rejected at clang -c time). The TableGen profile
  reserves slots for some modifiers (e.g. HasMatrixReuse=1 on
  F32_FP8BF8X128_WMMA_w32) but the asm grammar does not expose them
  at this opcode level. extractWmmaOps() defensively scans every
  trailing operand from src2_modifiers onward; if any is non-zero,
  the splitter refuses the rewrite. This means a future opcode
  evolution that exposes modifier syntax fails loudly here rather
  than silently dropping the modifier.

* line 342 (what about cases where src2 is an inline constant [with
  modifiers]): both the non-zero src2 imm path and the modifier path
  are now rejected, so the inline-constant-with-modifier case is
  doubly refused.

* line 1 of the imm-src2 test (test coverage is weak): expanded
  hotswap-wmma-split.s to add exact-register-slicing checks for the
  fp8_fp8 K-split (verifies first-half A_lo=v[0:7], B_lo=v[8:15],
  src2=original v[16:23], and second-half A_hi=v[8:15], B_hi=v[16:23],
  src2=dst v[16:23] -- catches off-by-one slicing) and for the f4
  M-split (verifies dst slices v[4:11]/v[12:19], A slices v[0:7]/
  v[8:15], B broadcast v[2:9], and the literal MATRIX_FMT_FP4 modifier
  on both halves). Added a new hotswap-wmma-split-refuse.s LIT test
  that exercises the non-zero src2 imm refusal path (the original
  K=128 opcode must remain unchanged in the output, and idempotency
  still holds across two rewrites). Modifier-bearing tests are not
  written: the assembler refuses all modifier syntax for these 9
  opcodes, so a modifier-bearing input cannot be produced by clang
  to feed the splitter -- the defensive refusal in extractWmmaOps()
  is unreachable from any LIT-constructible input, which is itself
  the test that the contract holds.

Verified end-to-end on amd-staging amd-llvm dist with
all 16 hotswap LIT tests pass (the 13
pre-existing tests plus the two WMMA-split tests with the new
exact-slicing assertions plus the new refusal test).
jammm added a commit that referenced this pull request May 7, 2026
chinmaydd flagged on PR #2379 that the shared dispatcher API in
comgr-hotswap-b0a0.cpp returns uint32_t where 0 means EITHER
"this patch did not match the instruction" (correct dispatcher
fall-through) OR "matched but failed to apply" (silent miscompile
risk -- the rewriter returns SUCCESS with the original
A0-incompatible opcode left in .text). They identified this as a
cross-cutting concern across all hotswap patches (in-place #2222,
WMMA hazard #2265, WMMA split, future patches).

A proper fix is a separate follow-up upstream PR that refactors the
dispatcher to distinguish the two cases (e.g. an enum NoMatch /
Patched / Failed return type, or a `bool *Aborted` field threaded
through PatchContext that the dispatcher checks and uses to
short-circuit the rewrite with AMD_COMGR_STATUS_ERROR).

For now: classify each return-0 path in the splitter as either
"did not match" (correct) or "matched but failed" (currently silent),
add a header comment to applyWmmaSplitPatches documenting the
semantics + the cross-PR refactor concern, and tag every
matched-but-failed exit with an inline comment so it's clear which
return is the silent-miscompile risk.

The behavior is unchanged; this is documentation of an existing
limitation pending the cross-PR API refactor.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comgr Related to Code Object Manager hotswap Related to the Comgr Hotswap feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants