[AMDGPU] comgr: add GFX1250 A0 WMMA hazard patch by xintin · Pull Request #2265 · ROCm/llvm-project

xintin · 2026-04-20T22:10:48Z

~~Stacked on #2203.~~ Merged!
~~picked from #2226~~ Closed.

~~Depends on #2331 (cherry-picked for now).~~ Merged.

Summary

Hotswap patch for the GFX1250 A0 WMMA/SWMMAC co-execution hazard. Detects WMMA/SWMMAC instructions that lack sufficient v_nop separation before the first overlapping co-executable VALU on A0 silicon, and inserts the required v_nop padding via trampoline.
On A0, certain WMMA forms need more v_nop slots than B0 before an overlapping VALU can safely co-execute.
The patch scans the decoded instruction stream, classifies each WMMA by its A0/B0 nop requirement, counts existing safe slots (v_nop, non-overlapping VALU, non-terminating SALU), and emits a trampoline with the deficit v_nops before the hazardous VALU.

Changes

comgr-hotswap-patch-wmma-hazard.cpp: Strong-symbol override for applyWmmaHazardPatch. Classifies WMMA/SWMMAC via MCInstrDesc::TSFlags, detects VGPR overlap via MCRegisterInfo::regsOverlap, emits v_nop padding via buildTrampoline. Includes TSFlags sync comment, classifyWmmaNops ordering documentation, printInst round-trip documentation, duplicate VALU guard (DenseSet), buildTrampoline result check, dead variable removal.
comgr-hotswap-internal.h: Add WmmaNopReq struct and classifyWmmaNops declaration.
comgr-hotswap-elf.cpp: Remove duplicate refuse check from cherry-pick merge of [AMDGPU] comgr: Fix ELF trampoline growth for clang/lld-produced HSACOs #2331.
hotswap-rewrite.c: Remove output-size check from --output path (trampoline patches grow the ELF).
CMakeLists.txt: Add WMMA patch source to library build.
hotswap-wmma-hazard.s: Lit test: v_wmma_i32_16x16x64_iu8 followed by overlapping v_add_f32, verifies v_nop padding and idempotency.
HotswapMCTest.cpp: 10 new classifyWmmaNops unit tests covering all WMMA variants and ordering.
test-unit/CMakeLists.txt: Link WMMA patch source into HotswapMCTests.

Test plan

lit, ctest, and unit test passes
ASan lit tests: no memory errors

z1-cciauto · 2026-04-20T22:16:58Z

PSDB Link: https://compiler-ci.amd.com/job/compiler-psdb-amd-staging/5319

xintin · 2026-04-24T17:38:32Z

@yxsamliu, this is a draft atm.

yxsamliu · 2026-04-27T15:51:32Z

Got it, thanks. I’ll treat this as early feedback for now and can take another pass once it is ready for detailed review.

z1-cciauto · 2026-04-27T19:08:48Z

PSDB Link: https://compiler-ci.amd.com/job/compiler-psdb-amd-staging/5497

z1-cciauto · 2026-04-28T00:07:18Z

PSDB Link: https://compiler-ci.amd.com/job/compiler-psdb-amd-staging/5508

z1-cciauto · 2026-04-28T00:52:35Z

PSDB Link: https://compiler-ci.amd.com/job/compiler-psdb-amd-staging/5509

xintin · 2026-04-28T04:14:40Z

Hi @yxsamliu, thanks for your early feedback.
I have made this PR available for your detailed review. Appreciate your time.
Thank you.

chinmaydd · 2026-04-28T15:18:18Z

Lets rebase this

z1-cciauto · 2026-04-28T15:38:16Z

PSDB Link: https://compiler-ci.amd.com/job/compiler-psdb-amd-staging/5527

chinmaydd · 2026-04-28T20:35:02Z

+}
+
+static bool isVNop(const MCInst &Inst, const MCInstrInfo &MCII) {
+  return MCII.getName(Inst.getOpcode()) == "V_NOP_e32";


Can use cached SNopOpcode?

But looking at the caller site of isVNop, the information in the LHS might be already in InternalDecodeInst type's Mnemonic field. Also, is it safe to use string "V_NOP_e23" directly in RHS? no _gfx12 suffix?
I don't have knowlege about cached SNopOpcode.. don't know if my question is related to it.

SNopOpcode caches s_nop (scalar NOP), but the WMMA co-execution hazard requires v_nop (vector NOP)
they are different instructions on different execution units. The hazard is specifically about VALU co-execution timing with WMMA, so only v_nop provides the required separation.

That said, we could cache a VNopInst in LLVMState at init time (same pattern as SNopOpcode/SNopBytes) to avoid resolving it per patch call.

@chinmaydd , do you want to add that?

Ah, gotcha. Yeah I think thats a good idea.

@yxsamliu , you have valid point. we can switch isVNop to use DI.Mnemonic == "v_nop" at the InternalDecodedInst call sites.

Hi @yoonseoch.

the information in the LHS might be already in InternalDecodeInst type's Mnemonic field

Right. I think my point was to avoid the hardcoding of the V_NOP mnemonic string here and doing it like we did it for S_NOP

Also, is it safe to use string "V_NOP_e23" directly in RHS? no _gfx12 suffix?

I believe thats okay, we have only a single V_NOP opcode

Made it Unresolved for now to make the question visible. It is about outdated code and not relevant to the current code. Just want to make sure if we are on the same page about opcodes. The relevant outdated code was

static bool isVNop(const MCInst &Inst, const MCInstrInfo &MCII) { return MCII.getName(Inst.getOpcode()) == "V_NOP_e32";

Also, is it safe to use string "V_NOP_e23" directly in RHS? no _gfx12 suffix?

I believe thats okay, we have only a single V_NOP opcode

I asked cursor and it claimed V_NOP_e32 is only pseudo and assembled/dissembled MCInst's opcode doesn't have that string, "V_NOP_e32". For gfx1250, MCII.getName(Inst.getOpcode()) will be V_NOP_e32_gfx12.
I tested with llvm-mc.

yoonchoi@banff-ccs-aus-g01-14:/work1/yoonchoi/llvm-project$ echo "v_nop" | ./build/bin/llvm-mc -arch=amdgcn -mcpu=gfx1250 -show-inst -show-encoding v_nop ; encoding: [0x00,0x00,0x00,0x7e] ; <MCInst #50151 V_NOP_e32_gfx12> yoonchoi@banff-ccs-aus-g01-14:/work1/yoonchoi/llvm-project$ echo "v_nop" | ./build/bin/llvm-mc -arch=amdgcn -show-inst -show-encoding v_nop ; encoding: [0x00,0x00,0x00,0x7e] ; <MCInst #50153 V_NOP_e32_gfx6_gfx7>

So, my understanding of the outdated code is that it would have been returning false.

Anyway, the key seems to be avoiding comparison against hardcoded string. It was addressed in the comments for parseASMToMCInsts :
https://github.com/ROCm/llvm-project/pull/2265/changes#diff-becfadabf2911e7ffce0f46469c09abc36bea67dcf6d0fdd02d282cea2a148d4L138-L144

Is my understanding correct? @chinmaydd @xintin @yxsamliu

Good catch @yoonseoch. Yes, that is correct. I think I misread the original code as .contains().

The == check against the MCII.getName() would not have worked due to the _gfx12 suffix string. And I think the current test doesnt really exercise that path.

The mnemonic check is okay for now but we are still baking some assumptions in regarding the type of v_nop. There is also v_nop_e64 no?

grep "V_NOP" path/llvm-project/build/lib/Target/AMDGPU/AMDGPUGenInstrInfo.inc | head -10 V_NOP_dpp = 10392, // VOP1Instructions.td:127 V_NOP_e32 = 10393, // VOP1Instructions.td:116 V_NOP_e64 = 10394, // VOP1Instructions.td:120 V_NOP_sdwa = 10395, // VOP1Instructions.td:124 V_NOP_dpp8_gfx10 = 50146, // VOP1Instructions.td:1314 V_NOP_dpp_gfx10 = 50147, // VOP1Instructions.td:1310 V_NOP_dpp_vi = 50148, // VOP1Instructions.td:1558 V_NOP_e32_gfx10 = 50149, // VOP1Instructions.td:1293 V_NOP_e32_gfx11 = 50150, // VOP1Instructions.td:1011 V_NOP_e32_gfx12 = 50151, // VOP1Instructions.td:1011

V_NOP_e64 does exist as an instruction variant.

But, both V_NOP_e32 and V_NOP_e64 have the same assembly mnemonic "v_nop"
The _e32/_e64 are encoding variants (VOP1 32-bit vs 64-bit encoding), not different instructions. So DI.Mnemonic == "v_nop" correctly matches all v_nop encoding variants.

@yoonseoch, nice catch.

z1-cciauto · 2026-04-28T21:55:54Z

PSDB Link: https://compiler-ci.amd.com/job/compiler-psdb-amd-staging/5547

z1-cciauto · 2026-04-28T22:03:35Z

PSDB Link: https://compiler-ci.amd.com/job/compiler-psdb-amd-staging/5548

z1-cciauto · 2026-04-29T00:17:56Z

PSDB Link: https://compiler-ci.amd.com/job/compiler-psdb-amd-staging/5553

z1-cciauto · 2026-04-29T16:49:56Z

PSDB Link: https://compiler-ci.amd.com/job/compiler-psdb-amd-staging/5571

Move the WMMA co-exec hazard pass into its own patch translation\nunit so the GFX1250 dispatcher only keeps the weak declaration and\nguarded call site. Rework instruction classification to use MC\ninstruction flags instead of mnemonic prefix checks, and derive VGPR\noverlap from explicit MC defs and uses.\n\nThis keeps the hotswap patch-family structure consistent while\nreusing existing AMDGPU opcode metadata at the MC layer.\n\nTests: ../llvm-build/tools/amd-comgr/unittests/HotswapTests\nTests: ctest --test-dir ../llvm-build/tools/amd-comgr --output-on-failure -j8\nTests: cmake --build ../llvm-build --target test-lit -j8\nAssisted-by: OpenAI Codex

Signed-off-by: xintin <gaurav.verma@amd.com>

chinmaydd

LGTM overall, thanks

Signed-off-by: xintin <gaurav.verma@amd.com>

z1-cciauto · 2026-04-29T17:27:58Z

PSDB Link: https://compiler-ci.amd.com/job/compiler-psdb-amd-staging/5573

~Stacked on #2203.~ Merged! ~picked from #2226 Closed. ~Depends on #2331 (cherry-picked for now).~ Merged. ## Summary Hotswap patch for the GFX1250 A0 WMMA/SWMMAC co-execution hazard. Detects WMMA/SWMMAC instructions that lack sufficient v_nop separation before the first overlapping co-executable VALU on A0 silicon, and inserts the required v_nop padding via trampoline. On A0, certain WMMA forms need more v_nop slots than B0 before an overlapping VALU can safely co-execute. The patch scans the decoded instruction stream, classifies each WMMA by its A0/B0 nop requirement, counts existing safe slots (v_nop, non-overlapping VALU, non-terminating SALU), and emits a trampoline with the deficit v_nops before the hazardous VALU. ## Changes - **comgr-hotswap-patch-wmma-hazard.cpp**: Strong-symbol override for applyWmmaHazardPatch. Classifies WMMA/SWMMAC via MCInstrDesc::TSFlags, detects VGPR overlap via MCRegisterInfo::regsOverlap, emits v_nop padding via buildTrampoline. Includes TSFlags sync comment, classifyWmmaNops ordering documentation, printInst round-trip documentation, duplicate VALU guard (DenseSet), buildTrampoline result check, dead variable removal. - **comgr-hotswap-internal.h**: Add WmmaNopReq struct and classifyWmmaNops declaration. - **comgr-hotswap-elf.cpp**: Remove duplicate refuse check from cherry-pick merge of #2331. - **hotswap-rewrite.c**: Remove output-size check from --output path (trampoline patches grow the ELF). - **CMakeLists.txt**: Add WMMA patch source to library build. - **hotswap-wmma-hazard.s**: Lit test: v_wmma_i32_16x16x64_iu8 followed by overlapping v_add_f32, verifies v_nop padding and idempotency. - **HotswapMCTest.cpp**: 10 new classifyWmmaNops unit tests covering all WMMA variants and ordering. - **test-unit/CMakeLists.txt**: Link WMMA patch source into HotswapMCTests. ### Test plan - lit, ctest, and unit test passes - ASan lit tests: no memory errors --------- Signed-off-by: xintin <gaurav.verma@amd.com> Co-authored-by: Sanket Pandit <sanket.pandit@amd.com>

Resolves two textual conflicts that surfaced after upstream landed new hotswap patches (#2265 WMMA hazard, #2287 barrier rewrite, etc.): * amd/comgr/CMakeLists.txt Both #2265 (`comgr-hotswap-patch-wmma-hazard.cpp`) and this branch (`comgr-hotswap-patch-wmma-split.cpp`) add a source file at the same position in the SOURCES list. Resolved by including both in alphabetical order (hazard, then split). * amd/comgr/test-lit/comgr-sources/hotswap-rewrite.c This branch added an opt-in `--allow-grow` flag to bypass the output-size check on the `--output` path (PR #1 commit eacb7dc). In #2265, upstream went the opposite direction and removed the size check on the `--output` path entirely. Resolved by taking upstream's version verbatim, since their direction is the consensus and it makes the `--allow-grow` flag redundant. This effectively reverts the eacb7dc commit's diff for this file. The driver-side `--allow-grow` flag parsing is gone after the merge; my LIT tests are updated below to drop the flag from their RUN lines so they invoke the driver as upstream now expects. Also stripped `--allow-grow` from the two WMMA-split LIT tests (`hotswap-wmma-split.s`, `hotswap-wmma-split-imm-src2.s`) so they match the new driver signature. Verified: all 15 hotswap LIT tests pass after the merge -- the 6 new upstream ones (hotswap-barrier-{,no-}isfirst.s, hotswap-wmma-hazard.s, etc.) plus the existing 7 plus my 2.

Addresses all six inline review comments from chinmaydd on PR #2379: * line 138 (Avoid auto throughout): replaced the one `auto It = Cache.find(...)` with the explicit DenseMap<unsigned, const MCRegisterClass *>::iterator type. No other `auto` remains in the file. * line 182 (use AMDGPU named operand metadata for src2, src2_modifiers, neg_lo, neg_hi -- duplicating the layout is fragile): replaced the "walk the first four register operands" pattern with a per-SplitKind VOP3PWmmaLayout table that names each MCInst slot (vdst, src0, src1, src2_modifiers, src2, plus any trailing modifier slots present in the profile). The K=128 fp8/bf8 family and the f4 form have empirically-different operand counts (7 vs 5 ops at MCInst level), so a single shared positional table would have been wrong; splitting the table by SplitKind keeps each layout self-documenting and the structural validation in extractWmmaOps() (operand count + per-slot reg/imm kind check) catches any TableGen drift loudly rather than silently miscompiling. The AMDGPU backend's getNamedOperandIdx() and OpName enum live in llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.h, which is a backend-private header (not installed in the LLVM dist), so we follow the same mirror-and-document pattern that comgr-hotswap-patch-wmma-hazard.cpp (#2265) uses for SIInstrFlags. * line 318 (Does this handle fp operands [for src2 inline imm]): no -- and now refuses explicitly. validateSplitOperands() rejects any non-zero src2 inline immediate with a log message. Only integer 0 (the compiler-folded zero accumulator -- the common shape for `C = wmma(A, B, 0)`-style code) is accepted. Non-zero / FP inline constants would need MCInstPrinter-mediated formatting to emit canonically; MCInstPrinter has no public single-operand entry point, so emitting via itostr would mis-encode FP inline constants like 1.0 (which encode at HW slot 242, not as a 32-bit literal). * line 342 (how are modifiers like neg_lo handled for src2): not preserved -- and now refuses explicitly. The K=128 fp8/bf8 family and the f4 form do NOT accept any modifier asm syntax (op_sel, neg_lo, neg_hi, clamp, matrix_*_reuse) per the AMDGPU asm parser (verified empirically by attempting to assemble each modifier; all rejected at clang -c time). The TableGen profile reserves slots for some modifiers (e.g. HasMatrixReuse=1 on F32_FP8BF8X128_WMMA_w32) but the asm grammar does not expose them at this opcode level. extractWmmaOps() defensively scans every trailing operand from src2_modifiers onward; if any is non-zero, the splitter refuses the rewrite. This means a future opcode evolution that exposes modifier syntax fails loudly here rather than silently dropping the modifier. * line 342 (what about cases where src2 is an inline constant [with modifiers]): both the non-zero src2 imm path and the modifier path are now rejected, so the inline-constant-with-modifier case is doubly refused. * line 1 of the imm-src2 test (test coverage is weak): expanded hotswap-wmma-split.s to add exact-register-slicing checks for the fp8_fp8 K-split (verifies first-half A_lo=v[0:7], B_lo=v[8:15], src2=original v[16:23], and second-half A_hi=v[8:15], B_hi=v[16:23], src2=dst v[16:23] -- catches off-by-one slicing) and for the f4 M-split (verifies dst slices v[4:11]/v[12:19], A slices v[0:7]/ v[8:15], B broadcast v[2:9], and the literal MATRIX_FMT_FP4 modifier on both halves). Added a new hotswap-wmma-split-refuse.s LIT test that exercises the non-zero src2 imm refusal path (the original K=128 opcode must remain unchanged in the output, and idempotency still holds across two rewrites). Modifier-bearing tests are not written: the assembler refuses all modifier syntax for these 9 opcodes, so a modifier-bearing input cannot be produced by clang to feed the splitter -- the defensive refusal in extractWmmaOps() is unreachable from any LIT-constructible input, which is itself the test that the contract holds. Verified end-to-end on amd-staging amd-llvm dist with all 16 hotswap LIT tests pass (the 13 pre-existing tests plus the two WMMA-split tests with the new exact-slicing assertions plus the new refusal test).

chinmaydd flagged on PR #2379 that the shared dispatcher API in comgr-hotswap-b0a0.cpp returns uint32_t where 0 means EITHER "this patch did not match the instruction" (correct dispatcher fall-through) OR "matched but failed to apply" (silent miscompile risk -- the rewriter returns SUCCESS with the original A0-incompatible opcode left in .text). They identified this as a cross-cutting concern across all hotswap patches (in-place #2222, WMMA hazard #2265, WMMA split, future patches). A proper fix is a separate follow-up upstream PR that refactors the dispatcher to distinguish the two cases (e.g. an enum NoMatch / Patched / Failed return type, or a `bool *Aborted` field threaded through PatchContext that the dispatcher checks and uses to short-circuit the rewrite with AMD_COMGR_STATUS_ERROR). For now: classify each return-0 path in the splitter as either "did not match" (correct) or "matched but failed" (currently silent), add a header comment to applyWmmaSplitPatches documenting the semantics + the cross-PR refactor concern, and tag every matched-but-failed exit with an inline comment so it's clear which return is the silent-miscompile risk. The behavior is unchanged; this is documentation of an existing limitation pending the cross-PR API refactor.

xintin added comgr Related to Code Object Manager hotswap Related to the Comgr Hotswap feature ci:skip Skip all CI builds/tests for this PR labels Apr 20, 2026

xintin mentioned this pull request Apr 22, 2026

[AMDGPU] comgr: split non-stride64 DS 2-address ops for A0 #2281

Open

yxsamliu reviewed Apr 24, 2026

View reviewed changes

Comment thread amd/comgr/src/comgr-hotswap-b0a0.cpp Outdated

yxsamliu reviewed Apr 25, 2026

View reviewed changes

Comment thread amd/comgr/src/comgr-hotswap-llvm.cpp Outdated

xintin force-pushed the wmma-hazard-port-v2 branch from e51aab0 to 50bedd0 Compare April 27, 2026 19:01

xintin mentioned this pull request Apr 27, 2026

[AMDGPU] comgr: Fix ELF trampoline growth for clang/lld-produced HSACOs #2331

Merged

xintin force-pushed the wmma-hazard-port-v2 branch from 50bedd0 to 5174971 Compare April 28, 2026 00:05

xintin marked this pull request as ready for review April 28, 2026 04:12

xintin requested review from chinmaydd and lamb-j as code owners April 28, 2026 04:12

xintin requested a review from yxsamliu April 28, 2026 15:20

xintin force-pushed the wmma-hazard-port-v2 branch from ea85fd9 to be0c579 Compare April 28, 2026 15:36

xintin requested review from jmmartinez and yoonseoch April 28, 2026 15:40