[AMDGPU] comgr: add GFX1250 A0 WMMA hazard patch#2265
Conversation
|
@yxsamliu, this is a draft atm. |
|
Got it, thanks. I’ll treat this as early feedback for now and can take another pass once it is ready for detailed review. |
e51aab0 to
50bedd0
Compare
50bedd0 to
5174971
Compare
|
Hi @yxsamliu, thanks for your early feedback. |
|
Lets rebase this |
ea85fd9 to
be0c579
Compare
| } | ||
|
|
||
| static bool isVNop(const MCInst &Inst, const MCInstrInfo &MCII) { | ||
| return MCII.getName(Inst.getOpcode()) == "V_NOP_e32"; |
There was a problem hiding this comment.
But looking at the caller site of isVNop, the information in the LHS might be already in InternalDecodeInst type's Mnemonic field. Also, is it safe to use string "V_NOP_e23" directly in RHS? no _gfx12 suffix?
I don't have knowlege about cached SNopOpcode.. don't know if my question is related to it.
There was a problem hiding this comment.
SNopOpcode caches s_nop (scalar NOP), but the WMMA co-execution hazard requires v_nop (vector NOP)
they are different instructions on different execution units. The hazard is specifically about VALU co-execution timing with WMMA, so only v_nop provides the required separation.
That said, we could cache a VNopInst in LLVMState at init time (same pattern as SNopOpcode/SNopBytes) to avoid resolving it per patch call.
@chinmaydd , do you want to add that?
There was a problem hiding this comment.
Ah, gotcha. Yeah I think thats a good idea.
There was a problem hiding this comment.
@yxsamliu , you have valid point. we can switch isVNop to use DI.Mnemonic == "v_nop" at the InternalDecodedInst call sites.
There was a problem hiding this comment.
Hi @yoonseoch.
the information in the LHS might be already in InternalDecodeInst type's Mnemonic field
Right. I think my point was to avoid the hardcoding of the V_NOP mnemonic string here and doing it like we did it for S_NOP
Also, is it safe to use string "V_NOP_e23" directly in RHS? no _gfx12 suffix?
I believe thats okay, we have only a single V_NOP opcode
There was a problem hiding this comment.
Made it Unresolved for now to make the question visible. It is about outdated code and not relevant to the current code. Just want to make sure if we are on the same page about opcodes. The relevant outdated code was
static bool isVNop(const MCInst &Inst, const MCInstrInfo &MCII) {
return MCII.getName(Inst.getOpcode()) == "V_NOP_e32";
Also, is it safe to use string "V_NOP_e23" directly in RHS? no _gfx12 suffix?
I believe thats okay, we have only a single V_NOP opcode
I asked cursor and it claimed V_NOP_e32 is only pseudo and assembled/dissembled MCInst's opcode doesn't have that string, "V_NOP_e32". For gfx1250, MCII.getName(Inst.getOpcode()) will be V_NOP_e32_gfx12.
I tested with llvm-mc.
yoonchoi@banff-ccs-aus-g01-14:/work1/yoonchoi/llvm-project$ echo "v_nop" | ./build/bin/llvm-mc -arch=amdgcn -mcpu=gfx1250 -show-inst -show-encoding
v_nop ; encoding: [0x00,0x00,0x00,0x7e]
; <MCInst #50151 V_NOP_e32_gfx12>
yoonchoi@banff-ccs-aus-g01-14:/work1/yoonchoi/llvm-project$ echo "v_nop" | ./build/bin/llvm-mc -arch=amdgcn -show-inst -show-encoding
v_nop ; encoding: [0x00,0x00,0x00,0x7e]
; <MCInst #50153 V_NOP_e32_gfx6_gfx7>
So, my understanding of the outdated code is that it would have been returning false.
Anyway, the key seems to be avoiding comparison against hardcoded string. It was addressed in the comments for parseASMToMCInsts :
https://github.com/ROCm/llvm-project/pull/2265/changes#diff-becfadabf2911e7ffce0f46469c09abc36bea67dcf6d0fdd02d282cea2a148d4L138-L144
Is my understanding correct? @chinmaydd @xintin @yxsamliu
There was a problem hiding this comment.
Good catch @yoonseoch. Yes, that is correct. I think I misread the original code as .contains().
The == check against the MCII.getName() would not have worked due to the _gfx12 suffix string. And I think the current test doesnt really exercise that path.
The mnemonic check is okay for now but we are still baking some assumptions in regarding the type of v_nop. There is also v_nop_e64 no?
There was a problem hiding this comment.
grep "V_NOP" path/llvm-project/build/lib/Target/AMDGPU/AMDGPUGenInstrInfo.inc | head -10
V_NOP_dpp = 10392, // VOP1Instructions.td:127
V_NOP_e32 = 10393, // VOP1Instructions.td:116
V_NOP_e64 = 10394, // VOP1Instructions.td:120
V_NOP_sdwa = 10395, // VOP1Instructions.td:124
V_NOP_dpp8_gfx10 = 50146, // VOP1Instructions.td:1314
V_NOP_dpp_gfx10 = 50147, // VOP1Instructions.td:1310
V_NOP_dpp_vi = 50148, // VOP1Instructions.td:1558
V_NOP_e32_gfx10 = 50149, // VOP1Instructions.td:1293
V_NOP_e32_gfx11 = 50150, // VOP1Instructions.td:1011
V_NOP_e32_gfx12 = 50151, // VOP1Instructions.td:1011
V_NOP_e64 does exist as an instruction variant.
But, both V_NOP_e32 and V_NOP_e64 have the same assembly mnemonic "v_nop"
The _e32/_e64 are encoding variants (VOP1 32-bit vs 64-bit encoding), not different instructions. So DI.Mnemonic == "v_nop" correctly matches all v_nop encoding variants.
c286c3c to
b5c924e
Compare
b5c924e to
33680fb
Compare
33680fb to
2344a54
Compare
Move the WMMA co-exec hazard pass into its own patch translation\nunit so the GFX1250 dispatcher only keeps the weak declaration and\nguarded call site. Rework instruction classification to use MC\ninstruction flags instead of mnemonic prefix checks, and derive VGPR\noverlap from explicit MC defs and uses.\n\nThis keeps the hotswap patch-family structure consistent while\nreusing existing AMDGPU opcode metadata at the MC layer.\n\nTests: ../llvm-build/tools/amd-comgr/unittests/HotswapTests\nTests: ctest --test-dir ../llvm-build/tools/amd-comgr --output-on-failure -j8\nTests: cmake --build ../llvm-build --target test-lit -j8\nAssisted-by: OpenAI Codex
Signed-off-by: xintin <gaurav.verma@amd.com>
Signed-off-by: xintin <gaurav.verma@amd.com>
Signed-off-by: xintin <gaurav.verma@amd.com>
Signed-off-by: xintin <gaurav.verma@amd.com>
2344a54 to
b89d9bb
Compare
~Stacked on #2203.~ Merged! ~picked from #2226 Closed. ~Depends on #2331 (cherry-picked for now).~ Merged. ## Summary Hotswap patch for the GFX1250 A0 WMMA/SWMMAC co-execution hazard. Detects WMMA/SWMMAC instructions that lack sufficient v_nop separation before the first overlapping co-executable VALU on A0 silicon, and inserts the required v_nop padding via trampoline. On A0, certain WMMA forms need more v_nop slots than B0 before an overlapping VALU can safely co-execute. The patch scans the decoded instruction stream, classifies each WMMA by its A0/B0 nop requirement, counts existing safe slots (v_nop, non-overlapping VALU, non-terminating SALU), and emits a trampoline with the deficit v_nops before the hazardous VALU. ## Changes - **comgr-hotswap-patch-wmma-hazard.cpp**: Strong-symbol override for applyWmmaHazardPatch. Classifies WMMA/SWMMAC via MCInstrDesc::TSFlags, detects VGPR overlap via MCRegisterInfo::regsOverlap, emits v_nop padding via buildTrampoline. Includes TSFlags sync comment, classifyWmmaNops ordering documentation, printInst round-trip documentation, duplicate VALU guard (DenseSet), buildTrampoline result check, dead variable removal. - **comgr-hotswap-internal.h**: Add WmmaNopReq struct and classifyWmmaNops declaration. - **comgr-hotswap-elf.cpp**: Remove duplicate refuse check from cherry-pick merge of #2331. - **hotswap-rewrite.c**: Remove output-size check from --output path (trampoline patches grow the ELF). - **CMakeLists.txt**: Add WMMA patch source to library build. - **hotswap-wmma-hazard.s**: Lit test: v_wmma_i32_16x16x64_iu8 followed by overlapping v_add_f32, verifies v_nop padding and idempotency. - **HotswapMCTest.cpp**: 10 new classifyWmmaNops unit tests covering all WMMA variants and ordering. - **test-unit/CMakeLists.txt**: Link WMMA patch source into HotswapMCTests. ### Test plan - lit, ctest, and unit test passes - ASan lit tests: no memory errors --------- Signed-off-by: xintin <gaurav.verma@amd.com> Co-authored-by: Sanket Pandit <sanket.pandit@amd.com>
Resolves two textual conflicts that surfaced after upstream landed new hotswap patches (#2265 WMMA hazard, #2287 barrier rewrite, etc.): * amd/comgr/CMakeLists.txt Both #2265 (`comgr-hotswap-patch-wmma-hazard.cpp`) and this branch (`comgr-hotswap-patch-wmma-split.cpp`) add a source file at the same position in the SOURCES list. Resolved by including both in alphabetical order (hazard, then split). * amd/comgr/test-lit/comgr-sources/hotswap-rewrite.c This branch added an opt-in `--allow-grow` flag to bypass the output-size check on the `--output` path (PR #1 commit eacb7dc). In #2265, upstream went the opposite direction and removed the size check on the `--output` path entirely. Resolved by taking upstream's version verbatim, since their direction is the consensus and it makes the `--allow-grow` flag redundant. This effectively reverts the eacb7dc commit's diff for this file. The driver-side `--allow-grow` flag parsing is gone after the merge; my LIT tests are updated below to drop the flag from their RUN lines so they invoke the driver as upstream now expects. Also stripped `--allow-grow` from the two WMMA-split LIT tests (`hotswap-wmma-split.s`, `hotswap-wmma-split-imm-src2.s`) so they match the new driver signature. Verified: all 15 hotswap LIT tests pass after the merge -- the 6 new upstream ones (hotswap-barrier-{,no-}isfirst.s, hotswap-wmma-hazard.s, etc.) plus the existing 7 plus my 2.
Addresses all six inline review comments from chinmaydd on PR #2379: * line 138 (Avoid auto throughout): replaced the one `auto It = Cache.find(...)` with the explicit DenseMap<unsigned, const MCRegisterClass *>::iterator type. No other `auto` remains in the file. * line 182 (use AMDGPU named operand metadata for src2, src2_modifiers, neg_lo, neg_hi -- duplicating the layout is fragile): replaced the "walk the first four register operands" pattern with a per-SplitKind VOP3PWmmaLayout table that names each MCInst slot (vdst, src0, src1, src2_modifiers, src2, plus any trailing modifier slots present in the profile). The K=128 fp8/bf8 family and the f4 form have empirically-different operand counts (7 vs 5 ops at MCInst level), so a single shared positional table would have been wrong; splitting the table by SplitKind keeps each layout self-documenting and the structural validation in extractWmmaOps() (operand count + per-slot reg/imm kind check) catches any TableGen drift loudly rather than silently miscompiling. The AMDGPU backend's getNamedOperandIdx() and OpName enum live in llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.h, which is a backend-private header (not installed in the LLVM dist), so we follow the same mirror-and-document pattern that comgr-hotswap-patch-wmma-hazard.cpp (#2265) uses for SIInstrFlags. * line 318 (Does this handle fp operands [for src2 inline imm]): no -- and now refuses explicitly. validateSplitOperands() rejects any non-zero src2 inline immediate with a log message. Only integer 0 (the compiler-folded zero accumulator -- the common shape for `C = wmma(A, B, 0)`-style code) is accepted. Non-zero / FP inline constants would need MCInstPrinter-mediated formatting to emit canonically; MCInstPrinter has no public single-operand entry point, so emitting via itostr would mis-encode FP inline constants like 1.0 (which encode at HW slot 242, not as a 32-bit literal). * line 342 (how are modifiers like neg_lo handled for src2): not preserved -- and now refuses explicitly. The K=128 fp8/bf8 family and the f4 form do NOT accept any modifier asm syntax (op_sel, neg_lo, neg_hi, clamp, matrix_*_reuse) per the AMDGPU asm parser (verified empirically by attempting to assemble each modifier; all rejected at clang -c time). The TableGen profile reserves slots for some modifiers (e.g. HasMatrixReuse=1 on F32_FP8BF8X128_WMMA_w32) but the asm grammar does not expose them at this opcode level. extractWmmaOps() defensively scans every trailing operand from src2_modifiers onward; if any is non-zero, the splitter refuses the rewrite. This means a future opcode evolution that exposes modifier syntax fails loudly here rather than silently dropping the modifier. * line 342 (what about cases where src2 is an inline constant [with modifiers]): both the non-zero src2 imm path and the modifier path are now rejected, so the inline-constant-with-modifier case is doubly refused. * line 1 of the imm-src2 test (test coverage is weak): expanded hotswap-wmma-split.s to add exact-register-slicing checks for the fp8_fp8 K-split (verifies first-half A_lo=v[0:7], B_lo=v[8:15], src2=original v[16:23], and second-half A_hi=v[8:15], B_hi=v[16:23], src2=dst v[16:23] -- catches off-by-one slicing) and for the f4 M-split (verifies dst slices v[4:11]/v[12:19], A slices v[0:7]/ v[8:15], B broadcast v[2:9], and the literal MATRIX_FMT_FP4 modifier on both halves). Added a new hotswap-wmma-split-refuse.s LIT test that exercises the non-zero src2 imm refusal path (the original K=128 opcode must remain unchanged in the output, and idempotency still holds across two rewrites). Modifier-bearing tests are not written: the assembler refuses all modifier syntax for these 9 opcodes, so a modifier-bearing input cannot be produced by clang to feed the splitter -- the defensive refusal in extractWmmaOps() is unreachable from any LIT-constructible input, which is itself the test that the contract holds. Verified end-to-end on amd-staging amd-llvm dist with all 16 hotswap LIT tests pass (the 13 pre-existing tests plus the two WMMA-split tests with the new exact-slicing assertions plus the new refusal test).
chinmaydd flagged on PR #2379 that the shared dispatcher API in comgr-hotswap-b0a0.cpp returns uint32_t where 0 means EITHER "this patch did not match the instruction" (correct dispatcher fall-through) OR "matched but failed to apply" (silent miscompile risk -- the rewriter returns SUCCESS with the original A0-incompatible opcode left in .text). They identified this as a cross-cutting concern across all hotswap patches (in-place #2222, WMMA hazard #2265, WMMA split, future patches). A proper fix is a separate follow-up upstream PR that refactors the dispatcher to distinguish the two cases (e.g. an enum NoMatch / Patched / Failed return type, or a `bool *Aborted` field threaded through PatchContext that the dispatcher checks and uses to short-circuit the rewrite with AMD_COMGR_STATUS_ERROR). For now: classify each return-0 path in the splitter as either "did not match" (correct) or "matched but failed" (currently silent), add a header comment to applyWmmaSplitPatches documenting the semantics + the cross-PR refactor concern, and tag every matched-but-failed exit with an inline comment so it's clear which return is the silent-miscompile risk. The behavior is unchanged; this is documentation of an existing limitation pending the cross-PR API refactor.
Stacked on #2203.Merged!picked from #2226Closed.Depends on #2331 (cherry-picked for now).Merged.Summary
Hotswap patch for the GFX1250 A0 WMMA/SWMMAC co-execution hazard. Detects WMMA/SWMMAC instructions that lack sufficient v_nop separation before the first overlapping co-executable VALU on A0 silicon, and inserts the required v_nop padding via trampoline.
On A0, certain WMMA forms need more v_nop slots than B0 before an overlapping VALU can safely co-execute.
The patch scans the decoded instruction stream, classifies each WMMA by its A0/B0 nop requirement, counts existing safe slots (v_nop, non-overlapping VALU, non-terminating SALU), and emits a trampoline with the deficit v_nops before the hazardous VALU.
Changes
Test plan