[AMDGPU] comgr: add WMMA instruction splitting for GFX1250 B0-to-A0 compatibility by jammm · Pull Request #2379 · ROCm/llvm-project

jammm · 2026-04-30T18:26:33Z

(recreated and added on top of #2242. Reposting the PR description here)

Summary

Adds the K=128 WMMA splitter as the next B0-to-A0 silicon-stepping
workaround in the COMGR HotSwap subsystem, alongside the in-place
patches landed in #2222. The splitter rewrites K=128 WMMA variants
that exist on GFX1250 B0 but not A0 into pairs of narrower WMMAs
that exist on both steppings, emitted as trampolines appended to
.text by the existing hotswap rewriter.

This PR plugs into the applyWmmaSplitPatches weak-symbol stub
already declared in comgr-hotswap-b0a0.cpp (from the policy/API
3-of-3 series #2203). Same plug-in pattern as
comgr-hotswap-patch-inplace.cpp (#2222): a single new
comgr-hotswap-patch-wmma-split.cpp provides the strong override,
and the integration is one CMakeLists.txt source-list line. No
changes to the existing hotswap infrastructure files.

Splitter scope

Two split families are handled (per
comgr-hotswap-patch-wmma-split.cpp's SplitTable):

K-dimension split — v_wmma_*_16x16x128_{fp8,bf8}_{fp8,bf8}
(8 type variants: 4 sign combinations × {f16,f32} dest) → two
v_wmma_*_16x16x64_*_* halves. Source matrices are split along
K; the destination accumulator threads through both halves.
M-dimension split + format coercion — v_wmma_f32_32x16x128_f4
→ two v_wmma_f32_16x16x128_f8f6f4 halves with explicit FP4
matrix-format modifiers (the f8f6f4 opcode + FP4 modifiers
express the same matrix-data layout as the standalone f4 form).

Inline-immediate $src2 (compiler-folded zero accumulator —
acc = {0,…,0} is a common shape) is supported via a 3-VGPR +
immediate match path; without this, the splitter would refuse the
rewrite and the kernel would keep the original A0-incompatible
opcode.

VGPR base/count extraction:

Base: low 10 bits of MRI.getEncodingValue() (the high bits are
HW flags like IS_VGPR = 1 << 10 per SIDefines.h's
AMDGPU::HWEncoding). The mask 0x3ff is the stable AMDGPU HW
encoding, named locally because SIDefines.h is target-internal.
Count: bit width of the smallest enclosing MCRegisterClass
divided by 32, the same pattern AMDGPUDisassembler:: CheckVGPROverflow uses. MCRegisterInfo::regunits() is NOT the
right primitive: on GFX12+ each VGPR_32 has two regunits
(lo16 + hi16 sub-register slots), so a VReg_256 (8 VGPRs, 256
bits) yields 16 regunits, not 8. The smallest-class lookup is
memoized per MCRegister in a thread_local DenseMap,
invalidated when the MCRegisterInfo * changes between calls
(each retargetCodeObjectB0A0 constructs a fresh
MCRegisterInfo via initLLVM, so a static cross-call cache
would dangle).

Test driver tweak

Trampoline-based hotswap rewrites grow .text, so the existing
OutSize == ElfSize check in hotswap-rewrite.c's --output path
(kept on yxsamliu's review on #2222 to catch accidental in-place
growth) needed an opt-out. Added an --allow-grow flag that skips
the size check on the --output path; default behavior is unchanged
so the in-place LIT tests still get the strict check. The flag is
rejected when --output is not also given (otherwise it would be a
silent no-op).

Tests

Two new LIT tests under amd/comgr/test-lit/:

hotswap-wmma-split.s — exercises all 9 splittable variants (8
K=128 fp8/bf8 = 4 sign combinations × {f16,f32} dest, plus the
v_wmma_f32_32x16x128_f4 M-split case) and confirms a
non-splittable WMMA round-trips unchanged. Uses --allow-grow
with hotswap-rewrite, asserts RESULT: SUCCESS from the API,
asserts the original mnemonics are gone (DISASM-NOT) and the
replacement mnemonics appear (DISASM-DAG), verifies idempotency
via cmp between two consecutive rewrites, and asserts via
llvm-readelf -S that the .text section header sh_size
actually grew on the wire (catches the silent-data-corruption
case where a buggy rewriter appends bytes but forgets to update
the section header).
hotswap-wmma-split-imm-src2.s — same shape, exercises the
inline-immediate $src2 path on 4 representative variants.

Both reuse the existing hotswap-rewrite driver and follow the
test pattern from hotswap-inplace-mixed.s: compile with
%clang -target amdgcn-amd-amdhsa -mcpu=gfx1250 -nostdlib, drive
through hotswap-rewrite ... --output ... --allow-grow, FileCheck
the API result and the %llvm-objdump -d output without passing
--mcpu (the ELF carries the gfx1250 ISA in its e_flags).

Per amd/comgr/AGENT_CONVENTIONS.md:

LIT inputs are compiled with %clang directly, not through Comgr
actions.
The 9-entry split table dispatches by mnemonic string, consistent
with patch-inplace.cpp's Mnemonic == "s_clause" pattern. The
match runs once per WMMA instruction, not in the
buildNopSledMap-style hot loop where cached MC opcodes (per
chinmaydd's review on [AMDGPU] comgr: add HotSwap B0-to-A0 policy and public API (3/3) #2203) matter.
Reuses upstream MC layer (getEncodingValue, smallest-class
search, MCRegister, buildTrampoline); no hardcoded opcodes;
replacement assembly is generated as text and round-tripped
through the asm parser by buildTrampoline.
log() goes through the shared hotswap log() (env-gated
stream), not a parallel logging path.
LLVM_ATTRIBUTE_WEAK for the strong-override pattern; guarded by
#if !defined(_MSC_VER) for the same MSVC weak-symbol reason
documented in patch-inplace.cpp ([Comgr] Fix Windows build: use LLVM_ATTRIBUTE_WEAK for hotswap stubs #2294 / Comgr hotswap: replace hardcoded AMDGPU VGPR/SGPR granule constants with MC-derived values #2285).

Test plan

amd-comgr builds clean on amd-staging base.
All 9 hotswap LIT tests pass (the 7 pre-existing
hotswap-rewrite.c / hotswap-elf-growth.s /
hotswap-inplace-{noop,b64,async,mixed}.s /
hotswap-rewrite-e2e.hip continue to pass; both new
hotswap-wmma-split{,-imm-src2}.s pass end-to-end).
Verify under AddressSanitizer (-DADDRESS_SANITIZER=On build,
ASAN_OPTIONS=halt_on_error=1:detect_leaks=1:detect_stack_use_after_return=1:strict_init_order=1)
— all hotswap LIT tests pass with no ASAN errors or leak
reports (per AGENT_CONVENTIONS.md).

Commits

[Comgr] hotswap-rewrite test driver: add --allow-grow flag for trampoline-growth tests — preempts yxsamliu's review concern by
keeping the strict size check by default and exposing an opt-in
flag for trampoline-growth tests.
[AMDGPU] comgr: HotSwap K=128 WMMA split patches for B0-to-A0
— the splitter, plugging into the applyWmmaSplitPatches weak
stub from [AMDGPU] comgr: add HotSwap B0-to-A0 policy and public API (3/3) #2203.
[AMDGPU] comgr: WMMA splitter LIT tests: assert .text actually grew on the wire — llvm-readelf -S size assertion in both
LIT tests.
[AMDGPU] comgr: WMMA splitter: cache findSmallestEnclosingClass per MCRegister — thread_local DenseMap memoization with
MRI-pointer invalidation; replaces the O(num_classes) per-operand
walk with O(1) hash lookups for repeated VGPR ranges.

…line-growth tests Trampoline-based hotswap rewrites (e.g. the WMMA splitter landing in the next commit) append to .text, so OutSize > ElfSize. The existing output-size check on the --output path correctly catches accidental .text growth in in-place tests (per yxsamliu's review on #2222) but blocks any test that legitimately exercises the trampoline path. Add an opt-in --allow-grow flag that skips the OutSize == ElfSize check on the --output path. The default behavior is unchanged: tests that omit --allow-grow still get the strict check. The flag is rejected when --output is not also given (no other path inspects OutSize, so --allow-grow alone would be a no-op invariant worth flagging at the CLI rather than letting it pass silently). Verified locally: all six existing hotswap LIT tests (hotswap-rewrite.c, hotswap-elf-growth.s, hotswap-inplace-{noop,b64,async,mixed}.s, hotswap-rewrite-e2e.hip) continue to pass with this change (none of them pass --allow-grow, so they exercise the strict-check path as before).

Decomposes WMMA variants present on GFX1250 B0 but not on A0 into pairs of narrower WMMAs that exist on both steppings, emitted as trampolines appended to .text by the existing hotswap rewriter: - v_wmma_*_16x16x128_{fp8,bf8}_{fp8,bf8} -> two 16x16x64 halves (K dimension split, accumulator threads through) - v_wmma_f32_32x16x128_f4 -> two 16x16x128_f8f6f4 halves (M dimension split, both halves use MATRIX_FMT_FP4 modifiers) Plug-in implementation: the new comgr-hotswap-patch-wmma-split.cpp provides a strong override of the applyWmmaSplitPatches weak stub declared in comgr-hotswap-b0a0.cpp (matches the cluster_load / s_clause pattern from comgr-hotswap-patch-inplace.cpp #2222), so the integration is a single CMakeLists.txt source-list line. No changes to the existing hotswap infrastructure files. Guarded by `#if !defined(_MSC_VER)` for the same MSVC weak-symbol reason as patch-inplace.cpp. VOP3P inline-immediate $src2 is supported: when clang folds a zero-initialized accumulator into an inline imm 0 (the common shape for `C = wmma(A, B, 0)`-style code), extractWmmaRegOperands accepts the 3-VGPR-plus-immediate form and propagates the immediate into the first split's $src2 slot, with the dst register threading the carry into the second split. VGPR base/count extraction: - Base: low 10 bits of MRI.getEncodingValue() (= the VGPR/SGPR index; the high bits encode hardware flags like IS_VGPR = 1 << 10 per SIDefines.h's AMDGPU::HWEncoding). The mask 0x3ff is the stable AMDGPU HW encoding -- target-internal, not exposed to comgr, so the value is named locally. - Count: bit width of the smallest enclosing MCRegisterClass, divided by 32 -- the same pattern AMDGPUDisassembler:: CheckVGPROverflow uses. Note: MCRegisterInfo::regunits() is NOT the right primitive here; on GFX12+ each VGPR_32 has two regunits (lo16 + hi16 sub-register slots), so a VReg_256 (8 VGPRs, 256 bits) yields 16 regunits, not 8. Per AGENT_CONVENTIONS.md: - Reuses upstream MC layer: getEncodingValue, getMinimalPhysReg- Class-equivalent (manual smallest-class search), MCRegister, buildTrampoline. No hardcoded opcodes; no parallel implementations. - Replacement assembly is generated as text and round-tripped through the asm parser by buildTrampoline -- the splitter never encodes WMMA bytes directly. - log() goes through the shared hotswap log() (env-gated stream), not a parallel logging path. - The 9-entry split table uses mnemonic-string match (consistent with patch-inplace.cpp's `Mnemonic == "s_clause"` pattern). Performance is irrelevant here -- match runs once per WMMA instruction, not in the buildNopSledMap-style hot loop where cached MC opcodes (per chinmaydd's review on #2203) matter. Tests: hotswap-wmma-split.s exercises all 9 splittable variants (8 K=128 fp8/bf8 = 4 sign combinations x {f16,f32} dest, plus the 32x16x128_f4 M-split case) and confirms a non-splittable WMMA round-trips unchanged. hotswap-wmma-split-imm-src2.s exercises the inline-immediate $src2 path on 4 representative variants. Both LIT tests reuse the existing hotswap-rewrite driver (test-lit/comgr- sources/hotswap-rewrite.c, with the --allow-grow flag from the prior commit), follow the upstream test pattern from hotswap-inplace-mixed.s (compile with `%clang -nostdlib`, drive through `--output ... --allow-grow`, FileCheck the API result and the disassembly, verify idempotency with cmp), and assert against %llvm-objdump output without passing --mcpu (the ELF carries the gfx1250 ISA in its e_flags). Verified end-to-end: all six pre-existing hotswap LIT tests (hotswap-rewrite.c, hotswap-elf-growth.s, hotswap-inplace-{noop,b64,async,mixed}.s, hotswap-rewrite-e2e.hip) continue to pass alongside both new WMMA-split LIT tests.

…n the wire The DISASM checks already prove the new mnemonics appear in the disassembled output, but the disassembler walks raw .text bytes regardless of the section header's sh_size field. A buggy rewriter that appended trampoline bytes but forgot to update the section header would still pass the disassembly check, while every downstream tool that respects section headers (the HSA loader's relocation pass, ELF strippers, debuggers) would silently miss the appended trampolines. Use llvm-readelf -S to capture the .text Size column for both the input and output ELFs, then assert in shell that the output is strictly larger. The assertion uses bash arithmetic ($((16#...))) on the hex Size field; the existing LIT tests already use bash for shell pipelines, so no new portability surface. Verified locally on amd-staging: both WMMA-split LIT tests pass with the new assertion (input .text 0x910 -> output 0x9c8 for the 10-WMMA hotswap-wmma-split.s case; comparable growth for the 4-WMMA hotswap-wmma-split-imm-src2.s case).

…CRegister Each VGPR operand lookup walked all ~100 AMDGPU register classes; for kernels with many WMMAs reusing the same VGPR ranges (the common shape for tile-multiply loops), this is wasted work. Memoize the smallest-enclosing class per MCRegister in a thread_local DenseMap. The cache is invalidated when the MCRegisterInfo pointer changes between calls -- necessary because each retargetCodeObjectB0A0 call constructs a fresh MCRegisterInfo via initLLVM, so the MCRegisterClass pointers are NOT process-stable. A static cache across calls would dangle when the next rewrite gets a new MRI. thread_local is single-threaded by construction, so no mutex is needed; the small construction/destruction overhead is paid once per thread on first use, not per lookup. Functional behavior unchanged: all 9 hotswap LIT tests still pass.

Resolves two textual conflicts that surfaced after upstream landed new hotswap patches (#2265 WMMA hazard, #2287 barrier rewrite, etc.): * amd/comgr/CMakeLists.txt Both #2265 (`comgr-hotswap-patch-wmma-hazard.cpp`) and this branch (`comgr-hotswap-patch-wmma-split.cpp`) add a source file at the same position in the SOURCES list. Resolved by including both in alphabetical order (hazard, then split). * amd/comgr/test-lit/comgr-sources/hotswap-rewrite.c This branch added an opt-in `--allow-grow` flag to bypass the output-size check on the `--output` path (PR #1 commit eacb7dc). In #2265, upstream went the opposite direction and removed the size check on the `--output` path entirely. Resolved by taking upstream's version verbatim, since their direction is the consensus and it makes the `--allow-grow` flag redundant. This effectively reverts the eacb7dc commit's diff for this file. The driver-side `--allow-grow` flag parsing is gone after the merge; my LIT tests are updated below to drop the flag from their RUN lines so they invoke the driver as upstream now expects. Also stripped `--allow-grow` from the two WMMA-split LIT tests (`hotswap-wmma-split.s`, `hotswap-wmma-split-imm-src2.s`) so they match the new driver signature. Verified: all 15 hotswap LIT tests pass after the merge -- the 6 new upstream ones (hotswap-barrier-{,no-}isfirst.s, hotswap-wmma-hazard.s, etc.) plus the existing 7 plus my 2.

z1-cciauto · 2026-04-30T18:33:54Z

PSDB Link: https://compiler-ci.amd.com/job/compiler-psdb-amd-staging/5619

chinmaydd · 2026-05-03T22:23:58Z

+  // When src2 is an inline immediate (compiler-folded zero accumulator), both
+  // halves use the same immediate. When it's a VGPR range, each half takes
+  // its own slice (lower / upper half of M).
+  std::string CLo = R.Src2IsImm ? itostr(R.Src2Imm)


Does this handle fp operands

From claude: Empirically: no. validateSplitOperands now refuses any non-zero src2 inline immediate. Only integer 0 (compiler-folded zero accumulator) is accepted. Non-zero / FP inline constants would need MCInstPrinter-mediated formatting (no public single-operand entry point exists) so we refuse rather than mis-encode.

^Do you need it to handle fp operands?

I would expect so ?

I'm not involved closely into the choice of what kernels we are supporting so you would have a better idea if we are to reject those with fp instruction operands. Either decision should be explicitly documented

Sounds good. I'll make it handle fp operands anyway, and I'll need to test this using the runtime via. some tests with FFM. It was due to one such test at runtime where I caught situations the existing splitting patches didn't catch i.e., when you do C = A * B + C (so one less operand) or (i guess) when C is initialized to 0.

Gotcha, thanks

Addresses all six inline review comments from chinmaydd on PR #2379: * line 138 (Avoid auto throughout): replaced the one `auto It = Cache.find(...)` with the explicit DenseMap<unsigned, const MCRegisterClass *>::iterator type. No other `auto` remains in the file. * line 182 (use AMDGPU named operand metadata for src2, src2_modifiers, neg_lo, neg_hi -- duplicating the layout is fragile): replaced the "walk the first four register operands" pattern with a per-SplitKind VOP3PWmmaLayout table that names each MCInst slot (vdst, src0, src1, src2_modifiers, src2, plus any trailing modifier slots present in the profile). The K=128 fp8/bf8 family and the f4 form have empirically-different operand counts (7 vs 5 ops at MCInst level), so a single shared positional table would have been wrong; splitting the table by SplitKind keeps each layout self-documenting and the structural validation in extractWmmaOps() (operand count + per-slot reg/imm kind check) catches any TableGen drift loudly rather than silently miscompiling. The AMDGPU backend's getNamedOperandIdx() and OpName enum live in llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.h, which is a backend-private header (not installed in the LLVM dist), so we follow the same mirror-and-document pattern that comgr-hotswap-patch-wmma-hazard.cpp (#2265) uses for SIInstrFlags. * line 318 (Does this handle fp operands [for src2 inline imm]): no -- and now refuses explicitly. validateSplitOperands() rejects any non-zero src2 inline immediate with a log message. Only integer 0 (the compiler-folded zero accumulator -- the common shape for `C = wmma(A, B, 0)`-style code) is accepted. Non-zero / FP inline constants would need MCInstPrinter-mediated formatting to emit canonically; MCInstPrinter has no public single-operand entry point, so emitting via itostr would mis-encode FP inline constants like 1.0 (which encode at HW slot 242, not as a 32-bit literal). * line 342 (how are modifiers like neg_lo handled for src2): not preserved -- and now refuses explicitly. The K=128 fp8/bf8 family and the f4 form do NOT accept any modifier asm syntax (op_sel, neg_lo, neg_hi, clamp, matrix_*_reuse) per the AMDGPU asm parser (verified empirically by attempting to assemble each modifier; all rejected at clang -c time). The TableGen profile reserves slots for some modifiers (e.g. HasMatrixReuse=1 on F32_FP8BF8X128_WMMA_w32) but the asm grammar does not expose them at this opcode level. extractWmmaOps() defensively scans every trailing operand from src2_modifiers onward; if any is non-zero, the splitter refuses the rewrite. This means a future opcode evolution that exposes modifier syntax fails loudly here rather than silently dropping the modifier. * line 342 (what about cases where src2 is an inline constant [with modifiers]): both the non-zero src2 imm path and the modifier path are now rejected, so the inline-constant-with-modifier case is doubly refused. * line 1 of the imm-src2 test (test coverage is weak): expanded hotswap-wmma-split.s to add exact-register-slicing checks for the fp8_fp8 K-split (verifies first-half A_lo=v[0:7], B_lo=v[8:15], src2=original v[16:23], and second-half A_hi=v[8:15], B_hi=v[16:23], src2=dst v[16:23] -- catches off-by-one slicing) and for the f4 M-split (verifies dst slices v[4:11]/v[12:19], A slices v[0:7]/ v[8:15], B broadcast v[2:9], and the literal MATRIX_FMT_FP4 modifier on both halves). Added a new hotswap-wmma-split-refuse.s LIT test that exercises the non-zero src2 imm refusal path (the original K=128 opcode must remain unchanged in the output, and idempotency still holds across two rewrites). Modifier-bearing tests are not written: the assembler refuses all modifier syntax for these 9 opcodes, so a modifier-bearing input cannot be produced by clang to feed the splitter -- the defensive refusal in extractWmmaOps() is unreachable from any LIT-constructible input, which is itself the test that the contract holds. Verified end-to-end on amd-staging amd-llvm dist with all 16 hotswap LIT tests pass (the 13 pre-existing tests plus the two WMMA-split tests with the new exact-slicing assertions plus the new refusal test).

…iates via printer round-trip Addresses chinmaydd's follow-up review on PR #2379 (line 480 "Why are we rejecting them rather than supporting them" / "This seems plain wrong" pointing to llvm/test/MC/AMDGPU/gfx1250_asm_wmma_w32.s, and line 456 on FP operand handling). The previous revision refused any input with a non-zero src2 inline immediate or any non-zero modifier slot, on the (incorrect) empirical claim that the AMDGPU asm parser does not accept modifiers for these 9 opcodes. Upstream's MC test gfx1250_asm_wmma_w32.s shows that all of: - src2 inline FP constant (e.g. `1.0`) - `neg_lo:[0,0,1]` and `neg_hi:[0,0,1]` (src2 NEG bits) - `matrix_a_reuse` and `matrix_b_reuse` (HW data-reuse hints) are accepted on every splittable opcode and produce real (distinct) encodings. The earlier rejection would mean the rewriter leaves the original A0-incompatible opcode in place for any kernel using these forms, which the runtime would then refuse to load. New design: print the source instruction in canonical asm form via LLVM's MCInstPrinter::printInst, then perform textual surgery on the result to produce each split half. The splitter never has to reproduce the printer's per-operand formatting decisions (FP inline constant 1.0 vs integer 1 -- which encode at distinct inline-const slots 242 vs 1 -- nor modifier suffix ordering). Any input the printer accepts is preserved verbatim modulo the per-half transformations: - K-split first half: original operand list with src0/src1 sliced to the lower halves; src2 and modifier suffix preserved verbatim. - K-split second half: src0/src1 sliced to the upper halves; src2 replaced with the dst register (the accumulator carry); modifier suffix has the src2-bit cleared in neg_lo:[X,Y,Z] / neg_hi:[X,Y,Z] (the operand at the src2 slot is no longer the original src2). matrix_a_reuse / matrix_b_reuse are stripped on both halves because they assert HW reuse-buffer validity that no longer holds across an instruction-stream rewrite. - M-split halves: dst, src0, src2 (when VGPR) sliced to lower / upper halves; src1 broadcast; modifier suffix preserved on both halves with matrix_a_reuse / matrix_b_reuse stripped; the splitter-added MATRIX_FMT_FP4 modifiers stay (the f8f6f4 destination opcode requires them and the f4 source does not carry them). The modifier-suffix tokenizer and packed-modifier parser (parsePackedModifier, transformModifierSuffix) understand the `<name>:[X,Y,Z]` shape used by neg_lo / neg_hi / op_sel and the bareword shape used by matrix_a_reuse / matrix_b_reuse / clamp. Other unrecognized modifier tokens (op_sel, clamp, future additions) are passed through unchanged on first half and on M-split halves; the K-split second half passes them through too (only neg_lo / neg_hi are clearable). The defensive operand-count and operand-kind validation in extractWmmaOps remains, so any TableGen-driven layout drift fails loudly with "VOP3P layout drift -- update the table" rather than silently miscompiling. New LIT test hotswap-wmma-split-modifiers.s exercises the full contract: K-splits with FP 1.0, neg_lo:[0,0,1], neg_hi:[0,0,1], matrix_a_reuse, matrix_b_reuse; M-splits with FP 1.0 and neg_lo:[0,0,1]. Each input form is from the upstream MC test's known-supported syntax for these opcodes; each DAG check verifies the exact post-split operand and modifier shape per the documented transformation rules. The obsolete hotswap-wmma-split-refuse.s LIT test (which asserted refusal of non-zero src2 imm) is removed -- non-zero src2 imm is now correctly handled. Verified: all 16 hotswap LIT tests pass.

z1-cciauto · 2026-05-04T16:55:13Z

PSDB Link: https://compiler-ci.amd.com/job/compiler-psdb-amd-staging/5679

chinmaydd · 2026-05-04T18:30:52Z

+// Each CHECK-DAG below asserts both the replacement mnemonic AND that the
+// assembler emitted at least one form with the inline immediate 0 as $src2
+// -- the trailing literal `, 0` is what we fed in. For K-splits the second
+// half uses dst as $src2 (the accumulator carry), so each pattern matches
+// exactly the first half of its split. For the M-split (FP4) both halves
+// use the same imm, but one match per mnemonic is enough to prove the
+// splitter propagated the imm into the rewritten code.


I would prefer explicit tests that check for this behavior. You can separate out individual tests and check that the various aspects of the splitting work as intended. For instance using the src2 as the dst.

The block of DISASM-DAG check lines is not something I am a fan of

chinmaydd · 2026-05-04T18:32:37Z

+  s_endpgm
+.size test_m_neg_lo, .-test_m_neg_lo
+
+// -- Trampoline region: assertions on the rewritten output ------------------


Same for these tests

chinmaydd · 2026-05-04T18:34:58Z

+  if (T.Bytes.empty()) {
+    log() << "hotswap: error: WMMA split: trampoline assembly failed for "
+          << DI.Mnemonic << "\n";
+    return 0;


I think this is a general problem across all patch PRs that we return "0" on failed patching. Error logging is okay, but we need to refactor these APIs to return success/failure more cleanly

For instance, there is no clear way to know if -

a) we found the instruction that needs to be patched but patch was not applied
b) the problematic instruction was never found

Addresses chinmaydd's review on PR #2379: > "I would prefer explicit tests that check for this behavior. You > can separate out individual tests and check that the various > aspects of the splitting work as intended. For instance using the > src2 as the dst." > "The block of DISASM-DAG check lines is not something I am a fan of" Replaces the two consolidated tests (hotswap-wmma-split-{imm-src2,modifiers}.s -- which packed multiple aspects into one file with a trailing DISASM-DAG block) with nine focused per-aspect tests, each exercising ONE source kernel and asserting EXACT post-split operand+modifier shape via ordered DISASM: / DISASM-NEXT: checks (no DAG): - hotswap-wmma-split-int-imm.s -- src2 = inline int 0 (compiler-folded zero accumulator) - hotswap-wmma-split-fp-imm.s -- src2 = inline FP 1.0 in K-split (printer-round-trip preserves the FP slot vs. integer slot distinction in VOP3P inline-const encoding) - hotswap-wmma-split-neg-lo-k.s -- neg_lo:[0,0,1] preserved on first half, dropped on second half (src2 := dst, no negation) - hotswap-wmma-split-neg-hi-k.s -- same shape for neg_hi - hotswap-wmma-split-matrix-a-reuse.s -- matrix_a_reuse stripped on both halves (HW reuse-buffer assertion no longer holds after the data is sliced) - hotswap-wmma-split-matrix-b-reuse.s -- matrix_b_reuse stripped, same - hotswap-wmma-split-msplit-fp-imm.s -- M-split with FP 1.0 src2: BOTH halves preserve the imm (no carry on the M axis) plus the splitter-added MATRIX_FMT_FP4 modifiers required by the f8f6f4 destination opcode - hotswap-wmma-split-msplit-neg-lo.s -- M-split with neg_lo: BOTH halves preserve the modifier - hotswap-wmma-split-src2-eq-dst.s -- the C += A*B accumulator-reuse pattern (vdst == src2 at the source level), which exercises the second-half src2 := dst transformation as a no-op (chinmaydd's specific suggestion) Each new file: ONE source kernel, ONE aspect, exact-operand DISASM checks rather than a DAG block. The trampoline body's two halves are emitted back-to-back, so DISASM-NEXT works for the second half. Verified locally: all 9 new per-aspect tests pass, and the existing hotswap-wmma-split.s smoke test (which legitimately remains a DAG- style multi-variant coverage check, since asserting all 9 splittable mnemonics appear regardless of emission order is what that test is about) continues to pass.

chinmaydd flagged on PR #2379 that the shared dispatcher API in comgr-hotswap-b0a0.cpp returns uint32_t where 0 means EITHER "this patch did not match the instruction" (correct dispatcher fall-through) OR "matched but failed to apply" (silent miscompile risk -- the rewriter returns SUCCESS with the original A0-incompatible opcode left in .text). They identified this as a cross-cutting concern across all hotswap patches (in-place #2222, WMMA hazard #2265, WMMA split, future patches). A proper fix is a separate follow-up upstream PR that refactors the dispatcher to distinguish the two cases (e.g. an enum NoMatch / Patched / Failed return type, or a `bool *Aborted` field threaded through PatchContext that the dispatcher checks and uses to short-circuit the rewrite with AMD_COMGR_STATUS_ERROR). For now: classify each return-0 path in the splitter as either "did not match" (correct) or "matched but failed" (currently silent), add a header comment to applyWmmaSplitPatches documenting the semantics + the cross-PR refactor concern, and tag every matched-but-failed exit with an inline comment so it's clear which return is the silent-miscompile risk. The behavior is unchanged; this is documentation of an existing limitation pending the cross-PR API refactor.

z1-cciauto · 2026-05-07T17:21:28Z

PSDB Link: https://compiler-ci.amd.com/job/compiler-psdb-amd-staging/5775

jammm · 2026-05-07T18:42:50Z

@chinmaydd updated PR based on your comments. PTAL ^^

chinmaydd · 2026-05-08T02:44:30Z

Disclaimer: I've used Claude Opus 4.7 for my review. Its inspired from some of my constraints, but I believe most of them should be a part of AGENT_CONVENTIONS.md.

They should be actionable by another agent IMO.

Mostly looks good. We discussed in today's meetings that adding tests should be considered important here.

The longer term plan involves a broad refactor and tests will help us get there faster and with more confidence.

Blocker

hotswap-wmma-split.s fails deterministically (exit 141) when run in isolation. The awk '/\.text /{print $7; exit}' calls (lines 31–33) close stdin before llvm-readelf finishes writing, and LIT's pipefail shell propagates the SIGPIPE. Reproduces 9/10 in standalone runs; it only passes in the bulk LIT run because output buffering shifts the race.

$ for i in $(seq 10); do build/bin/llvm-lit build/tools/comgr/test-lit/hotswap-wmma-split.s 2>&1 | grep -E "Failed|Passed" | tail -1; done
Failed: 1 (×9), Passed: 1 (×1)

Fix: drop the exit from awk (the file has exactly one .text section so the result is unchanged), or follow the FileCheck pattern used by hotswap-elf-growth.s.

Test-data quality

Almost every K-split test uses input that violates @earlyclobber $vdst on the source WMMA pseudo (VOP3PInstructions.td:1444), so it can't be compiler-generated:

Test	dst	src1	overlap
`hotswap-wmma-split.s` lines 46/85/98/111	`v[16:23]`	`v[8:23]`	`v[16:23]`
`hotswap-wmma-split.s` lines 59/124/137/150	`v[16:19]`	`v[8:23]`	`v[16:19]`
`hotswap-wmma-split.s` line 72 (M-split)	`v[4:19]`	`v[2:9]`	`v[4:9]`
`hotswap-wmma-split-{src2-eq-dst,fp-imm,int-imm}.s`	`v[16:23]`	`v[8:23]`	`v[16:23]`
`hotswap-wmma-split-neg-hi-k.s`	`v[20:23]`	`v[8:23]`	`v[20:23]`

The K-split second half is WMMA dst, A_hi, B_hi, dst where B_hi = upper half of original src1. When that upper half coincides with dst (the case the failing tests pick), the second half reads B_hi from registers the first half just clobbered with the partial product — and the CHECK-DAGs on lines 184–185 lock that in as the contract. The exact-slicing checks would still pass on a real off-by-one in the slicer because the overlap makes the operands incidentally textually identical. Please re-pick disjoint VGPRs (the matrix_a_reuse / neg_lo_k / msplit_* tests already use the right pattern).

Smaller items

transformModifierSuffix only knows about matrix_*_reuse / neg_lo / neg_hi. Any future modifier (op_sel, byte_sel, clamp) on these opcodes would silently pass through unchanged on both halves. Either assert the modifier set is closed or table-drive it per SplitKind.
parsePackedModifier allocates a std::string Prefix per call — the prefix is statically known, hoist as a StringLiteral.
Ctx.OutTrampolines.emplace_back(std::move(T)) (line 644) — push_back for an already-constructed object per LLVM style; the hazard patch uses push_back for the same operation.
buildTrampoline and fixupTrampolineBranches both encode the branch-back; the second overwrites identical bytes. Same redundancy in the hazard patch — likely a buildTrampoline cleanup that should reserve MinInstSize zero bytes the way emitToTrampoline does.
The PR description mentions a --allow-grow flag but hotswap-rewrite.c still has only --zero-size / --output. Stale or removed?

Coverage gaps worth adding:

A multi-WMMA-per-kernel test to exercise TrampTextOffset accumulation.
WMMA-split + WMMA-hazard interaction on the same kernel.

…anups Address PR #2379 review (chinmaydd): - transformModifierSuffix: enforce a closed set of modifier tokens (matrix_a_reuse, matrix_b_reuse, neg_lo:[X,Y,Z], neg_hi:[X,Y,Z]) and return std::nullopt on anything else, so a future K=128/M=32 source mnemonic that grew an unaudited modifier surfaces as a fail-soft matched-but-failed instead of silently carrying the modifier through both halves. Callers in buildSplit128to64Asm / buildSplit32x16Asm propagate by returning an empty replacement vector. - parsePackedModifier: drop the per-call (Name + ":[").str() heap allocation in favor of piecewise StringRef checks; this runs once per modifier per split half, and the prefix is always a literal ("neg_lo" / "neg_hi") so the materialization was pure overhead. - applyWmmaSplitPatches: replace the OutTrampolines.emplace_back(...) call with push_back(std::move(T)) since the argument is already a Trampoline rvalue and there's no in-place construction to optimize.

…rlyclobber, add multi-WMMA Address PR #2379 review (chinmaydd): - hotswap-wmma-split.s: drop `exit` from the `awk '/\.text /{print $7}'` one-liner that captures the .text section size from llvm-readelf -S. With `exit`, awk closes its stdin before llvm-readelf finishes writing; LIT's pipefail shell propagates the resulting SIGPIPE and the size-grew assertion fails non-deterministically in standalone runs (only passes in the bulk LIT run because output buffering shifts the race). The `\.text ` pattern matches at most one section header per ELF, so removing `exit` does not change the captured value. - hotswap-wmma-split{,-src2-eq-dst,-fp-imm,-int-imm,-neg-hi-k}.s: re-pick disjoint VGPR ranges (src0=v[0:15], src1=v[16:31], dst at v[32:39] f32 / v[32:35] f16) so dst no longer overlaps src1. The source pseudo carries `@earlyclobber $vdst` (VOP3PInstructions.td:1444), so compiler-generated WMMAs cannot land in the previous shape; the test inputs now mirror that contract. The matrix_*_reuse / neg_lo_k / msplit_* tests already used the right pattern and are unchanged. hotswap-wmma-split.s gets a header comment explaining the contract so the disjoint-VGPR choice is self-documenting. - hotswap-wmma-split-multi-wmma.s (new): two K=128 WMMAs in one kernel, asserts both trampolines land in source order with the expected K=64 mnemonics. Smallest input that exercises the >1-trampoline path (TrampTextOffset accumulation in applyWmmaSplitPatches) -- a regression in the offset accumulation would land the second trampoline's s_branch-back at the wrong target.

…on coverage Address PR #2379 review (chinmaydd) coverage gap: the WMMA-split and WMMA-hazard passes both append to Ctx.OutTrampolines and both compute their trampoline's `.text` offset by walking the previously appended trampolines (patch-wmma-split.cpp:664-666 and patch-wmma-hazard.cpp:184-186). If either pass forgets to walk OutTrampolines when computing TrampolineTextOffset, the s_branch-back at the tail of one trampoline lands at the wrong target. The two passes never overlap on a single instruction (the splitter only fires on K=128 fp8/bf8/f4 *float* WMMAs; the hazard pass only fires on WMMA *integer* opcodes), but both can fire in the same kernel. Putting both in one kernel and asserting -- in trampoline emission order -- that the split halves land first and the hazard trampoline's 8 v_nops + relocated VALU land *after* exercises the cross-pass offset accumulation. Per-instruction passes run before applyWmmaHazardPatch (comgr-hotswap-b0a0.cpp:323-361), so the split trampoline is appended first and the hazard trampoline second. Asserting in order (DISASM, not DISASM-DAG) catches a swap or a missing trampoline that DAG would mask.

…anups Address PR #2379 review (chinmaydd): - transformModifierSuffix: enforce a closed set of modifier tokens (matrix_a_reuse, matrix_b_reuse, neg_lo:[X,Y,Z], neg_hi:[X,Y,Z]) and return std::nullopt on anything else, so a future K=128/M=32 source mnemonic that grew an unaudited modifier surfaces as a fail-soft matched-but-failed instead of silently carrying the modifier through both halves. Callers in buildSplit128to64Asm / buildSplit32x16Asm propagate by returning an empty replacement vector. - parsePackedModifier: drop the per-call (Name + ":[").str() heap allocation in favor of piecewise StringRef checks; this runs once per modifier per split half, and the prefix is always a literal ("neg_lo" / "neg_hi") so the materialization was pure overhead. - applyWmmaSplitPatches: replace the OutTrampolines.emplace_back(...) call with push_back(std::move(T)) since the argument is already a Trampoline rvalue and there's no in-place construction to optimize. Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>

…rlyclobber, add multi-WMMA Address PR #2379 review (chinmaydd): - hotswap-wmma-split.s: drop `exit` from the `awk '/\.text /{print $7}'` one-liner that captures the .text section size from llvm-readelf -S. With `exit`, awk closes its stdin before llvm-readelf finishes writing; LIT's pipefail shell propagates the resulting SIGPIPE and the size-grew assertion fails non-deterministically in standalone runs (only passes in the bulk LIT run because output buffering shifts the race). The `\.text ` pattern matches at most one section header per ELF, so removing `exit` does not change the captured value. - hotswap-wmma-split{,-src2-eq-dst,-fp-imm,-int-imm,-neg-hi-k}.s: re-pick disjoint VGPR ranges (src0=v[0:15], src1=v[16:31], dst at v[32:39] f32 / v[32:35] f16) so dst no longer overlaps src1. The source pseudo carries `@earlyclobber $vdst` (VOP3PInstructions.td:1444), so compiler-generated WMMAs cannot land in the previous shape; the test inputs now mirror that contract. The matrix_*_reuse / neg_lo_k / msplit_* tests already used the right pattern and are unchanged. hotswap-wmma-split.s gets a header comment explaining the contract so the disjoint-VGPR choice is self-documenting. - hotswap-wmma-split-multi-wmma.s (new): two K=128 WMMAs in one kernel, asserts both trampolines land in source order with the expected K=64 mnemonics. Smallest input that exercises the >1-trampoline path (TrampTextOffset accumulation in applyWmmaSplitPatches) -- a regression in the offset accumulation would land the second trampoline's s_branch-back at the wrong target. Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>

…on coverage Address PR #2379 review (chinmaydd) coverage gap: the WMMA-split and WMMA-hazard passes both append to Ctx.OutTrampolines and both compute their trampoline's `.text` offset by walking the previously appended trampolines (patch-wmma-split.cpp:664-666 and patch-wmma-hazard.cpp:184-186). If either pass forgets to walk OutTrampolines when computing TrampolineTextOffset, the s_branch-back at the tail of one trampoline lands at the wrong target. The two passes never overlap on a single instruction (the splitter only fires on K=128 fp8/bf8/f4 *float* WMMAs; the hazard pass only fires on WMMA *integer* opcodes), but both can fire in the same kernel. Putting both in one kernel and asserting -- in trampoline emission order -- that the split halves land first and the hazard trampoline's 8 v_nops + relocated VALU land *after* exercises the cross-pass offset accumulation. Per-instruction passes run before applyWmmaHazardPatch (comgr-hotswap-b0a0.cpp:323-361), so the split trampoline is appended first and the hazard trampoline second. Asserting in order (DISASM, not DISASM-DAG) catches a swap or a missing trampoline that DAG would mask. Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>

z1-cciauto · 2026-05-08T12:25:00Z

PSDB Link: https://compiler-ci.amd.com/job/compiler-psdb-amd-staging/5806

chinmaydd

LGTM with nits, thanks !

…mment) Two nits from chinmaydd's 2026-05-08T15:32 review: - Spell out std::optional<std::string> at every use-site in buildSplit128to64Asm / buildSplit32x16Asm instead of `auto`. The `auto` had crept back in when I propagated transformModifierSuffix's std::optional return through the callers; the explicit type makes the deref on the next line unambiguous to the reader. - Refresh the matched-but-failed comment on the AsmLines.empty() check in applyWmmaSplitPatches: build*Asm now legitimately returns empty when transformModifierSuffix rejects an unsupported modifier (the closed-set check from the prior cleanup commit), so calling that branch "defensive" is misleading.

z1-cciauto · 2026-05-08T16:47:08Z

PSDB Link: https://compiler-ci.amd.com/job/compiler-psdb-amd-staging/5812

jmmartinez · 2026-05-11T09:41:47Z

+constexpr SplitRow SplitTable[] = {
+    {"v_wmma_f16_16x16x128_fp8_fp8", SplitKind::Split128to64FP8BF8,
+     "v_wmma_f16_16x16x64_fp8_fp8"},
+    {"v_wmma_f16_16x16x128_fp8_bf8", SplitKind::Split128to64FP8BF8,
+     "v_wmma_f16_16x16x64_fp8_bf8"},
+    {"v_wmma_f16_16x16x128_bf8_fp8", SplitKind::Split128to64FP8BF8,
+     "v_wmma_f16_16x16x64_bf8_fp8"},
+    {"v_wmma_f16_16x16x128_bf8_bf8", SplitKind::Split128to64FP8BF8,
+     "v_wmma_f16_16x16x64_bf8_bf8"},
+    {"v_wmma_f32_16x16x128_fp8_fp8", SplitKind::Split128to64FP8BF8,
+     "v_wmma_f32_16x16x64_fp8_fp8"},
+    {"v_wmma_f32_16x16x128_fp8_bf8", SplitKind::Split128to64FP8BF8,
+     "v_wmma_f32_16x16x64_fp8_bf8"},
+    {"v_wmma_f32_16x16x128_bf8_fp8", SplitKind::Split128to64FP8BF8,
+     "v_wmma_f32_16x16x64_bf8_fp8"},
+    {"v_wmma_f32_16x16x128_bf8_bf8", SplitKind::Split128to64FP8BF8,
+     "v_wmma_f32_16x16x64_bf8_bf8"},
+    {"v_wmma_f32_32x16x128_f4", SplitKind::Split32x16to16x16F4,
+     "v_wmma_f32_16x16x128_f8f6f4"},
+};


This is used as a map to do a lookup. I think we should use a StringMap that maps Mnemonic to {Kind, Replacement}.

jmmartinez · 2026-05-11T09:44:05Z

+struct SplitMatch {
+  SplitKind Kind;
+  StringRef Replacement;
+  bool Matched = false;
+};


Is there any case where Kind and Replacement are set and Match is false ?

I think this should be an optional<{Kind, Replacement}>.

jmmartinez · 2026-05-11T09:48:46Z

+  std::pair<int, int> Src1{-1, 0};
+  std::pair<int, int> Src2{-1, 0}; // valid only when Src2IsImm == false
+  bool Src2IsImm = false;
+  bool Valid = false;


Do we use the any of the other fields on an invalid WmmaOps? Cause otherwise, I'd remove the Valid field and use an optional instead (then, a WmmaOps would always be valid, or be a nullopt optional).

…td::optional, switch lookup table to StringMap Address jmmartinez's three structural-cleanup nits on PR #2379: - Convert SplitTable from a constexpr array of {Mnemonic, Kind, Replacement} rows + linear search to a StringMap<SplitRule> built once per process via a function-local static. The map is the natural representation for "look up by mnemonic" and reads more clearly at the call site. - Drop the redundant `Matched` flag on the lookup result. lookupSplitRule now returns std::optional<SplitRule>; absence means no match, presence means a valid {Kind, Replacement} pair. The caller's `if (!Match.Matched)` becomes `if (!Match)` and field accesses become `Match->Kind` / `Match->Replacement`. - Drop the redundant `Valid` flag on WmmaOps. extractWmmaOps now returns std::optional<WmmaOps>; every previous "return R;" on a failure path becomes "return std::nullopt;" and the success path returns the (now always-valid) struct directly. Caller's `if (!Ops.Valid)` becomes `if (!Ops)` and downstream uses pass `*Ops` to validateSplitOperands / buildSplit*Asm (those helpers take `const WmmaOps &` and never read the dropped Valid field). No behavior change: the lookup table contents are identical, the failure paths log the same messages, and the same instructions get patched in the same way. All 25 hotswap LIT tests still pass.

jammm · 2026-05-12T07:58:38Z

@jmmartinez just addressed your comments in the latest push. PTAL :)
@chinmaydd the PR should be good to merge.

…ctor Address jmmartinez's follow-up nit on PR #2379: StringMap has an initializer-list constructor (StringMap.h:151), so the lambda + try_emplace pattern from the previous commit was unnecessarily verbose. Switch to direct brace-initialization of the function-local static. No behavior change. All 25 hotswap LIT tests still pass.

jmmartinez

LGTM

…ound On GFX1250 A0 silicon, an async trap fired between LD_SCALE and the WMMA half of a VOP3PX2 pair leaves the wave with corrupted scale state. The trap handler can rewind the PC for known-paired forms (via the trap-handler PC-rewind path that recognizes 0xCC33 and 0xCC88 opcode high-bytes, landed in rocm-systems), but a standalone v_wmma_f32_16x16x128_f8f6f4 (no preceding LD_SCALE) cannot be safely rewound -- the trap handler unconditionally walks back 8 bytes, which on an unpaired f8f6f4 lands in arbitrary preceding-instruction bytes. This patch prepends an inline-0 LD_SCALE prefix to every standalone f8f6f4 WMMA, turning it into a fused VOP3PX2 with scale=1.0 (a no-op for the WMMA result). The 8-byte prefix also has scale_src2 = VGPR0 (0x100) baked in -- the VOP3PX2 scale_src2 field is architecturally unused, but if left at 0 the SQ mis-decodes it as an SGPR reference and stalls the SALU for 3 cycles; setting it to a VGPR encoding eliminates the false dependency. Same workaround the in-place vop3px2-src2 patch applies to user-emitted VOP3PX2 instructions; baking it into the wrap pass's prefix bytes keeps wrap-emitted trampolines stall-free at creation. Two-pass operation: 1. Decoded[] scan -- wraps user-written standalone WMMAs. 2. Trampoline scan -- wraps splitter-emitted WMMAs sitting in trampoline bytes (the K=128 32x16x128_f4 splitter emits f8f6f4 into trampolines per #2379). Defensive checks: - Refuses to retarget if any 0xCC88 (v_wmma_f32_32x16x128_f4) leftover exists in Decoded[]; the K=128 splitter must eliminate these before the wrap pass runs since V_WMMA_SCALE_F32_32X16X128_F4 doesn't exist on A0 (only on B0+ per MI400 SPG p.145). - Already-wrapped detection skips wrapping if the preceding decoded instruction is a SCALE form (0xCC35 or 0xCC3A); per ISA doc p.158 the SCALE prefix must be contiguous with the WMMA. Wired into b0a0.cpp dispatcher AFTER all per-instruction patches and after applyVop3px2Src2Fix, so the wrap pass sees the final post-split trampoline layout. The wrap pass produces VOP3PX2 instructions with scale_src2 already set to 0x100, so applyVop3px2Src2Fix has no work on the trampoline-emitted forms. LIT tests: - hotswap-vop3px-wrap.s: 5 cases (bare, FP8/FP8 explicit, FP6/FP4, already-wrapped, K=128 split-then-wrap chain). - hotswap-wmma-split{,-msplit-fp-imm,-msplit-neg-lo}.s: CHECK lines updated to expect the post-wrap disassembly (v_wmma_scale_* instead of bare f8f6f4) since the wrap pass now wraps splitter trampoline outputs. - hotswap-vop3px2-src2-noop.s: comments updated to describe the new wrap interaction; the in-place vop3px2-src2 patch still has nothing to do here (the wrap pass's prefix bytes already carry scale_src2 = VGPR0).

Wraps every standalone v_wmma_f32_16x16x128_f8f6f4 into a fused VOP3PX2 by prepending an inline-0 LD_SCALE prefix (numerically a no-op, scale=1.0). The 8-byte prefix is byte-identical for every wrap site, so the WMMA portion stays bit-identical and modifier-rich operand layouts (matrix_a_fmt, matrix_b_fmt, neg_lo, neg_hi, matrix_a_reuse, matrix_b_reuse, ...) are preserved without re-encoding. The prefix bytes also carry scale_src2 = VGPR0 (0x100), matching the encoding the in-place vop3px2-src2 patch produces, so no second pass is needed on wrap-emitted forms. Two-pass operation: 1. Decoded[] scan -- wraps user-written standalone WMMAs. 2. Trampoline scan -- wraps splitter-emitted WMMAs sitting in trampoline bytes (the K=128 32x16x128_f4 splitter from #2379 emits f8f6f4 into trampolines). Defensive checks: - Refuses to retarget if any v_wmma_f32_32x16x128_f4 (0xCC88) leftover exists in Decoded[] -- the K=128 splitter must eliminate these before the wrap pass runs. - Already-wrapped detection: skips wrapping if the preceding decoded instruction is a SCALE form (per ISA doc p.158, the SCALE prefix must be contiguous with the WMMA). Wired into b0a0.cpp dispatcher AFTER all per-instruction patches and after applyVop3px2Src2Fix, so the wrap pass sees the final post-split trampoline layout. Pairs with the existing trap-handler split path in rocm-systems (commit 74c647e6605, "rocr: GFX12.5 : VOP3PX instruction split in trap handler"). LIT tests: - hotswap-vop3px-wrap.s: 5 cases (bare, FP8/FP8 explicit, FP6/FP4, already-wrapped, K=128 split-then-wrap chain). - hotswap-wmma-split{,-msplit-fp-imm,-msplit-neg-lo}.s: CHECK lines updated to expect post-wrap disassembly (v_wmma_scale_* instead of bare f8f6f4) since the wrap pass now wraps splitter outputs. - hotswap-vop3px2-src2-noop.s: comments updated to describe the new wrap interaction.

jammm added 4 commits April 30, 2026 04:46

jammm added hotswap Related to the Comgr Hotswap feature ci:skip Skip all CI builds/tests for this PR labels Apr 30, 2026

xintin added the comgr Related to Code Object Manager label Apr 30, 2026

jammm marked this pull request as ready for review May 1, 2026 07:46

jammm requested review from chinmaydd and lamb-j as code owners May 1, 2026 07:46

chinmaydd requested review from jmmartinez, yoonseoch and yxsamliu May 1, 2026 16:44

chinmaydd reviewed May 3, 2026

View reviewed changes

Comment thread amd/comgr/src/comgr-hotswap-patch-wmma-split.cpp

chinmaydd reviewed May 3, 2026

View reviewed changes

Comment thread amd/comgr/src/comgr-hotswap-patch-wmma-split.cpp Outdated

chinmaydd reviewed May 3, 2026

View reviewed changes

Comment thread amd/comgr/test-lit/hotswap-wmma-split-imm-src2.s Outdated

chinmaydd reviewed May 3, 2026

View reviewed changes

Comment thread amd/comgr/src/comgr-hotswap-patch-wmma-split.cpp Outdated

jammm added 2 commits May 3, 2026 23:36

chinmaydd reviewed May 4, 2026

View reviewed changes

jammm added 2 commits May 7, 2026 09:47

jammm added 3 commits May 8, 2026 02:32

jammm force-pushed the users/jam/wmma-split-gfx1250-staging branch from 5cc6549 to 7165ca7 Compare May 8, 2026 12:23

jammm requested a review from chinmaydd May 8, 2026 12:24

chinmaydd reviewed May 8, 2026

View reviewed changes

Comment thread amd/comgr/src/comgr-hotswap-patch-wmma-split.cpp Outdated

chinmaydd reviewed May 8, 2026

View reviewed changes

Comment thread amd/comgr/src/comgr-hotswap-patch-wmma-split.cpp Outdated

chinmaydd approved these changes May 8, 2026

View reviewed changes

jmmartinez reviewed May 11, 2026

View reviewed changes

xintin mentioned this pull request May 11, 2026

[AMDGPU] comgr: Hotswap dispatcher vtable + .def registry, fix Windows support #2491

Merged

2 tasks

jammm requested a review from jmmartinez May 12, 2026 07:58

jmmartinez reviewed May 12, 2026

View reviewed changes

Comment thread amd/comgr/src/comgr-hotswap-patch-wmma-split.cpp Outdated

jmmartinez approved these changes May 12, 2026

View reviewed changes

chinmaydd enabled auto-merge (squash) May 12, 2026 14:27

chinmaydd merged commit 75c6512 into amd-staging May 12, 2026
24 checks passed

chinmaydd deleted the users/jam/wmma-split-gfx1250-staging branch May 12, 2026 15:01

jammm mentioned this pull request May 13, 2026

[AMDGPU] comgr: VOP3PX2 wrap and s_sethalt rewrite for GFX1250 B0-to-A0 hotswap #2519

Draft

4 tasks

Conversation

jammm commented Apr 30, 2026

Summary

Splitter scope

Test driver tweak

Tests

Test plan

Commits

Uh oh!

z1-cciauto commented Apr 30, 2026

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jammm May 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

z1-cciauto commented May 4, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

z1-cciauto commented May 7, 2026

Uh oh!

jammm commented May 7, 2026

Uh oh!

chinmaydd commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Blocker

Test-data quality

Smaller items

Uh oh!

z1-cciauto commented May 8, 2026

Uh oh!

Uh oh!

Uh oh!

chinmaydd left a comment

Choose a reason for hiding this comment

Uh oh!

z1-cciauto commented May 8, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jammm commented May 12, 2026

Uh oh!

Uh oh!

jmmartinez left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

jammm May 4, 2026 •

edited

Loading

chinmaydd commented May 8, 2026 •

edited

Loading