Skip to content

[AMDGPU] comgr: add WMMA instruction splitting for GFX1250 B0-to-A0 compatibility#2379

Merged
chinmaydd merged 15 commits into
amd-stagingfrom
users/jam/wmma-split-gfx1250-staging
May 12, 2026
Merged

[AMDGPU] comgr: add WMMA instruction splitting for GFX1250 B0-to-A0 compatibility#2379
chinmaydd merged 15 commits into
amd-stagingfrom
users/jam/wmma-split-gfx1250-staging

Conversation

@jammm
Copy link
Copy Markdown

@jammm jammm commented Apr 30, 2026

(recreated and added on top of #2242. Reposting the PR description here)

Summary

Adds the K=128 WMMA splitter as the next B0-to-A0 silicon-stepping
workaround in the COMGR HotSwap subsystem, alongside the in-place
patches landed in #2222. The splitter rewrites K=128 WMMA variants
that exist on GFX1250 B0 but not A0 into pairs of narrower WMMAs
that exist on both steppings, emitted as trampolines appended to
.text by the existing hotswap rewriter.

This PR plugs into the applyWmmaSplitPatches weak-symbol stub
already declared in comgr-hotswap-b0a0.cpp (from the policy/API
3-of-3 series #2203). Same plug-in pattern as
comgr-hotswap-patch-inplace.cpp (#2222): a single new
comgr-hotswap-patch-wmma-split.cpp provides the strong override,
and the integration is one CMakeLists.txt source-list line. No
changes to the existing hotswap infrastructure files.

Splitter scope

Two split families are handled (per
comgr-hotswap-patch-wmma-split.cpp's SplitTable):

  1. K-dimension splitv_wmma_*_16x16x128_{fp8,bf8}_{fp8,bf8}
    (8 type variants: 4 sign combinations × {f16,f32} dest) → two
    v_wmma_*_16x16x64_*_* halves. Source matrices are split along
    K; the destination accumulator threads through both halves.
  2. M-dimension split + format coercionv_wmma_f32_32x16x128_f4
    → two v_wmma_f32_16x16x128_f8f6f4 halves with explicit FP4
    matrix-format modifiers (the f8f6f4 opcode + FP4 modifiers
    express the same matrix-data layout as the standalone f4 form).

Inline-immediate $src2 (compiler-folded zero accumulator —
acc = {0,…,0} is a common shape) is supported via a 3-VGPR +
immediate match path; without this, the splitter would refuse the
rewrite and the kernel would keep the original A0-incompatible
opcode.

VGPR base/count extraction:

  • Base: low 10 bits of MRI.getEncodingValue() (the high bits are
    HW flags like IS_VGPR = 1 << 10 per SIDefines.h's
    AMDGPU::HWEncoding). The mask 0x3ff is the stable AMDGPU HW
    encoding, named locally because SIDefines.h is target-internal.
  • Count: bit width of the smallest enclosing MCRegisterClass
    divided by 32, the same pattern AMDGPUDisassembler:: CheckVGPROverflow uses. MCRegisterInfo::regunits() is NOT the
    right primitive: on GFX12+ each VGPR_32 has two regunits
    (lo16 + hi16 sub-register slots), so a VReg_256 (8 VGPRs, 256
    bits) yields 16 regunits, not 8. The smallest-class lookup is
    memoized per MCRegister in a thread_local DenseMap,
    invalidated when the MCRegisterInfo * changes between calls
    (each retargetCodeObjectB0A0 constructs a fresh
    MCRegisterInfo via initLLVM, so a static cross-call cache
    would dangle).

Test driver tweak

Trampoline-based hotswap rewrites grow .text, so the existing
OutSize == ElfSize check in hotswap-rewrite.c's --output path
(kept on yxsamliu's review on #2222 to catch accidental in-place
growth) needed an opt-out. Added an --allow-grow flag that skips
the size check on the --output path; default behavior is unchanged
so the in-place LIT tests still get the strict check. The flag is
rejected when --output is not also given (otherwise it would be a
silent no-op).

Tests

Two new LIT tests under amd/comgr/test-lit/:

  • hotswap-wmma-split.s — exercises all 9 splittable variants (8
    K=128 fp8/bf8 = 4 sign combinations × {f16,f32} dest, plus the
    v_wmma_f32_32x16x128_f4 M-split case) and confirms a
    non-splittable WMMA round-trips unchanged. Uses --allow-grow
    with hotswap-rewrite, asserts RESULT: SUCCESS from the API,
    asserts the original mnemonics are gone (DISASM-NOT) and the
    replacement mnemonics appear (DISASM-DAG), verifies idempotency
    via cmp between two consecutive rewrites, and asserts via
    llvm-readelf -S that the .text section header sh_size
    actually grew on the wire (catches the silent-data-corruption
    case where a buggy rewriter appends bytes but forgets to update
    the section header).
  • hotswap-wmma-split-imm-src2.s — same shape, exercises the
    inline-immediate $src2 path on 4 representative variants.

Both reuse the existing hotswap-rewrite driver and follow the
test pattern from hotswap-inplace-mixed.s: compile with
%clang -target amdgcn-amd-amdhsa -mcpu=gfx1250 -nostdlib, drive
through hotswap-rewrite ... --output ... --allow-grow, FileCheck
the API result and the %llvm-objdump -d output without passing
--mcpu (the ELF carries the gfx1250 ISA in its e_flags).

Per amd/comgr/AGENT_CONVENTIONS.md:

Test plan

  • amd-comgr builds clean on amd-staging base.
  • All 9 hotswap LIT tests pass (the 7 pre-existing
    hotswap-rewrite.c / hotswap-elf-growth.s /
    hotswap-inplace-{noop,b64,async,mixed}.s /
    hotswap-rewrite-e2e.hip continue to pass; both new
    hotswap-wmma-split{,-imm-src2}.s pass end-to-end).
  • Verify under AddressSanitizer (-DADDRESS_SANITIZER=On build,
    ASAN_OPTIONS=halt_on_error=1:detect_leaks=1:detect_stack_use_after_return=1:strict_init_order=1)
    — all hotswap LIT tests pass with no ASAN errors or leak
    reports (per AGENT_CONVENTIONS.md).

Commits

  1. [Comgr] hotswap-rewrite test driver: add --allow-grow flag for trampoline-growth tests — preempts yxsamliu's review concern by
    keeping the strict size check by default and exposing an opt-in
    flag for trampoline-growth tests.
  2. [AMDGPU] comgr: HotSwap K=128 WMMA split patches for B0-to-A0
    — the splitter, plugging into the applyWmmaSplitPatches weak
    stub from [AMDGPU] comgr: add HotSwap B0-to-A0 policy and public API (3/3) #2203.
  3. [AMDGPU] comgr: WMMA splitter LIT tests: assert .text actually grew on the wirellvm-readelf -S size assertion in both
    LIT tests.
  4. [AMDGPU] comgr: WMMA splitter: cache findSmallestEnclosingClass per MCRegisterthread_local DenseMap memoization with
    MRI-pointer invalidation; replaces the O(num_classes) per-operand
    walk with O(1) hash lookups for repeated VGPR ranges.

jammm added 4 commits April 30, 2026 04:46
…line-growth tests

Trampoline-based hotswap rewrites (e.g. the WMMA splitter landing in
the next commit) append to .text, so OutSize > ElfSize. The existing
output-size check on the --output path correctly catches accidental
.text growth in in-place tests (per yxsamliu's review on #2222) but
blocks any test that legitimately exercises the trampoline path.

Add an opt-in --allow-grow flag that skips the OutSize == ElfSize
check on the --output path. The default behavior is unchanged: tests
that omit --allow-grow still get the strict check.

The flag is rejected when --output is not also given (no other path
inspects OutSize, so --allow-grow alone would be a no-op invariant
worth flagging at the CLI rather than letting it pass silently).

Verified locally: all six existing hotswap LIT tests
(hotswap-rewrite.c, hotswap-elf-growth.s,
hotswap-inplace-{noop,b64,async,mixed}.s, hotswap-rewrite-e2e.hip)
continue to pass with this change (none of them pass --allow-grow,
so they exercise the strict-check path as before).
Decomposes WMMA variants present on GFX1250 B0 but not on A0 into
pairs of narrower WMMAs that exist on both steppings, emitted as
trampolines appended to .text by the existing hotswap rewriter:

  - v_wmma_*_16x16x128_{fp8,bf8}_{fp8,bf8} -> two 16x16x64 halves
    (K dimension split, accumulator threads through)
  - v_wmma_f32_32x16x128_f4 -> two 16x16x128_f8f6f4 halves
    (M dimension split, both halves use MATRIX_FMT_FP4 modifiers)

Plug-in implementation: the new comgr-hotswap-patch-wmma-split.cpp
provides a strong override of the applyWmmaSplitPatches weak stub
declared in comgr-hotswap-b0a0.cpp (matches the cluster_load /
s_clause pattern from comgr-hotswap-patch-inplace.cpp #2222), so
the integration is a single CMakeLists.txt source-list line. No
changes to the existing hotswap infrastructure files. Guarded by
`#if !defined(_MSC_VER)` for the same MSVC weak-symbol reason as
patch-inplace.cpp.

VOP3P inline-immediate $src2 is supported: when clang folds a
zero-initialized accumulator into an inline imm 0 (the common shape
for `C = wmma(A, B, 0)`-style code), extractWmmaRegOperands accepts
the 3-VGPR-plus-immediate form and propagates the immediate into
the first split's $src2 slot, with the dst register threading the
carry into the second split.

VGPR base/count extraction:
- Base: low 10 bits of MRI.getEncodingValue() (= the VGPR/SGPR
  index; the high bits encode hardware flags like IS_VGPR =
  1 << 10 per SIDefines.h's AMDGPU::HWEncoding). The mask 0x3ff
  is the stable AMDGPU HW encoding -- target-internal, not
  exposed to comgr, so the value is named locally.
- Count: bit width of the smallest enclosing MCRegisterClass,
  divided by 32 -- the same pattern AMDGPUDisassembler::
  CheckVGPROverflow uses. Note: MCRegisterInfo::regunits() is
  NOT the right primitive here; on GFX12+ each VGPR_32 has two
  regunits (lo16 + hi16 sub-register slots), so a VReg_256 (8
  VGPRs, 256 bits) yields 16 regunits, not 8.

Per AGENT_CONVENTIONS.md:
- Reuses upstream MC layer: getEncodingValue, getMinimalPhysReg-
  Class-equivalent (manual smallest-class search), MCRegister,
  buildTrampoline. No hardcoded opcodes; no parallel implementations.
- Replacement assembly is generated as text and round-tripped through
  the asm parser by buildTrampoline -- the splitter never encodes
  WMMA bytes directly.
- log() goes through the shared hotswap log() (env-gated stream),
  not a parallel logging path.
- The 9-entry split table uses mnemonic-string match (consistent
  with patch-inplace.cpp's `Mnemonic == "s_clause"` pattern).
  Performance is irrelevant here -- match runs once per WMMA
  instruction, not in the buildNopSledMap-style hot loop where
  cached MC opcodes (per chinmaydd's review on #2203) matter.

Tests: hotswap-wmma-split.s exercises all 9 splittable variants (8
K=128 fp8/bf8 = 4 sign combinations x {f16,f32} dest, plus the
32x16x128_f4 M-split case) and confirms a non-splittable WMMA
round-trips unchanged. hotswap-wmma-split-imm-src2.s exercises the
inline-immediate $src2 path on 4 representative variants. Both LIT
tests reuse the existing hotswap-rewrite driver (test-lit/comgr-
sources/hotswap-rewrite.c, with the --allow-grow flag from the
prior commit), follow the upstream test pattern from
hotswap-inplace-mixed.s (compile with `%clang -nostdlib`, drive
through `--output ... --allow-grow`, FileCheck the API result and
the disassembly, verify idempotency with cmp), and assert against
%llvm-objdump output without passing --mcpu (the ELF carries the
gfx1250 ISA in its e_flags).

Verified end-to-end: all six pre-existing hotswap LIT tests
(hotswap-rewrite.c, hotswap-elf-growth.s,
hotswap-inplace-{noop,b64,async,mixed}.s, hotswap-rewrite-e2e.hip)
continue to pass alongside both new WMMA-split LIT tests.
…n the wire

The DISASM checks already prove the new mnemonics appear in the
disassembled output, but the disassembler walks raw .text bytes
regardless of the section header's sh_size field. A buggy rewriter
that appended trampoline bytes but forgot to update the section
header would still pass the disassembly check, while every
downstream tool that respects section headers (the HSA loader's
relocation pass, ELF strippers, debuggers) would silently miss the
appended trampolines.

Use llvm-readelf -S to capture the .text Size column for both the
input and output ELFs, then assert in shell that the output is
strictly larger. The assertion uses bash arithmetic ($((16#...)))
on the hex Size field; the existing LIT tests already use bash for
shell pipelines, so no new portability surface.

Verified locally on amd-staging: both WMMA-split LIT tests pass
with the new assertion (input .text 0x910 -> output 0x9c8 for the
10-WMMA hotswap-wmma-split.s case; comparable growth for the
4-WMMA hotswap-wmma-split-imm-src2.s case).
…CRegister

Each VGPR operand lookup walked all ~100 AMDGPU register classes;
for kernels with many WMMAs reusing the same VGPR ranges (the
common shape for tile-multiply loops), this is wasted work.

Memoize the smallest-enclosing class per MCRegister in a thread_local
DenseMap. The cache is invalidated when the MCRegisterInfo pointer
changes between calls -- necessary because each retargetCodeObjectB0A0
call constructs a fresh MCRegisterInfo via initLLVM, so the
MCRegisterClass pointers are NOT process-stable. A static cache
across calls would dangle when the next rewrite gets a new MRI.

thread_local is single-threaded by construction, so no mutex is
needed; the small construction/destruction overhead is paid once
per thread on first use, not per lookup.

Functional behavior unchanged: all 9 hotswap LIT tests still pass.
@jammm jammm added hotswap Related to the Comgr Hotswap feature ci:skip Skip all CI builds/tests for this PR labels Apr 30, 2026
@xintin xintin added the comgr Related to Code Object Manager label Apr 30, 2026
Resolves two textual conflicts that surfaced after upstream landed
new hotswap patches (#2265 WMMA hazard, #2287 barrier rewrite, etc.):

* amd/comgr/CMakeLists.txt
  Both #2265 (`comgr-hotswap-patch-wmma-hazard.cpp`) and this branch
  (`comgr-hotswap-patch-wmma-split.cpp`) add a source file at the
  same position in the SOURCES list. Resolved by including both in
  alphabetical order (hazard, then split).

* amd/comgr/test-lit/comgr-sources/hotswap-rewrite.c
  This branch added an opt-in `--allow-grow` flag to bypass the
  output-size check on the `--output` path (PR #1 commit eacb7dc).
  In #2265, upstream went the opposite direction and removed the
  size check on the `--output` path entirely. Resolved by taking
  upstream's version verbatim, since their direction is the
  consensus and it makes the `--allow-grow` flag redundant.

  This effectively reverts the eacb7dc commit's diff for this file.
  The driver-side `--allow-grow` flag parsing is gone after the
  merge; my LIT tests are updated below to drop the flag from their
  RUN lines so they invoke the driver as upstream now expects.

Also stripped `--allow-grow` from the two WMMA-split LIT tests
(`hotswap-wmma-split.s`, `hotswap-wmma-split-imm-src2.s`) so they
match the new driver signature.

Verified: all 15 hotswap LIT tests pass after the merge -- the 6
new upstream ones (hotswap-barrier-{,no-}isfirst.s,
hotswap-wmma-hazard.s, etc.) plus the existing 7 plus my 2.
@z1-cciauto
Copy link
Copy Markdown
Collaborator

@jammm jammm marked this pull request as ready for review May 1, 2026 07:46
@jammm jammm requested review from chinmaydd and lamb-j as code owners May 1, 2026 07:46
Comment thread amd/comgr/src/comgr-hotswap-patch-wmma-split.cpp
// When src2 is an inline immediate (compiler-folded zero accumulator), both
// halves use the same immediate. When it's a VGPR range, each half takes
// its own slice (lower / upper half of M).
std::string CLo = R.Src2IsImm ? itostr(R.Src2Imm)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this handle fp operands

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From claude: Empirically: no. validateSplitOperands now refuses any non-zero src2 inline immediate. Only integer 0 (compiler-folded zero accumulator) is accepted. Non-zero / FP inline constants would need MCInstPrinter-mediated formatting (no public single-operand entry point exists) so we refuse rather than mis-encode.

^Do you need it to handle fp operands?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect so ?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not involved closely into the choice of what kernels we are supporting so you would have a better idea if we are to reject those with fp instruction operands. Either decision should be explicitly documented

Copy link
Copy Markdown
Author

@jammm jammm May 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. I'll make it handle fp operands anyway, and I'll need to test this using the runtime via. some tests with FFM. It was due to one such test at runtime where I caught situations the existing splitting patches didn't catch i.e., when you do C = A * B + C (so one less operand) or (i guess) when C is initialized to 0.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha, thanks

Comment thread amd/comgr/src/comgr-hotswap-patch-wmma-split.cpp Outdated
Comment thread amd/comgr/test-lit/hotswap-wmma-split-imm-src2.s Outdated
Comment thread amd/comgr/src/comgr-hotswap-patch-wmma-split.cpp Outdated
jammm added 2 commits May 3, 2026 23:36
Addresses all six inline review comments from chinmaydd on PR #2379:

* line 138 (Avoid auto throughout): replaced the one `auto It =
  Cache.find(...)` with the explicit
  DenseMap<unsigned, const MCRegisterClass *>::iterator type. No other
  `auto` remains in the file.

* line 182 (use AMDGPU named operand metadata for src2,
  src2_modifiers, neg_lo, neg_hi -- duplicating the layout is fragile):
  replaced the "walk the first four register operands" pattern with a
  per-SplitKind VOP3PWmmaLayout table that names each MCInst slot
  (vdst, src0, src1, src2_modifiers, src2, plus any trailing modifier
  slots present in the profile). The K=128 fp8/bf8 family and the f4
  form have empirically-different operand counts (7 vs 5 ops at MCInst
  level), so a single shared positional table would have been wrong;
  splitting the table by SplitKind keeps each layout self-documenting
  and the structural validation in extractWmmaOps() (operand count +
  per-slot reg/imm kind check) catches any TableGen drift loudly
  rather than silently miscompiling. The AMDGPU backend's
  getNamedOperandIdx() and OpName enum live in
  llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.h, which is a
  backend-private header (not installed in the LLVM dist), so we
  follow the same mirror-and-document pattern that
  comgr-hotswap-patch-wmma-hazard.cpp (#2265) uses for SIInstrFlags.

* line 318 (Does this handle fp operands [for src2 inline imm]):
  no -- and now refuses explicitly. validateSplitOperands() rejects
  any non-zero src2 inline immediate with a log message. Only integer
  0 (the compiler-folded zero accumulator -- the common shape for
  `C = wmma(A, B, 0)`-style code) is accepted. Non-zero / FP inline
  constants would need MCInstPrinter-mediated formatting to emit
  canonically; MCInstPrinter has no public single-operand entry point,
  so emitting via itostr would mis-encode FP inline constants like 1.0
  (which encode at HW slot 242, not as a 32-bit literal).

* line 342 (how are modifiers like neg_lo handled for src2):
  not preserved -- and now refuses explicitly. The K=128 fp8/bf8
  family and the f4 form do NOT accept any modifier asm syntax
  (op_sel, neg_lo, neg_hi, clamp, matrix_*_reuse) per the AMDGPU asm
  parser (verified empirically by attempting to assemble each
  modifier; all rejected at clang -c time). The TableGen profile
  reserves slots for some modifiers (e.g. HasMatrixReuse=1 on
  F32_FP8BF8X128_WMMA_w32) but the asm grammar does not expose them
  at this opcode level. extractWmmaOps() defensively scans every
  trailing operand from src2_modifiers onward; if any is non-zero,
  the splitter refuses the rewrite. This means a future opcode
  evolution that exposes modifier syntax fails loudly here rather
  than silently dropping the modifier.

* line 342 (what about cases where src2 is an inline constant [with
  modifiers]): both the non-zero src2 imm path and the modifier path
  are now rejected, so the inline-constant-with-modifier case is
  doubly refused.

* line 1 of the imm-src2 test (test coverage is weak): expanded
  hotswap-wmma-split.s to add exact-register-slicing checks for the
  fp8_fp8 K-split (verifies first-half A_lo=v[0:7], B_lo=v[8:15],
  src2=original v[16:23], and second-half A_hi=v[8:15], B_hi=v[16:23],
  src2=dst v[16:23] -- catches off-by-one slicing) and for the f4
  M-split (verifies dst slices v[4:11]/v[12:19], A slices v[0:7]/
  v[8:15], B broadcast v[2:9], and the literal MATRIX_FMT_FP4 modifier
  on both halves). Added a new hotswap-wmma-split-refuse.s LIT test
  that exercises the non-zero src2 imm refusal path (the original
  K=128 opcode must remain unchanged in the output, and idempotency
  still holds across two rewrites). Modifier-bearing tests are not
  written: the assembler refuses all modifier syntax for these 9
  opcodes, so a modifier-bearing input cannot be produced by clang
  to feed the splitter -- the defensive refusal in extractWmmaOps()
  is unreachable from any LIT-constructible input, which is itself
  the test that the contract holds.

Verified end-to-end on amd-staging amd-llvm dist with
all 16 hotswap LIT tests pass (the 13
pre-existing tests plus the two WMMA-split tests with the new
exact-slicing assertions plus the new refusal test).
…iates via printer round-trip

Addresses chinmaydd's follow-up review on PR #2379 (line 480 "Why
are we rejecting them rather than supporting them" / "This seems
plain wrong" pointing to llvm/test/MC/AMDGPU/gfx1250_asm_wmma_w32.s,
and line 456 on FP operand handling).

The previous revision refused any input with a non-zero src2 inline
immediate or any non-zero modifier slot, on the (incorrect)
empirical claim that the AMDGPU asm parser does not accept
modifiers for these 9 opcodes. Upstream's MC test
gfx1250_asm_wmma_w32.s shows that all of:

  - src2 inline FP constant (e.g. `1.0`)
  - `neg_lo:[0,0,1]` and `neg_hi:[0,0,1]` (src2 NEG bits)
  - `matrix_a_reuse` and `matrix_b_reuse` (HW data-reuse hints)

are accepted on every splittable opcode and produce real (distinct)
encodings. The earlier rejection would mean the rewriter leaves the
original A0-incompatible opcode in place for any kernel using these
forms, which the runtime would then refuse to load.

New design: print the source instruction in canonical asm form via
LLVM's MCInstPrinter::printInst, then perform textual surgery on the
result to produce each split half. The splitter never has to
reproduce the printer's per-operand formatting decisions (FP inline
constant 1.0 vs integer 1 -- which encode at distinct inline-const
slots 242 vs 1 -- nor modifier suffix ordering). Any input the
printer accepts is preserved verbatim modulo the per-half
transformations:

  - K-split first half: original operand list with src0/src1 sliced
    to the lower halves; src2 and modifier suffix preserved verbatim.
  - K-split second half: src0/src1 sliced to the upper halves; src2
    replaced with the dst register (the accumulator carry); modifier
    suffix has the src2-bit cleared in neg_lo:[X,Y,Z] / neg_hi:[X,Y,Z]
    (the operand at the src2 slot is no longer the original src2).
    matrix_a_reuse / matrix_b_reuse are stripped on both halves
    because they assert HW reuse-buffer validity that no longer
    holds across an instruction-stream rewrite.
  - M-split halves: dst, src0, src2 (when VGPR) sliced to lower /
    upper halves; src1 broadcast; modifier suffix preserved on both
    halves with matrix_a_reuse / matrix_b_reuse stripped; the
    splitter-added MATRIX_FMT_FP4 modifiers stay (the f8f6f4
    destination opcode requires them and the f4 source does not
    carry them).

The modifier-suffix tokenizer and packed-modifier parser
(parsePackedModifier, transformModifierSuffix) understand the
`<name>:[X,Y,Z]` shape used by neg_lo / neg_hi / op_sel and the
bareword shape used by matrix_a_reuse / matrix_b_reuse / clamp.
Other unrecognized modifier tokens (op_sel, clamp, future
additions) are passed through unchanged on first half and on
M-split halves; the K-split second half passes them through too
(only neg_lo / neg_hi are clearable).

The defensive operand-count and operand-kind validation in
extractWmmaOps remains, so any TableGen-driven layout drift fails
loudly with "VOP3P layout drift -- update the table" rather than
silently miscompiling.

New LIT test hotswap-wmma-split-modifiers.s exercises the full
contract: K-splits with FP 1.0, neg_lo:[0,0,1], neg_hi:[0,0,1],
matrix_a_reuse, matrix_b_reuse; M-splits with FP 1.0 and
neg_lo:[0,0,1]. Each input form is from the upstream MC test's
known-supported syntax for these opcodes; each DAG check verifies
the exact post-split operand and modifier shape per the documented
transformation rules. The obsolete hotswap-wmma-split-refuse.s LIT
test (which asserted refusal of non-zero src2 imm) is removed --
non-zero src2 imm is now correctly handled.

Verified: all 16 hotswap LIT tests pass.
@z1-cciauto
Copy link
Copy Markdown
Collaborator

Comment on lines +90 to +96
// Each CHECK-DAG below asserts both the replacement mnemonic AND that the
// assembler emitted at least one form with the inline immediate 0 as $src2
// -- the trailing literal `, 0` is what we fed in. For K-splits the second
// half uses dst as $src2 (the accumulator carry), so each pattern matches
// exactly the first half of its split. For the M-split (FP4) both halves
// use the same imm, but one match per mnemonic is enough to prove the
// splitter propagated the imm into the rewritten code.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer explicit tests that check for this behavior. You can separate out individual tests and check that the various aspects of the splitting work as intended. For instance using the src2 as the dst.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The block of DISASM-DAG check lines is not something I am a fan of

s_endpgm
.size test_m_neg_lo, .-test_m_neg_lo

// -- Trampoline region: assertions on the rewritten output ------------------
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same for these tests

if (T.Bytes.empty()) {
log() << "hotswap: error: WMMA split: trampoline assembly failed for "
<< DI.Mnemonic << "\n";
return 0;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a general problem across all patch PRs that we return "0" on failed patching. Error logging is okay, but we need to refactor these APIs to return success/failure more cleanly

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For instance, there is no clear way to know if -

a) we found the instruction that needs to be patched but patch was not applied
b) the problematic instruction was never found

jammm added 2 commits May 7, 2026 09:47
Addresses chinmaydd's review on PR #2379:
  > "I would prefer explicit tests that check for this behavior. You
  > can separate out individual tests and check that the various
  > aspects of the splitting work as intended. For instance using the
  > src2 as the dst."
  > "The block of DISASM-DAG check lines is not something I am a fan of"

Replaces the two consolidated tests
(hotswap-wmma-split-{imm-src2,modifiers}.s -- which packed multiple
aspects into one file with a trailing DISASM-DAG block) with nine
focused per-aspect tests, each exercising ONE source kernel and
asserting EXACT post-split operand+modifier shape via ordered
DISASM: / DISASM-NEXT: checks (no DAG):

  - hotswap-wmma-split-int-imm.s     -- src2 = inline int 0
                                        (compiler-folded zero accumulator)
  - hotswap-wmma-split-fp-imm.s      -- src2 = inline FP 1.0 in K-split
                                        (printer-round-trip preserves the
                                        FP slot vs. integer slot
                                        distinction in VOP3P inline-const
                                        encoding)
  - hotswap-wmma-split-neg-lo-k.s    -- neg_lo:[0,0,1] preserved on
                                        first half, dropped on second
                                        half (src2 := dst, no negation)
  - hotswap-wmma-split-neg-hi-k.s    -- same shape for neg_hi
  - hotswap-wmma-split-matrix-a-reuse.s -- matrix_a_reuse stripped on
                                        both halves (HW reuse-buffer
                                        assertion no longer holds after
                                        the data is sliced)
  - hotswap-wmma-split-matrix-b-reuse.s -- matrix_b_reuse stripped, same
  - hotswap-wmma-split-msplit-fp-imm.s -- M-split with FP 1.0 src2:
                                        BOTH halves preserve the imm
                                        (no carry on the M axis) plus
                                        the splitter-added MATRIX_FMT_FP4
                                        modifiers required by the f8f6f4
                                        destination opcode
  - hotswap-wmma-split-msplit-neg-lo.s -- M-split with neg_lo: BOTH
                                        halves preserve the modifier
  - hotswap-wmma-split-src2-eq-dst.s -- the C += A*B accumulator-reuse
                                        pattern (vdst == src2 at the
                                        source level), which exercises
                                        the second-half src2 := dst
                                        transformation as a no-op
                                        (chinmaydd's specific suggestion)

Each new file: ONE source kernel, ONE aspect, exact-operand DISASM
checks rather than a DAG block. The trampoline body's two halves are
emitted back-to-back, so DISASM-NEXT works for the second half.

Verified locally: all 9 new per-aspect tests pass, and the existing
hotswap-wmma-split.s smoke test (which legitimately remains a DAG-
style multi-variant coverage check, since asserting all 9 splittable
mnemonics appear regardless of emission order is what that test is
about) continues to pass.
chinmaydd flagged on PR #2379 that the shared dispatcher API in
comgr-hotswap-b0a0.cpp returns uint32_t where 0 means EITHER
"this patch did not match the instruction" (correct dispatcher
fall-through) OR "matched but failed to apply" (silent miscompile
risk -- the rewriter returns SUCCESS with the original
A0-incompatible opcode left in .text). They identified this as a
cross-cutting concern across all hotswap patches (in-place #2222,
WMMA hazard #2265, WMMA split, future patches).

A proper fix is a separate follow-up upstream PR that refactors the
dispatcher to distinguish the two cases (e.g. an enum NoMatch /
Patched / Failed return type, or a `bool *Aborted` field threaded
through PatchContext that the dispatcher checks and uses to
short-circuit the rewrite with AMD_COMGR_STATUS_ERROR).

For now: classify each return-0 path in the splitter as either
"did not match" (correct) or "matched but failed" (currently silent),
add a header comment to applyWmmaSplitPatches documenting the
semantics + the cross-PR refactor concern, and tag every
matched-but-failed exit with an inline comment so it's clear which
return is the silent-miscompile risk.

The behavior is unchanged; this is documentation of an existing
limitation pending the cross-PR API refactor.
@z1-cciauto
Copy link
Copy Markdown
Collaborator

@jammm
Copy link
Copy Markdown
Author

jammm commented May 7, 2026

@chinmaydd updated PR based on your comments. PTAL ^^

@chinmaydd
Copy link
Copy Markdown

chinmaydd commented May 8, 2026

Disclaimer: I've used Claude Opus 4.7 for my review. Its inspired from some of my constraints, but I believe most of them should be a part of AGENT_CONVENTIONS.md.

They should be actionable by another agent IMO.

Mostly looks good. We discussed in today's meetings that adding tests should be considered important here.

The longer term plan involves a broad refactor and tests will help us get there faster and with more confidence.


Blocker

hotswap-wmma-split.s fails deterministically (exit 141) when run in isolation. The awk '/\.text /{print $7; exit}' calls (lines 31–33) close stdin before llvm-readelf finishes writing, and LIT's pipefail shell propagates the SIGPIPE. Reproduces 9/10 in standalone runs; it only passes in the bulk LIT run because output buffering shifts the race.

$ for i in $(seq 10); do build/bin/llvm-lit build/tools/comgr/test-lit/hotswap-wmma-split.s 2>&1 | grep -E "Failed|Passed" | tail -1; done
Failed: 1 (×9), Passed: 1 (×1)

Fix: drop the exit from awk (the file has exactly one .text section so the result is unchanged), or follow the FileCheck pattern used by hotswap-elf-growth.s.

Test-data quality

Almost every K-split test uses input that violates @earlyclobber $vdst on the source WMMA pseudo (VOP3PInstructions.td:1444), so it can't be compiler-generated:

Test dst src1 overlap
hotswap-wmma-split.s lines 46/85/98/111 v[16:23] v[8:23] v[16:23]
hotswap-wmma-split.s lines 59/124/137/150 v[16:19] v[8:23] v[16:19]
hotswap-wmma-split.s line 72 (M-split) v[4:19] v[2:9] v[4:9]
hotswap-wmma-split-{src2-eq-dst,fp-imm,int-imm}.s v[16:23] v[8:23] v[16:23]
hotswap-wmma-split-neg-hi-k.s v[20:23] v[8:23] v[20:23]

The K-split second half is WMMA dst, A_hi, B_hi, dst where B_hi = upper half of original src1. When that upper half coincides with dst (the case the failing tests pick), the second half reads B_hi from registers the first half just clobbered with the partial product — and the CHECK-DAGs on lines 184–185 lock that in as the contract. The exact-slicing checks would still pass on a real off-by-one in the slicer because the overlap makes the operands incidentally textually identical. Please re-pick disjoint VGPRs (the matrix_a_reuse / neg_lo_k / msplit_* tests already use the right pattern).

Smaller items

  • transformModifierSuffix only knows about matrix_*_reuse / neg_lo / neg_hi. Any future modifier (op_sel, byte_sel, clamp) on these opcodes would silently pass through unchanged on both halves. Either assert the modifier set is closed or table-drive it per SplitKind.
  • parsePackedModifier allocates a std::string Prefix per call — the prefix is statically known, hoist as a StringLiteral.
  • Ctx.OutTrampolines.emplace_back(std::move(T)) (line 644) — push_back for an already-constructed object per LLVM style; the hazard patch uses push_back for the same operation.
  • buildTrampoline and fixupTrampolineBranches both encode the branch-back; the second overwrites identical bytes. Same redundancy in the hazard patch — likely a buildTrampoline cleanup that should reserve MinInstSize zero bytes the way emitToTrampoline does.
  • The PR description mentions a --allow-grow flag but hotswap-rewrite.c still has only --zero-size / --output. Stale or removed?

Coverage gaps worth adding:

  • A multi-WMMA-per-kernel test to exercise TrampTextOffset accumulation.
  • WMMA-split + WMMA-hazard interaction on the same kernel.

jammm added 3 commits May 8, 2026 02:32
…anups

Address PR #2379 review (chinmaydd):

- transformModifierSuffix: enforce a closed set of modifier tokens
  (matrix_a_reuse, matrix_b_reuse, neg_lo:[X,Y,Z], neg_hi:[X,Y,Z]) and
  return std::nullopt on anything else, so a future K=128/M=32 source
  mnemonic that grew an unaudited modifier surfaces as a fail-soft
  matched-but-failed instead of silently carrying the modifier through
  both halves. Callers in buildSplit128to64Asm / buildSplit32x16Asm
  propagate by returning an empty replacement vector.
- parsePackedModifier: drop the per-call (Name + ":[").str() heap
  allocation in favor of piecewise StringRef checks; this runs once per
  modifier per split half, and the prefix is always a literal
  ("neg_lo" / "neg_hi") so the materialization was pure overhead.
- applyWmmaSplitPatches: replace the OutTrampolines.emplace_back(...)
  call with push_back(std::move(T)) since the argument is already a
  Trampoline rvalue and there's no in-place construction to optimize.
…rlyclobber, add multi-WMMA

Address PR #2379 review (chinmaydd):

- hotswap-wmma-split.s: drop `exit` from the
  `awk '/\.text /{print $7}'` one-liner that captures the .text section
  size from llvm-readelf -S. With `exit`, awk closes its stdin before
  llvm-readelf finishes writing; LIT's pipefail shell propagates the
  resulting SIGPIPE and the size-grew assertion fails
  non-deterministically in standalone runs (only passes in the bulk LIT
  run because output buffering shifts the race). The `\.text ` pattern
  matches at most one section header per ELF, so removing `exit` does
  not change the captured value.
- hotswap-wmma-split{,-src2-eq-dst,-fp-imm,-int-imm,-neg-hi-k}.s:
  re-pick disjoint VGPR ranges (src0=v[0:15], src1=v[16:31], dst at
  v[32:39] f32 / v[32:35] f16) so dst no longer overlaps src1. The
  source pseudo carries `@earlyclobber $vdst` (VOP3PInstructions.td:1444),
  so compiler-generated WMMAs cannot land in the previous shape; the
  test inputs now mirror that contract. The matrix_*_reuse / neg_lo_k
  / msplit_* tests already used the right pattern and are unchanged.
  hotswap-wmma-split.s gets a header comment explaining the contract
  so the disjoint-VGPR choice is self-documenting.
- hotswap-wmma-split-multi-wmma.s (new): two K=128 WMMAs in one
  kernel, asserts both trampolines land in source order with the
  expected K=64 mnemonics. Smallest input that exercises the
  >1-trampoline path (TrampTextOffset accumulation in
  applyWmmaSplitPatches) -- a regression in the offset accumulation
  would land the second trampoline's s_branch-back at the wrong target.
…on coverage

Address PR #2379 review (chinmaydd) coverage gap: the WMMA-split and
WMMA-hazard passes both append to Ctx.OutTrampolines and both compute
their trampoline's `.text` offset by walking the previously appended
trampolines (patch-wmma-split.cpp:664-666 and
patch-wmma-hazard.cpp:184-186). If either pass forgets to walk
OutTrampolines when computing TrampolineTextOffset, the s_branch-back
at the tail of one trampoline lands at the wrong target.

The two passes never overlap on a single instruction (the splitter
only fires on K=128 fp8/bf8/f4 *float* WMMAs; the hazard pass only
fires on WMMA *integer* opcodes), but both can fire in the same
kernel. Putting both in one kernel and asserting -- in trampoline
emission order -- that the split halves land first and the hazard
trampoline's 8 v_nops + relocated VALU land *after* exercises the
cross-pass offset accumulation.

Per-instruction passes run before applyWmmaHazardPatch
(comgr-hotswap-b0a0.cpp:323-361), so the split trampoline is appended
first and the hazard trampoline second. Asserting in order (DISASM,
not DISASM-DAG) catches a swap or a missing trampoline that DAG would
mask.
jammm added a commit that referenced this pull request May 8, 2026
…anups

Address PR #2379 review (chinmaydd):

- transformModifierSuffix: enforce a closed set of modifier tokens
  (matrix_a_reuse, matrix_b_reuse, neg_lo:[X,Y,Z], neg_hi:[X,Y,Z]) and
  return std::nullopt on anything else, so a future K=128/M=32 source
  mnemonic that grew an unaudited modifier surfaces as a fail-soft
  matched-but-failed instead of silently carrying the modifier through
  both halves. Callers in buildSplit128to64Asm / buildSplit32x16Asm
  propagate by returning an empty replacement vector.
- parsePackedModifier: drop the per-call (Name + ":[").str() heap
  allocation in favor of piecewise StringRef checks; this runs once per
  modifier per split half, and the prefix is always a literal
  ("neg_lo" / "neg_hi") so the materialization was pure overhead.
- applyWmmaSplitPatches: replace the OutTrampolines.emplace_back(...)
  call with push_back(std::move(T)) since the argument is already a
  Trampoline rvalue and there's no in-place construction to optimize.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
jammm added a commit that referenced this pull request May 8, 2026
…rlyclobber, add multi-WMMA

Address PR #2379 review (chinmaydd):

- hotswap-wmma-split.s: drop `exit` from the
  `awk '/\.text /{print $7}'` one-liner that captures the .text section
  size from llvm-readelf -S. With `exit`, awk closes its stdin before
  llvm-readelf finishes writing; LIT's pipefail shell propagates the
  resulting SIGPIPE and the size-grew assertion fails
  non-deterministically in standalone runs (only passes in the bulk LIT
  run because output buffering shifts the race). The `\.text ` pattern
  matches at most one section header per ELF, so removing `exit` does
  not change the captured value.
- hotswap-wmma-split{,-src2-eq-dst,-fp-imm,-int-imm,-neg-hi-k}.s:
  re-pick disjoint VGPR ranges (src0=v[0:15], src1=v[16:31], dst at
  v[32:39] f32 / v[32:35] f16) so dst no longer overlaps src1. The
  source pseudo carries `@earlyclobber $vdst` (VOP3PInstructions.td:1444),
  so compiler-generated WMMAs cannot land in the previous shape; the
  test inputs now mirror that contract. The matrix_*_reuse / neg_lo_k
  / msplit_* tests already used the right pattern and are unchanged.
  hotswap-wmma-split.s gets a header comment explaining the contract
  so the disjoint-VGPR choice is self-documenting.
- hotswap-wmma-split-multi-wmma.s (new): two K=128 WMMAs in one
  kernel, asserts both trampolines land in source order with the
  expected K=64 mnemonics. Smallest input that exercises the
  >1-trampoline path (TrampTextOffset accumulation in
  applyWmmaSplitPatches) -- a regression in the offset accumulation
  would land the second trampoline's s_branch-back at the wrong target.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
jammm added a commit that referenced this pull request May 8, 2026
…on coverage

Address PR #2379 review (chinmaydd) coverage gap: the WMMA-split and
WMMA-hazard passes both append to Ctx.OutTrampolines and both compute
their trampoline's `.text` offset by walking the previously appended
trampolines (patch-wmma-split.cpp:664-666 and
patch-wmma-hazard.cpp:184-186). If either pass forgets to walk
OutTrampolines when computing TrampolineTextOffset, the s_branch-back
at the tail of one trampoline lands at the wrong target.

The two passes never overlap on a single instruction (the splitter
only fires on K=128 fp8/bf8/f4 *float* WMMAs; the hazard pass only
fires on WMMA *integer* opcodes), but both can fire in the same
kernel. Putting both in one kernel and asserting -- in trampoline
emission order -- that the split halves land first and the hazard
trampoline's 8 v_nops + relocated VALU land *after* exercises the
cross-pass offset accumulation.

Per-instruction passes run before applyWmmaHazardPatch
(comgr-hotswap-b0a0.cpp:323-361), so the split trampoline is appended
first and the hazard trampoline second. Asserting in order (DISASM,
not DISASM-DAG) catches a swap or a missing trampoline that DAG would
mask.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
@jammm jammm force-pushed the users/jam/wmma-split-gfx1250-staging branch from 5cc6549 to 7165ca7 Compare May 8, 2026 12:23
@jammm jammm requested a review from chinmaydd May 8, 2026 12:24
@z1-cciauto
Copy link
Copy Markdown
Collaborator

Comment thread amd/comgr/src/comgr-hotswap-patch-wmma-split.cpp Outdated
Comment thread amd/comgr/src/comgr-hotswap-patch-wmma-split.cpp Outdated
Copy link
Copy Markdown

@chinmaydd chinmaydd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with nits, thanks !

…mment)

Two nits from chinmaydd's 2026-05-08T15:32 review:

- Spell out std::optional<std::string> at every use-site in
  buildSplit128to64Asm / buildSplit32x16Asm instead of `auto`. The
  `auto` had crept back in when I propagated transformModifierSuffix's
  std::optional return through the callers; the explicit type makes
  the deref on the next line unambiguous to the reader.
- Refresh the matched-but-failed comment on the AsmLines.empty() check
  in applyWmmaSplitPatches: build*Asm now legitimately returns empty
  when transformModifierSuffix rejects an unsupported modifier (the
  closed-set check from the prior cleanup commit), so calling that
  branch "defensive" is misleading.
@z1-cciauto
Copy link
Copy Markdown
Collaborator

Comment on lines +115 to +134
constexpr SplitRow SplitTable[] = {
{"v_wmma_f16_16x16x128_fp8_fp8", SplitKind::Split128to64FP8BF8,
"v_wmma_f16_16x16x64_fp8_fp8"},
{"v_wmma_f16_16x16x128_fp8_bf8", SplitKind::Split128to64FP8BF8,
"v_wmma_f16_16x16x64_fp8_bf8"},
{"v_wmma_f16_16x16x128_bf8_fp8", SplitKind::Split128to64FP8BF8,
"v_wmma_f16_16x16x64_bf8_fp8"},
{"v_wmma_f16_16x16x128_bf8_bf8", SplitKind::Split128to64FP8BF8,
"v_wmma_f16_16x16x64_bf8_bf8"},
{"v_wmma_f32_16x16x128_fp8_fp8", SplitKind::Split128to64FP8BF8,
"v_wmma_f32_16x16x64_fp8_fp8"},
{"v_wmma_f32_16x16x128_fp8_bf8", SplitKind::Split128to64FP8BF8,
"v_wmma_f32_16x16x64_fp8_bf8"},
{"v_wmma_f32_16x16x128_bf8_fp8", SplitKind::Split128to64FP8BF8,
"v_wmma_f32_16x16x64_bf8_fp8"},
{"v_wmma_f32_16x16x128_bf8_bf8", SplitKind::Split128to64FP8BF8,
"v_wmma_f32_16x16x64_bf8_bf8"},
{"v_wmma_f32_32x16x128_f4", SplitKind::Split32x16to16x16F4,
"v_wmma_f32_16x16x128_f8f6f4"},
};
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is used as a map to do a lookup. I think we should use a StringMap that maps Mnemonic to {Kind, Replacement}.

Comment on lines +136 to +140
struct SplitMatch {
SplitKind Kind;
StringRef Replacement;
bool Matched = false;
};
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any case where Kind and Replacement are set and Match is false ?

I think this should be an optional<{Kind, Replacement}>.

std::pair<int, int> Src1{-1, 0};
std::pair<int, int> Src2{-1, 0}; // valid only when Src2IsImm == false
bool Src2IsImm = false;
bool Valid = false;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we use the any of the other fields on an invalid WmmaOps? Cause otherwise, I'd remove the Valid field and use an optional instead (then, a WmmaOps would always be valid, or be a nullopt optional).

…td::optional, switch lookup table to StringMap

Address jmmartinez's three structural-cleanup nits on PR #2379:

- Convert SplitTable from a constexpr array of {Mnemonic, Kind,
  Replacement} rows + linear search to a StringMap<SplitRule> built
  once per process via a function-local static. The map is the natural
  representation for "look up by mnemonic" and reads more clearly at
  the call site.
- Drop the redundant `Matched` flag on the lookup result. lookupSplitRule
  now returns std::optional<SplitRule>; absence means no match, presence
  means a valid {Kind, Replacement} pair. The caller's
  `if (!Match.Matched)` becomes `if (!Match)` and field accesses
  become `Match->Kind` / `Match->Replacement`.
- Drop the redundant `Valid` flag on WmmaOps. extractWmmaOps now
  returns std::optional<WmmaOps>; every previous "return R;" on a
  failure path becomes "return std::nullopt;" and the success path
  returns the (now always-valid) struct directly. Caller's
  `if (!Ops.Valid)` becomes `if (!Ops)` and downstream uses pass `*Ops`
  to validateSplitOperands / buildSplit*Asm (those helpers take
  `const WmmaOps &` and never read the dropped Valid field).

No behavior change: the lookup table contents are identical, the
failure paths log the same messages, and the same instructions get
patched in the same way. All 25 hotswap LIT tests still pass.
@jammm
Copy link
Copy Markdown
Author

jammm commented May 12, 2026

@jmmartinez just addressed your comments in the latest push. PTAL :)
@chinmaydd the PR should be good to merge.

@jammm jammm requested a review from jmmartinez May 12, 2026 07:58
Comment thread amd/comgr/src/comgr-hotswap-patch-wmma-split.cpp Outdated
…ctor

Address jmmartinez's follow-up nit on PR #2379: StringMap has an
initializer-list constructor (StringMap.h:151), so the lambda +
try_emplace pattern from the previous commit was unnecessarily
verbose. Switch to direct brace-initialization of the function-local
static.

No behavior change. All 25 hotswap LIT tests still pass.
Copy link
Copy Markdown

@jmmartinez jmmartinez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@chinmaydd chinmaydd enabled auto-merge (squash) May 12, 2026 14:27
@chinmaydd chinmaydd merged commit 75c6512 into amd-staging May 12, 2026
24 checks passed
@chinmaydd chinmaydd deleted the users/jam/wmma-split-gfx1250-staging branch May 12, 2026 15:01
jammm added a commit that referenced this pull request May 13, 2026
…ound

On GFX1250 A0 silicon, an async trap fired between LD_SCALE and the
WMMA half of a VOP3PX2 pair leaves the wave with corrupted scale
state. The trap handler can rewind the PC for known-paired forms
(via the trap-handler PC-rewind path that recognizes 0xCC33 and
0xCC88 opcode high-bytes, landed in rocm-systems), but a standalone
v_wmma_f32_16x16x128_f8f6f4 (no preceding LD_SCALE) cannot be safely
rewound -- the trap handler unconditionally walks back 8 bytes, which
on an unpaired f8f6f4 lands in arbitrary preceding-instruction bytes.

This patch prepends an inline-0 LD_SCALE prefix to every standalone
f8f6f4 WMMA, turning it into a fused VOP3PX2 with scale=1.0 (a no-op
for the WMMA result). The 8-byte prefix also has scale_src2 = VGPR0
(0x100) baked in -- the VOP3PX2 scale_src2 field is architecturally
unused, but if left at 0 the SQ mis-decodes it as an SGPR reference
and stalls the SALU for 3 cycles; setting it to a VGPR encoding
eliminates the false dependency. Same workaround the in-place
vop3px2-src2 patch applies to user-emitted VOP3PX2 instructions;
baking it into the wrap pass's prefix bytes keeps wrap-emitted
trampolines stall-free at creation.

Two-pass operation:
  1. Decoded[] scan -- wraps user-written standalone WMMAs.
  2. Trampoline scan -- wraps splitter-emitted WMMAs sitting in
     trampoline bytes (the K=128 32x16x128_f4 splitter emits f8f6f4
     into trampolines per #2379).

Defensive checks:
  - Refuses to retarget if any 0xCC88 (v_wmma_f32_32x16x128_f4)
    leftover exists in Decoded[]; the K=128 splitter must eliminate
    these before the wrap pass runs since
    V_WMMA_SCALE_F32_32X16X128_F4 doesn't exist on A0 (only on B0+
    per MI400 SPG p.145).
  - Already-wrapped detection skips wrapping if the preceding decoded
    instruction is a SCALE form (0xCC35 or 0xCC3A); per ISA doc
    p.158 the SCALE prefix must be contiguous with the WMMA.

Wired into b0a0.cpp dispatcher AFTER all per-instruction patches and
after applyVop3px2Src2Fix, so the wrap pass sees the final
post-split trampoline layout. The wrap pass produces VOP3PX2
instructions with scale_src2 already set to 0x100, so applyVop3px2Src2Fix
has no work on the trampoline-emitted forms.

LIT tests:
  - hotswap-vop3px-wrap.s: 5 cases (bare, FP8/FP8 explicit, FP6/FP4,
    already-wrapped, K=128 split-then-wrap chain).
  - hotswap-wmma-split{,-msplit-fp-imm,-msplit-neg-lo}.s: CHECK lines
    updated to expect the post-wrap disassembly (v_wmma_scale_*
    instead of bare f8f6f4) since the wrap pass now wraps splitter
    trampoline outputs.
  - hotswap-vop3px2-src2-noop.s: comments updated to describe the
    new wrap interaction; the in-place vop3px2-src2 patch still has
    nothing to do here (the wrap pass's prefix bytes already carry
    scale_src2 = VGPR0).
jammm added a commit that referenced this pull request May 13, 2026
Wraps every standalone v_wmma_f32_16x16x128_f8f6f4 into a fused
VOP3PX2 by prepending an inline-0 LD_SCALE prefix (numerically a
no-op, scale=1.0). The 8-byte prefix is byte-identical for every
wrap site, so the WMMA portion stays bit-identical and modifier-rich
operand layouts (matrix_a_fmt, matrix_b_fmt, neg_lo, neg_hi,
matrix_a_reuse, matrix_b_reuse, ...) are preserved without
re-encoding. The prefix bytes also carry scale_src2 = VGPR0 (0x100),
matching the encoding the in-place vop3px2-src2 patch produces, so
no second pass is needed on wrap-emitted forms.

Two-pass operation:
  1. Decoded[] scan -- wraps user-written standalone WMMAs.
  2. Trampoline scan -- wraps splitter-emitted WMMAs sitting in
     trampoline bytes (the K=128 32x16x128_f4 splitter from #2379
     emits f8f6f4 into trampolines).

Defensive checks:
  - Refuses to retarget if any v_wmma_f32_32x16x128_f4 (0xCC88)
    leftover exists in Decoded[] -- the K=128 splitter must
    eliminate these before the wrap pass runs.
  - Already-wrapped detection: skips wrapping if the preceding
    decoded instruction is a SCALE form (per ISA doc p.158, the
    SCALE prefix must be contiguous with the WMMA).

Wired into b0a0.cpp dispatcher AFTER all per-instruction patches
and after applyVop3px2Src2Fix, so the wrap pass sees the final
post-split trampoline layout.

Pairs with the existing trap-handler split path in rocm-systems
(commit 74c647e6605, "rocr: GFX12.5 : VOP3PX instruction split in
trap handler").

LIT tests:
  - hotswap-vop3px-wrap.s: 5 cases (bare, FP8/FP8 explicit, FP6/FP4,
    already-wrapped, K=128 split-then-wrap chain).
  - hotswap-wmma-split{,-msplit-fp-imm,-msplit-neg-lo}.s: CHECK lines
    updated to expect post-wrap disassembly (v_wmma_scale_* instead
    of bare f8f6f4) since the wrap pass now wraps splitter outputs.
  - hotswap-vop3px2-src2-noop.s: comments updated to describe the
    new wrap interaction.
jammm added a commit that referenced this pull request May 13, 2026
Wraps every standalone v_wmma_f32_16x16x128_f8f6f4 into a fused
VOP3PX2 by prepending an inline-0 LD_SCALE prefix (numerically a
no-op, scale=1.0). The 8-byte prefix is byte-identical for every
wrap site, so the WMMA portion stays bit-identical and modifier-rich
operand layouts (matrix_a_fmt, matrix_b_fmt, neg_lo, neg_hi,
matrix_a_reuse, matrix_b_reuse, ...) are preserved without
re-encoding. The prefix bytes also carry scale_src2 = VGPR0 (0x100),
matching the encoding the in-place vop3px2-src2 patch produces, so
no second pass is needed on wrap-emitted forms.

Two-pass operation:
  1. Decoded[] scan -- wraps user-written standalone WMMAs.
  2. Trampoline scan -- wraps splitter-emitted WMMAs sitting in
     trampoline bytes (the K=128 32x16x128_f4 splitter from #2379
     emits f8f6f4 into trampolines).

Defensive checks:
  - Refuses to retarget if any v_wmma_f32_32x16x128_f4 (0xCC88)
    leftover exists in Decoded[] -- the K=128 splitter must
    eliminate these before the wrap pass runs.
  - Already-wrapped detection: skips wrapping if the preceding
    decoded instruction is a SCALE form (per ISA doc p.158, the
    SCALE prefix must be contiguous with the WMMA).

Wired into b0a0.cpp dispatcher AFTER all per-instruction patches
and after applyVop3px2Src2Fix, so the wrap pass sees the final
post-split trampoline layout.

Pairs with the existing trap-handler split path in rocm-systems
(commit 74c647e6605, "rocr: GFX12.5 : VOP3PX instruction split in
trap handler").

LIT tests:
  - hotswap-vop3px-wrap.s: 5 cases (bare, FP8/FP8 explicit, FP6/FP4,
    already-wrapped, K=128 split-then-wrap chain).
  - hotswap-wmma-split{,-msplit-fp-imm,-msplit-neg-lo}.s: CHECK lines
    updated to expect post-wrap disassembly (v_wmma_scale_* instead
    of bare f8f6f4) since the wrap pass now wraps splitter outputs.
  - hotswap-vop3px2-src2-noop.s: comments updated to describe the
    new wrap interaction.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci:skip Skip all CI builds/tests for this PR comgr Related to Code Object Manager hotswap Related to the Comgr Hotswap feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants