[AMDGPU] comgr: HotSwap in-place patches for B0-to-A0 rewriting#2222
Conversation
66d85c8 to
2d2d988
Compare
|
this has ci:skip tag, so i cancelled the npsdb and rockCI. if you want it tested please remove the tag |
2d2d988 to
d3483bc
Compare
d3483bc to
abe66e1
Compare
abe66e1 to
d1a2e17
Compare
|
Note: We will wait for #2291 to land. Rebase, refactor and wait for ci to be green before landing this. |
|
For the test, can we update the LIT file to be a .s file? We can create multiple .s files if needed. Also, those can be compiled with clang, and run through the existing hotswap-rewrite, instead of creating a separate hotswap-inplace.c driver Roughly like this — the test file becomes a self-contained .s with RUN lines at the top: A few practical notes:
For the 3 test cases (cluster_load_b32+b128+s_clause, no-op kernel, cluster_load_b64), you'd have three separate .s files — hotswap-inplace-mixed.s, hotswap-inplace-noop.s, hotswap-inplace-b64.s — each with their own RUN/CHECK blocks. |
d1a2e17 to
ca7d489
Compare
This is a good suggestion. PR #2291 introduces the infra for this. After rebase on #2291, the only additional change needed to hotswap-rewrite.c will be to move the dump before the memcmp and skipping equality checks when Update: addressed after landing #2291. |
f0565f9 to
fddac1e
Compare
PR #2203 introduced 15 weak-symbol patch / liveness / DWARF stubs in comgr-hotswap-b0a0.cpp using GCC/Clang attribute syntax: __attribute__((weak)) uint32_t applyInPlacePatches(PatchContext &, size_t) { return 0; } MSVC does not understand this. It tokenizes __attribute__ as an identifier and `weak` as undeclared, then errors on every weak stub: comgr-hotswap-b0a0.cpp(95): error C2065: 'weak': undeclared identifier comgr-hotswap-b0a0.cpp(95): error C2374: '__attribute__': redefinition [...repeats for every weak stub at lines 95, 98, 101, 104, 107, 117, 123, 135, 141, 143, 152, 157, 162, 164, 166...] Replace with LLVM's portable LLVM_ATTRIBUTE_WEAK macro from llvm/Support/Compiler.h. On Linux/macOS it expands to __attribute__((__weak__)) (current behaviour preserved); on Windows it expands to nothing (compiles fine, no linker conflict because b0a0.cpp is currently the only TU defining these names). This is the immediate Windows unblock, not a complete fix for the underlying override design. When follow-up patch PRs (#2222 in-place, future trampoline / WMMA / scratch) land their strong-symbol overrides, Windows will need a different mechanism (e.g. function-pointer registration in a static initializer, or __declspec(selectany), or #ifdef-out the patch overrides on MSVC). Tracked separately. Failing Windows job from PR #2203: https://github.com/ROCm/llvm-project/actions/runs/24758866527/job/72438109210
fddac1e to
79dc77a
Compare
Signed-off-by: xintin <gaurav.verma@amd.com>
Signed-off-by: xintin <gaurav.verma@amd.com>
Signed-off-by: xintin <gaurav.verma@amd.com>
79dc77a to
1ad8ed9
Compare
|
Do we need a restart of other build jobs ? Though we should have the ability to merge once the PSDB passes |
|
|
1ad8ed9 to
4340ade
Compare
Signed-off-by: xintin <gaurav.verma@amd.com>
4340ade to
72c5252
Compare
… A0 (#2287) ## Summary A0 silicon has a race condition where `s_barrier_signal_isfirst` with a user cluster barrier ID may return garbage in SCC before the barrier completes. The non-isfirst variant (`s_barrier_signal`) has the same encoding size and operand layout but does not write SCC. This patch adds the `s_barrier_signal_isfirst` -> `s_barrier_signal` in-place opcode swap to the existing `applyInPlacePatches` dispatcher alongside the `cluster_load -> global_load` and `s_clause -> s_nop` rules from #2222. The swap is unconditional for all `s_barrier_signal_isfirst` instructions. Non-isfirst `s_barrier_signal` sites are left unchanged. ## Changes - **comgr-hotswap-patch-inplace.cpp**: Add `s_barrier_signal_isfirst` -> `s_barrier_signal` swap using the existing `resolveOpcode` / `swapOpcode` helpers. Same pattern as the `cluster_load` swap -- no new infrastructure. - **hotswap-barrier-isfirst.s**: Positive lit test - kernel with 2x `s_barrier_signal_isfirst -1`. Verifies both are swapped to `s_barrier_signal`, waits are unchanged, no `s_barrier_signal_isfirst` remains anywhere in disassembly. Idempotency check via second rewrite + `cmp`. - **hotswap-barrier-no-isfirst.s**: Passthrough lit test - kernel with only `s_barrier_signal` (workgroup -1 and user cluster -3). Verifies layout is structurally unchanged and no `s_barrier_signal_isfirst` appears. Idempotency check. - **hotswap-barrier-mixed.s**: Selectivity lit test - kernel with both `s_barrier_signal_isfirst` and `s_barrier_signal` interleaved. Verifies only the isfirst sites are swapped; the pre-existing non-isfirst site is preserved. Guards against a `starts_with` / `contains` regression in the mnemonic match. ## Tests - 3 lit tests (positive / passthrough / mixed selectivity) with strict CHECK-NEXT chains and CHECK-NOT brackets - All existing lit, ctest, and unit tests pass ## Follow up (already tracked) Cache all in-place target opcodes in LLVMState at init time (covers cluster_load, s_barrier_signal, and future in-place swaps). Currently each swap calls resolveOpcode per instruction site; caching at init eliminates redundant assemble+decode on every match. This is tracked as part of the hotswap code-refactoring effort alongside the others (#2253) --------- Signed-off-by: xintin <gaurav.verma@amd.com>
… A0 (#2287) ## Summary A0 silicon has a race condition where `s_barrier_signal_isfirst` with a user cluster barrier ID may return garbage in SCC before the barrier completes. The non-isfirst variant (`s_barrier_signal`) has the same encoding size and operand layout but does not write SCC. This patch adds the `s_barrier_signal_isfirst` -> `s_barrier_signal` in-place opcode swap to the existing `applyInPlacePatches` dispatcher alongside the `cluster_load -> global_load` and `s_clause -> s_nop` rules from #2222. The swap is unconditional for all `s_barrier_signal_isfirst` instructions. Non-isfirst `s_barrier_signal` sites are left unchanged. ## Changes - **comgr-hotswap-patch-inplace.cpp**: Add `s_barrier_signal_isfirst` -> `s_barrier_signal` swap using the existing `resolveOpcode` / `swapOpcode` helpers. Same pattern as the `cluster_load` swap -- no new infrastructure. - **hotswap-barrier-isfirst.s**: Positive lit test - kernel with 2x `s_barrier_signal_isfirst -1`. Verifies both are swapped to `s_barrier_signal`, waits are unchanged, no `s_barrier_signal_isfirst` remains anywhere in disassembly. Idempotency check via second rewrite + `cmp`. - **hotswap-barrier-no-isfirst.s**: Passthrough lit test - kernel with only `s_barrier_signal` (workgroup -1 and user cluster -3). Verifies layout is structurally unchanged and no `s_barrier_signal_isfirst` appears. Idempotency check. - **hotswap-barrier-mixed.s**: Selectivity lit test - kernel with both `s_barrier_signal_isfirst` and `s_barrier_signal` interleaved. Verifies only the isfirst sites are swapped; the pre-existing non-isfirst site is preserved. Guards against a `starts_with` / `contains` regression in the mnemonic match. ## Tests - 3 lit tests (positive / passthrough / mixed selectivity) with strict CHECK-NEXT chains and CHECK-NOT brackets - All existing lit, ctest, and unit tests pass ## Follow up (already tracked) Cache all in-place target opcodes in LLVMState at init time (covers cluster_load, s_barrier_signal, and future in-place swaps). Currently each swap calls resolveOpcode per instruction site; caching at init eliminates redundant assemble+decode on every match. This is tracked as part of the hotswap code-refactoring effort alongside the others (#2253) --------- Signed-off-by: xintin <gaurav.verma@amd.com>
chinmaydd flagged on PR #2379 that the shared dispatcher API in comgr-hotswap-b0a0.cpp returns uint32_t where 0 means EITHER "this patch did not match the instruction" (correct dispatcher fall-through) OR "matched but failed to apply" (silent miscompile risk -- the rewriter returns SUCCESS with the original A0-incompatible opcode left in .text). They identified this as a cross-cutting concern across all hotswap patches (in-place #2222, WMMA hazard #2265, WMMA split, future patches). A proper fix is a separate follow-up upstream PR that refactors the dispatcher to distinguish the two cases (e.g. an enum NoMatch / Patched / Failed return type, or a `bool *Aborted` field threaded through PatchContext that the dispatcher checks and uses to short-circuit the rewrite with AMD_COMGR_STATUS_ERROR). For now: classify each return-0 path in the splitter as either "did not match" (correct) or "matched but failed" (currently silent), add a header comment to applyWmmaSplitPatches documenting the semantics + the cross-PR refactor concern, and tag every matched-but-failed exit with an inline comment so it's clear which return is the silent-miscompile risk. The behavior is unchanged; this is documentation of an existing limitation pending the cross-PR API refactor.
…ompatibility (#2379) Adds the K=128 WMMA splitter as the next B0-to-A0 silicon-stepping workaround in the COMGR HotSwap subsystem, alongside the in-place patches landed in #2222. The splitter rewrites K=128 WMMA variants that exist on GFX1250 B0 but not A0 into pairs of narrower WMMAs that exist on both steppings, emitted as trampolines appended to `.text` by the existing hotswap rewriter. This PR plugs into the `applyWmmaSplitPatches` weak-symbol stub already declared in `comgr-hotswap-b0a0.cpp` (from the policy/API 3-of-3 series #2203). Same plug-in pattern as `comgr-hotswap-patch-inplace.cpp` (#2222): a single new `comgr-hotswap-patch-wmma-split.cpp` provides the strong override, and the integration is one CMakeLists.txt source-list line. No changes to the existing hotswap infrastructure files.
Provide the strong-symbol override for
applyInPlacePatches, handling two B0 instruction rewrites that fit in the original encoding size:No trampolines, ELF growth, or extra VGPRs required.
Changes
getClusterLoadReplacementAsmmaps 7 cluster_load mnemonics to global_load equivalents viaStringSwitch.applyInPlacePatchesdispatches cluster_load toswapOpcode(MCInst::setOpcode + MCCodeEmitter) and s_clause toapplyByteReplace.resolveOpcodereturnsstd::optional<unsigned>instead of a magic sentinel, addressing the error-handling consistency concern. Guarded with#if !defined(_MSC_VER)for Windows compatibility ([Comgr] Fix Windows build: use LLVM_ATTRIBUTE_WEAK for hotswap stubs #2294).--outputis specified, verify output size matches input size (in-place patches must preserve ELF size) then dump the rewritten ELF, skipping the memcmp equality check (patched content will differ from input). Backward-compatible with existing negative-path tests.hotswap-inplacelit binary..sto suffixes so lit discovers assembly-based test files..slit tests using%clangto assemble,hotswap-rewrite --outputto rewrite,%llvm-objdump -d | %FileCheckto verify, andcmpfor idempotency.test-lit/hotswap-inplace.c(old lit test) andcomgr-sources/hotswap-inplace.c(old C test driver). Tests now use the sharedhotswap-rewritebinary from [Comgr] Add end-to-end LIT coverage for amd_comgr_hotswap_rewrite #2291.Test plan
Lit tests (4 cases):
cmp.