Skip to content

[AMDGPU] comgr: HotSwap in-place patches for B0-to-A0 rewriting#2222

Merged
chinmaydd merged 4 commits into
ROCm:amd-stagingfrom
xintin:hotswap-inplace-patches
Apr 27, 2026
Merged

[AMDGPU] comgr: HotSwap in-place patches for B0-to-A0 rewriting#2222
chinmaydd merged 4 commits into
ROCm:amd-stagingfrom
xintin:hotswap-inplace-patches

Conversation

@xintin
Copy link
Copy Markdown

@xintin xintin commented Apr 16, 2026

Provide the strong-symbol override for applyInPlacePatches, handling two B0 instruction rewrites that fit in the original encoding size:

  • cluster_load_{b32,b64,b128} -> global_load_{b32,b64,b128} (opcode swap via MCInst::setOpcode + MCCodeEmitter)
  • cluster_load_async_to_lds_{b8,b32,b64,b128} -> global_load_async_to_lds_{b8,b32,b64,b128} (same)
  • s_clause -> s_nop (byte-level overwrite via applyByteReplace + cached SNopBytes)
    No trampolines, ELF growth, or extra VGPRs required.

Changes

  • comgr-hotswap-patch-inplace.cpp: getClusterLoadReplacementAsm maps 7 cluster_load mnemonics to global_load equivalents via StringSwitch. applyInPlacePatches dispatches cluster_load to swapOpcode (MCInst::setOpcode + MCCodeEmitter) and s_clause to applyByteReplace. resolveOpcode returns std::optional<unsigned> instead of a magic sentinel, addressing the error-handling consistency concern. Guarded with #if !defined(_MSC_VER) for Windows compatibility ([Comgr] Fix Windows build: use LLVM_ATTRIBUTE_WEAK for hotswap stubs #2294).
  • comgr-sources/hotswap-rewrite.c: when --output is specified, verify output size matches input size (in-place patches must preserve ELF size) then dump the rewritten ELF, skipping the memcmp equality check (patched content will differ from input). Backward-compatible with existing negative-path tests.
  • CMakeLists.txt: add new source to library build; remove hotswap-inplace lit binary.
  • lit.cfg.py: add .s to suffixes so lit discovers assembly-based test files.
  • test-lit/hotswap-inplace-{mixed,noop,b64,async}.s: 4 self-contained .s lit tests using %clang to assemble, hotswap-rewrite --output to rewrite, %llvm-objdump -d | %FileCheck to verify, and cmp for idempotency.
  • Removed: test-lit/hotswap-inplace.c (old lit test) and comgr-sources/hotswap-inplace.c (old C test driver). Tests now use the shared hotswap-rewrite binary from [Comgr] Add end-to-end LIT coverage for amd_comgr_hotswap_rewrite #2291.

Test plan

Lit tests (4 cases):

  • hotswap-inplace-mixed.s: cluster_load_b32 + cluster_load_b128 + s_clause. Verifies via objdump that all three are replaced (global_load_b32, global_load_b128, s_nop), original global_load instructions preserved, rewrite succeeds, idempotent via cmp.
  • hotswap-inplace-noop.s: kernel with no patchable instructions (passthrough). Verifies nothing is changed, no cluster_load or s_clause in output, idempotent.
  • hotswap-inplace-b64.s: cluster_load_b64 swapped to global_load_b64, idempotent.
  • hotswap-inplace-async.s: cluster_load_async_to_lds_{b8,b32,b64,b128} -- all four async widths swapped to global_load_async_to_lds equivalents, idempotent.

@z1-cciauto
Copy link
Copy Markdown
Collaborator

@lamb-j lamb-j added the hotswap Related to the Comgr Hotswap feature label Apr 16, 2026
@chinmaydd chinmaydd added the comgr Related to Code Object Manager label Apr 17, 2026
@lamb-j lamb-j added the ci:skip Skip all CI builds/tests for this PR label Apr 17, 2026
@z1-cciauto
Copy link
Copy Markdown
Collaborator

@xintin xintin force-pushed the hotswap-inplace-patches branch from 66d85c8 to 2d2d988 Compare April 22, 2026 12:01
@xintin xintin changed the title [AMDGPU] comgr: HotSwap inplace patches [AMDGPU] comgr: HotSwap in-place patches for B0-to-A0 rewriting Apr 22, 2026
@z1-cciauto
Copy link
Copy Markdown
Collaborator

@ronlieb
Copy link
Copy Markdown
Collaborator

ronlieb commented Apr 22, 2026

this has ci:skip tag, so i cancelled the npsdb and rockCI. if you want it tested please remove the tag

@xintin xintin force-pushed the hotswap-inplace-patches branch from 2d2d988 to d3483bc Compare April 22, 2026 16:33
@xintin xintin marked this pull request as ready for review April 22, 2026 16:35
@xintin xintin requested review from chinmaydd and lamb-j as code owners April 22, 2026 16:35
@xintin xintin removed the ci:skip Skip all CI builds/tests for this PR label Apr 22, 2026
@z1-cciauto
Copy link
Copy Markdown
Collaborator

@xintin xintin force-pushed the hotswap-inplace-patches branch from d3483bc to abe66e1 Compare April 22, 2026 17:13
@z1-cciauto
Copy link
Copy Markdown
Collaborator

Comment thread amd/comgr/src/comgr-hotswap-patch-inplace.cpp Outdated
Comment thread amd/comgr/test-lit/hotswap-inplace.c Outdated
Comment thread amd/comgr/src/comgr-hotswap-patch-inplace.cpp Outdated
Comment thread amd/comgr/src/comgr-hotswap-patch-inplace.cpp
Comment thread amd/comgr/src/comgr-hotswap-patch-inplace.cpp Outdated
Comment thread amd/comgr/src/comgr-hotswap-patch-inplace.cpp Outdated
@xintin xintin force-pushed the hotswap-inplace-patches branch from abe66e1 to d1a2e17 Compare April 22, 2026 21:59
@z1-cciauto
Copy link
Copy Markdown
Collaborator

Copy link
Copy Markdown

@chinmaydd chinmaydd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@xintin
Copy link
Copy Markdown
Author

xintin commented Apr 22, 2026

Note: We will wait for #2291 to land. Rebase, refactor and wait for ci to be green before landing this.

@lamb-j
Copy link
Copy Markdown
Collaborator

lamb-j commented Apr 22, 2026

For the test, can we update the LIT file to be a .s file? We can create multiple .s files if needed.

Also, those can be compiled with clang, and run through the existing hotswap-rewrite, instead of creating a separate hotswap-inplace.c driver


Roughly like this — the test file becomes a self-contained .s with RUN lines at the top:

# RUN: %clang -target amdgcn-amd-amdhsa -mcpu=gfx1250 %s -o %t.elf
# RUN: hotswap-rewrite %t.elf \
# RUN:     amdgcn-amd-amdhsa--gfx1250 amdgcn-amd-amdhsa--gfx1250 \
# RUN:     --output %t.out.elf \
# RUN:   | %FileCheck --check-prefix=API %s
# API: RESULT: SUCCESS

# RUN: %llvm-objdump -d %t.out.elf | %FileCheck --check-prefix=DISASM %s
# DISASM-NOT: cluster_load_b32
# DISASM-NOT: cluster_load_b128
# DISASM-NOT: s_clause
# DISASM-DAG: global_load_b32 v0
# DISASM-DAG: global_load_b128 v[4:7]
# DISASM-DAG: s_nop

.amdgcn_target "amdgcn-amd-amdhsa--gfx1250"
.text
.globl test_inplace_kernel
.p2align 8
.type test_inplace_kernel,@function
test_inplace_kernel:
 cluster_load_b32 v0, v[2:3], off
 s_wait_loadcnt 0x0
 cluster_load_b128 v[4:7], v[8:9], off
 s_wait_loadcnt 0x0
 s_clause 0x1
 global_load_b32 v10, v[2:3], off
 ...
 s_endpgm
.Ltest_inplace_kernel_end:
.size test_inplace_kernel, .Ltest_inplace_kernel_end-test_inplace_kernel
...

A few practical notes:

  • .s files use # for line comments, so RUN/CHECK lines start with # instead of //
  • %clang -target amdgcn-amd-amdhsa -mcpu=gfx1250 %s -o %t.elf should both assemble and link via lld in one call
  • The current hotswap-inplace.c shim also does an idempotency check. To preserve that in LIT, you'd run hotswap-rewrite a second time and cmp the outputs:
# RUN: hotswap-rewrite %t.out.elf ... --output %t.out2.elf | FileCheck ...
# RUN: cmp %t.out.elf %t.out2.elf

For the 3 test cases (cluster_load_b32+b128+s_clause, no-op kernel, cluster_load_b64), you'd have three separate .s files — hotswap-inplace-mixed.s, hotswap-inplace-noop.s, hotswap-inplace-b64.s — each with their own RUN/CHECK blocks.

@xintin xintin force-pushed the hotswap-inplace-patches branch from d1a2e17 to ca7d489 Compare April 22, 2026 23:42
@z1-cciauto
Copy link
Copy Markdown
Collaborator

@xintin
Copy link
Copy Markdown
Author

xintin commented Apr 23, 2026

For the test, can we update the LIT file to be a .s file? We can create multiple .s files if needed.

This is a good suggestion. PR #2291 introduces the infra for this. After rebase on #2291, the only additional change needed to hotswap-rewrite.c will be to move the dump before the memcmp and skipping equality checks when --output is set.

Update: addressed after landing #2291.

@xintin xintin force-pushed the hotswap-inplace-patches branch 2 times, most recently from f0565f9 to fddac1e Compare April 23, 2026 01:55
@z1-cciauto
Copy link
Copy Markdown
Collaborator

ronlieb pushed a commit that referenced this pull request Apr 23, 2026
PR #2203 introduced 15 weak-symbol patch / liveness / DWARF stubs
in comgr-hotswap-b0a0.cpp using GCC/Clang attribute syntax:

  __attribute__((weak)) uint32_t applyInPlacePatches(PatchContext &,
                                                     size_t) {
    return 0;
  }

MSVC does not understand this. It tokenizes __attribute__ as an
identifier and `weak` as undeclared, then errors on every weak stub:

  comgr-hotswap-b0a0.cpp(95): error C2065: 'weak': undeclared identifier
  comgr-hotswap-b0a0.cpp(95): error C2374: '__attribute__': redefinition
  [...repeats for every weak stub at lines 95, 98, 101, 104, 107, 117,
   123, 135, 141, 143, 152, 157, 162, 164, 166...]

Replace with LLVM's portable LLVM_ATTRIBUTE_WEAK macro from
llvm/Support/Compiler.h. On Linux/macOS it expands to
__attribute__((__weak__)) (current behaviour preserved); on Windows
it expands to nothing (compiles fine, no linker conflict because
b0a0.cpp is currently the only TU defining these names).

This is the immediate Windows unblock, not a complete fix for the
underlying override design. When follow-up patch PRs (#2222 in-place,
future trampoline / WMMA / scratch) land their strong-symbol overrides,
Windows will need a different mechanism (e.g. function-pointer
registration in a static initializer, or __declspec(selectany), or
#ifdef-out the patch overrides on MSVC). Tracked separately.

Failing Windows job from PR #2203:
https://github.com/ROCm/llvm-project/actions/runs/24758866527/job/72438109210
Comment thread amd/comgr/src/comgr-hotswap-patch-inplace.cpp Outdated
@xintin xintin force-pushed the hotswap-inplace-patches branch from fddac1e to 79dc77a Compare April 24, 2026 17:46
@z1-cciauto
Copy link
Copy Markdown
Collaborator

xintin added 3 commits April 24, 2026 17:47
Signed-off-by: xintin <gaurav.verma@amd.com>
Signed-off-by: xintin <gaurav.verma@amd.com>
Signed-off-by: xintin <gaurav.verma@amd.com>
@xintin xintin force-pushed the hotswap-inplace-patches branch from 79dc77a to 1ad8ed9 Compare April 24, 2026 17:56
@z1-cciauto
Copy link
Copy Markdown
Collaborator

@chinmaydd
Copy link
Copy Markdown

Do we need a restart of other build jobs ? Though we should have the ability to merge once the PSDB passes

@xintin
Copy link
Copy Markdown
Author

xintin commented Apr 24, 2026

We need a restart but seems like I do not have permissions. Can you check once?

@xintin xintin force-pushed the hotswap-inplace-patches branch from 1ad8ed9 to 4340ade Compare April 26, 2026 22:17
@z1-cciauto
Copy link
Copy Markdown
Collaborator

Comment thread amd/comgr/test-lit/comgr-sources/hotswap-rewrite.c
Comment thread amd/comgr/src/comgr-hotswap-patch-inplace.cpp
Signed-off-by: xintin <gaurav.verma@amd.com>
@xintin xintin force-pushed the hotswap-inplace-patches branch from 4340ade to 72c5252 Compare April 27, 2026 15:21
@z1-cciauto
Copy link
Copy Markdown
Collaborator

@chinmaydd chinmaydd merged commit d24c314 into ROCm:amd-staging Apr 27, 2026
37 of 39 checks passed
lamb-j pushed a commit that referenced this pull request Apr 30, 2026
… A0 (#2287)

## Summary

A0 silicon has a race condition where `s_barrier_signal_isfirst` with a
user cluster barrier ID may return garbage in SCC before the barrier
completes. The non-isfirst variant (`s_barrier_signal`) has the same
encoding size and operand layout but does not write SCC.

This patch adds the `s_barrier_signal_isfirst` -> `s_barrier_signal`
in-place opcode swap to the existing `applyInPlacePatches` dispatcher
alongside the `cluster_load -> global_load` and `s_clause -> s_nop`
rules from #2222. The swap is unconditional for all
`s_barrier_signal_isfirst` instructions. Non-isfirst `s_barrier_signal`
sites are left unchanged.

## Changes
- **comgr-hotswap-patch-inplace.cpp**: Add `s_barrier_signal_isfirst` ->
`s_barrier_signal` swap using the existing `resolveOpcode` /
`swapOpcode` helpers. Same pattern as the `cluster_load` swap -- no new
infrastructure.
- **hotswap-barrier-isfirst.s**: Positive lit test - kernel with 2x
`s_barrier_signal_isfirst -1`. Verifies both are swapped to
`s_barrier_signal`, waits are unchanged, no `s_barrier_signal_isfirst`
remains anywhere in disassembly. Idempotency check via second rewrite +
`cmp`.
- **hotswap-barrier-no-isfirst.s**: Passthrough lit test - kernel with
only `s_barrier_signal` (workgroup -1 and user cluster -3). Verifies
layout is structurally unchanged and no `s_barrier_signal_isfirst`
appears. Idempotency check.
- **hotswap-barrier-mixed.s**: Selectivity lit test - kernel with both
`s_barrier_signal_isfirst` and `s_barrier_signal` interleaved. Verifies
only the isfirst sites are swapped; the pre-existing non-isfirst site is
preserved. Guards against a `starts_with` / `contains` regression in the
mnemonic match.

## Tests
- 3 lit tests (positive / passthrough / mixed selectivity) with strict
CHECK-NEXT chains and CHECK-NOT brackets
- All existing lit, ctest, and unit tests pass

## Follow up (already tracked)  
Cache all in-place target opcodes in LLVMState at init time (covers
cluster_load, s_barrier_signal, and future in-place swaps). Currently
each swap calls resolveOpcode per instruction site;
caching at init eliminates redundant assemble+decode on every match.
This is tracked as part of the hotswap code-refactoring effort alongside
the others (#2253)

---------

Signed-off-by: xintin <gaurav.verma@amd.com>
kirthana14m pushed a commit that referenced this pull request Apr 30, 2026
… A0 (#2287)

## Summary

A0 silicon has a race condition where `s_barrier_signal_isfirst` with a
user cluster barrier ID may return garbage in SCC before the barrier
completes. The non-isfirst variant (`s_barrier_signal`) has the same
encoding size and operand layout but does not write SCC.

This patch adds the `s_barrier_signal_isfirst` -> `s_barrier_signal`
in-place opcode swap to the existing `applyInPlacePatches` dispatcher
alongside the `cluster_load -> global_load` and `s_clause -> s_nop`
rules from #2222. The swap is unconditional for all
`s_barrier_signal_isfirst` instructions. Non-isfirst `s_barrier_signal`
sites are left unchanged.

## Changes
- **comgr-hotswap-patch-inplace.cpp**: Add `s_barrier_signal_isfirst` ->
`s_barrier_signal` swap using the existing `resolveOpcode` /
`swapOpcode` helpers. Same pattern as the `cluster_load` swap -- no new
infrastructure.
- **hotswap-barrier-isfirst.s**: Positive lit test - kernel with 2x
`s_barrier_signal_isfirst -1`. Verifies both are swapped to
`s_barrier_signal`, waits are unchanged, no `s_barrier_signal_isfirst`
remains anywhere in disassembly. Idempotency check via second rewrite +
`cmp`.
- **hotswap-barrier-no-isfirst.s**: Passthrough lit test - kernel with
only `s_barrier_signal` (workgroup -1 and user cluster -3). Verifies
layout is structurally unchanged and no `s_barrier_signal_isfirst`
appears. Idempotency check.
- **hotswap-barrier-mixed.s**: Selectivity lit test - kernel with both
`s_barrier_signal_isfirst` and `s_barrier_signal` interleaved. Verifies
only the isfirst sites are swapped; the pre-existing non-isfirst site is
preserved. Guards against a `starts_with` / `contains` regression in the
mnemonic match.

## Tests
- 3 lit tests (positive / passthrough / mixed selectivity) with strict
CHECK-NEXT chains and CHECK-NOT brackets
- All existing lit, ctest, and unit tests pass

## Follow up (already tracked)  
Cache all in-place target opcodes in LLVMState at init time (covers
cluster_load, s_barrier_signal, and future in-place swaps). Currently
each swap calls resolveOpcode per instruction site;
caching at init eliminates redundant assemble+decode on every match.
This is tracked as part of the hotswap code-refactoring effort alongside
the others (#2253)

---------

Signed-off-by: xintin <gaurav.verma@amd.com>
jammm added a commit that referenced this pull request May 7, 2026
chinmaydd flagged on PR #2379 that the shared dispatcher API in
comgr-hotswap-b0a0.cpp returns uint32_t where 0 means EITHER
"this patch did not match the instruction" (correct dispatcher
fall-through) OR "matched but failed to apply" (silent miscompile
risk -- the rewriter returns SUCCESS with the original
A0-incompatible opcode left in .text). They identified this as a
cross-cutting concern across all hotswap patches (in-place #2222,
WMMA hazard #2265, WMMA split, future patches).

A proper fix is a separate follow-up upstream PR that refactors the
dispatcher to distinguish the two cases (e.g. an enum NoMatch /
Patched / Failed return type, or a `bool *Aborted` field threaded
through PatchContext that the dispatcher checks and uses to
short-circuit the rewrite with AMD_COMGR_STATUS_ERROR).

For now: classify each return-0 path in the splitter as either
"did not match" (correct) or "matched but failed" (currently silent),
add a header comment to applyWmmaSplitPatches documenting the
semantics + the cross-PR refactor concern, and tag every
matched-but-failed exit with an inline comment so it's clear which
return is the silent-miscompile risk.

The behavior is unchanged; this is documentation of an existing
limitation pending the cross-PR API refactor.
chinmaydd pushed a commit that referenced this pull request May 12, 2026
…ompatibility (#2379)

Adds the K=128 WMMA splitter as the next B0-to-A0 silicon-stepping
workaround in the COMGR HotSwap subsystem, alongside the in-place
patches landed in #2222. The splitter rewrites K=128 WMMA variants
that exist on GFX1250 B0 but not A0 into pairs of narrower WMMAs
that exist on both steppings, emitted as trampolines appended to
`.text` by the existing hotswap rewriter.

This PR plugs into the `applyWmmaSplitPatches` weak-symbol stub
already declared in `comgr-hotswap-b0a0.cpp` (from the policy/API
3-of-3 series #2203). Same plug-in pattern as
`comgr-hotswap-patch-inplace.cpp` (#2222): a single new
`comgr-hotswap-patch-wmma-split.cpp` provides the strong override,
and the integration is one CMakeLists.txt source-list line. No
changes to the existing hotswap infrastructure files.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comgr Related to Code Object Manager hotswap Related to the Comgr Hotswap feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants