Skip to content

[AMDGPU] comgr: Add DS_ADDTID trampoline patch for A0 16-bit M0 truncation#2302

Open
xintin wants to merge 4 commits into
ROCm:amd-stagingfrom
xintin:hotswap-ds-load-store-addtid
Open

[AMDGPU] comgr: Add DS_ADDTID trampoline patch for A0 16-bit M0 truncation#2302
xintin wants to merge 4 commits into
ROCm:amd-stagingfrom
xintin:hotswap-ds-load-store-addtid

Conversation

@xintin
Copy link
Copy Markdown

@xintin xintin commented Apr 23, 2026

Stacked on #2583.

Problem

ds_load_addtid_b32 and ds_store_addtid_b32 compute LDS addresses as M0 + laneID * 4 + offset. On B0 the DS unit reads 20 bits of M0 (1 MB); on A0 it reads only 16 (64 KB), silently truncating bits [19:16] (DEGFXMI400-12025). gfx1250 supports up to 320 KB LDS per WGP, so the upper 256 KB is unreachable through ADDTID. When B0-compiled code using ADDTID is hotswapped onto A0, M0 values above 65535 wrap and produce silently wrong LDS accesses.

Fix

patchDsAddtid (in comgr-hotswap-patch-trampoline.cpp) rewrites each ADDTID instruction into an ALU sequence (v_mbcnt_lo_u32_b32 / v_mbcnt_hi_u32_b32 for laneID, v_lshlrev_b32 ×4, v_add_u32 with M0 as a 32-bit operand) followed by a regular ds_load_b32 / ds_store_b32.
The M0 read is bypassed and the original offset is preserved.
Emission goes through emitReplacementCode (NOP sled preferred, deferred-trampoline fallback). GDS=1 is rejected with a diagnostic.

  • Load reuses vDST as the laneID temporary (no extra VGPR).
  • Store pulls a dead VGPR via allocScratchVgpr. If none is available the patch bails; when the kernel's LDS exceeds 64 KiB a warning is also logged so the A0 truncation risk surfaces to the user. The LDS size is read through ElfView::getKernelLdsSize, added in [AMDGPU] comgr: Add ElfView::getKernelLdsSize helper #2583.

Testing

  • test-lit/hotswap-trampoline-addtid.s : sled-based rewrite, operand pinned DISASM across the full 6-instruction body.
  • test-lit/hotswap-trampoline-addtid-nosled.s : deferred-trampoline fallback.
  • test-unit/HotswapMCTest.cpp AddTid suite: ADDTID assemble/decode round-trips, operand layout, GDS rejection, trampoline asm assembly, buildTrampoline integration.

@xintin xintin added comgr Related to Code Object Manager hotswap Related to the Comgr Hotswap feature ci:skip Skip all CI builds/tests for this PR labels Apr 23, 2026
@z1-cciauto
Copy link
Copy Markdown
Collaborator

@xintin xintin changed the title [comgr][hotswap] Add DS_ADDTID trampoline patch for A0 16-bit M0 truncation [AMDGPU] comgr: Add DS_ADDTID trampoline patch for A0 16-bit M0 truncation Apr 30, 2026
@xintin xintin force-pushed the hotswap-ds-load-store-addtid branch from 961821a to d70fc7a Compare May 15, 2026 21:20
@xintin xintin force-pushed the hotswap-ds-load-store-addtid branch from d70fc7a to 8db0906 Compare May 18, 2026 12:03
Comment thread amd/comgr/src/comgr-hotswap-patch-trampoline.cpp Outdated
@xintin xintin force-pushed the hotswap-ds-load-store-addtid branch from 8db0906 to 42c9d6b Compare May 20, 2026 07:27
@xintin xintin marked this pull request as ready for review May 20, 2026 07:28
@xintin xintin requested review from chinmaydd and lamb-j as code owners May 20, 2026 07:28
@xintin xintin requested a review from jmmartinez May 20, 2026 07:40
@xintin xintin removed the ci:skip Skip all CI builds/tests for this PR label May 20, 2026
Comment thread amd/comgr/src/comgr-hotswap-elf.cpp
xintin added a commit that referenced this pull request May 20, 2026
Adds a small `ElfView` helper that reads `group_segment_fixed_size` from
the kernel descriptor symbol `<KernelName>.kd`. Returns` std::nullopt`
when the descriptor symbol is missing.

Split out as a precursor to the `ds_*_addtid_b32` trampoline patch (PR
#2302). That patch will use the helper to gate a diagnostic for kernels
with LDS allocations larger than the A0 16-bit M0 truncation limit.

Unit test `ElfView.GetKernelLdsSizeReturnsNulloptWhenKdMissing` covers
the missing-descriptor path with a hand-crafted minimal ELF; the
positive path will be exercised through lit once a consumer lands.

---------

Signed-off-by: xintin <gaurav.verma@amd.com>
Comment thread amd/comgr/src/comgr-hotswap-patch-trampoline.cpp Outdated
Comment thread amd/comgr/src/comgr-hotswap-patch-trampoline.cpp Outdated
Comment thread amd/comgr/src/comgr-hotswap-patch-trampoline.cpp
Comment thread amd/comgr/src/comgr-hotswap-patch-trampoline.cpp
Comment thread amd/comgr/src/comgr-hotswap-patch-trampoline.cpp Outdated
Comment thread amd/comgr/test-lit/hotswap-trampoline-addtid-nosled.s Outdated
Comment thread amd/comgr/test-lit/hotswap-trampoline-addtid.s Outdated
Comment thread amd/comgr/src/comgr-hotswap-patch-trampoline.cpp Outdated
Comment thread amd/comgr/src/comgr-hotswap-internal.h Outdated
Comment thread amd/comgr/src/comgr-hotswap-patch-trampoline.cpp
Comment thread amd/comgr/test-lit/hotswap-trampoline-addtid.s Outdated
…ation

Rewrites `ds_*_addtid_b32` into an ALU-computed `ds_*_b32` so the
address bypasses the 16-bit M0 read on A0 (DEGFXMI400-12025). A
diagnostic is emitted when the kernel's LDS allocation exceeds 64 KiB.

Tests: lit `hotswap-trampoline-addtid{,-nosled}.s` and gtest `AddTid`
in `HotswapMCTest.cpp`.

Stacked on llvm#2583.

Assisted-by: Cursor
Signed-off-by: xintin <gaurav.verma@amd.com>
Signed-off-by: xintin <gaurav.verma@amd.com>
@xintin xintin force-pushed the hotswap-ds-load-store-addtid branch from 65115ab to 12de2b6 Compare May 21, 2026 21:10
Signed-off-by: xintin <gaurav.verma@amd.com>
@xintin xintin force-pushed the hotswap-ds-load-store-addtid branch from 12de2b6 to cd69a96 Compare May 21, 2026 21:13
Copy link
Copy Markdown

@chinmaydd chinmaydd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for addressing comments. Please ensure to keep the commit message minimal during squash+merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comgr Related to Code Object Manager hotswap Related to the Comgr Hotswap feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants