rocclr: Add optional hang recovery for SDMA D2H hang (all OFF by defa…#270

Open
chun-wan wants to merge 1 commit into ROCm:develop from chun-wan:hang-recovery-feature

@chun-wan chun-wan commented Apr 2, 2026

…ult)

Add a 4-layer hang recovery mechanism controlled by HIP_HANG_RECOVERY_ENABLE (default=0, disabled). When disabled, behavior is unchanged from stock develop: all new code paths are gated by the master switch.

When HIP_HANG_RECOVERY_ENABLE=1:

L1 - Signal timeout abort (HIP_MAX_SIGNAL_WAIT, default 60s):
WaitForSignal's existing 4-sec loop is extended with a configurable
max wait. On timeout, hsa_signal_silent_store_relaxed(signal, 0) is
used to force-complete the signal (bypassing roctracer interception)
and the thread resumes. An 'aborted' flag propagates to CpuWaitForSignal.
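
As a rough illustration of the L1 idea, the sketch below bounds a polling wait and force-completes the signal on timeout. It is not the PR's WaitForSignal: `FakeSignal`, `WaitResult`, and `WaitWithTimeout` are hypothetical names, and a `std::atomic` stands in for the HSA signal so the example runs without a GPU.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>

// Stand-in for an HSA completion signal (assumption): > 0 means pending, 0 means done.
using FakeSignal = std::atomic<std::int64_t>;

struct WaitResult {
  bool completed;  // the signal reached 0 on its own
  bool aborted;    // the wait gave up and force-completed the signal
};

// Hypothetical helper: poll `signal` until it drops to 0 or `max_wait` elapses.
// max_wait == 0 means "wait forever", mirroring HIP_MAX_SIGNAL_WAIT=0.
WaitResult WaitWithTimeout(FakeSignal& signal,
                           std::chrono::milliseconds max_wait,
                           std::chrono::milliseconds poll = std::chrono::milliseconds(1)) {
  const auto start = std::chrono::steady_clock::now();
  while (signal.load(std::memory_order_acquire) > 0) {
    const bool bounded = max_wait.count() > 0;
    if (bounded && std::chrono::steady_clock::now() - start >= max_wait) {
      // The PR calls hsa_signal_silent_store_relaxed(signal, 0) at this point;
      // the relaxed atomic store below only imitates that step.
      signal.store(0, std::memory_order_relaxed);
      return {false, true};
    }
    std::this_thread::sleep_for(poll);
  }
  return {true, false};
}
```

A caller would inspect `aborted` to decide whether to trigger the later recovery layers, much as the PR propagates its flag to CpuWaitForSignal.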

L2 - Permanent SDMA bypass:
After first signal abort, SdmaHealthTracker::ForcePermanentBypass()
is called. KernelBlitManager::copyBuffer then forces shader blit path
for all subsequent copies, preventing further submissions to the
faulted SDMA engine.
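
The L2 behavior amounts to a sticky flag consulted on every copy. The following is a minimal sketch of that pattern, assuming a single process-wide flag; `SdmaBypassFlag`, `CopyPath`, and `SelectCopyPath` are illustrative names and do not reflect the actual SdmaHealthTracker or KernelBlitManager code.

```cpp
#include <atomic>
#include <cassert>

enum class CopyPath { Sdma, ShaderBlit };

// Illustrative tracker (assumption): one sticky flag that is set once and never cleared.
class SdmaBypassFlag {
 public:
  void ForcePermanentBypass() { bypass_.store(true, std::memory_order_release); }
  bool IsBypassed() const { return bypass_.load(std::memory_order_acquire); }

 private:
  std::atomic<bool> bypass_{false};
};

// Hypothetical path selection, as a copyBuffer-style routine might perform it.
CopyPath SelectCopyPath(const SdmaBypassFlag& tracker, bool sdma_preferred) {
  if (tracker.IsBypassed()) {
    return CopyPath::ShaderBlit;  // never submit to the faulted engine again
  }
  return sdma_preferred ? CopyPath::Sdma : CopyPath::ShaderBlit;
}
```

Keeping the flag one-way avoids any race where a copy is routed back to SDMA while the engine is still faulted.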

L3 - callbackQueue abort suppression:
When hang recovery is active, callbackQueue checks IsInHangRecovery()
and suppresses abort(), logging the error instead. This prevents the
process from being killed by GPU queue errors during recovery.
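
A sketch of how the L3 suppression could look: the queue error callback logs and returns while recovery is in progress, and aborts otherwise. The flag `g_in_hang_recovery` stands in for whatever IsInHangRecovery() reads, and `ClassifyQueueError` and `OnQueueError` are hypothetical helpers rather than the PR's callbackQueue.

```cpp
#include <atomic>
#include <cassert>
#include <cstdio>
#include <cstdlib>

// Stand-in for the recovery state queried via IsInHangRecovery() (assumption).
std::atomic<bool> g_in_hang_recovery{false};

enum class QueueErrorAction { Suppressed, Fatal };

// Hypothetical helper: decide what a queue error callback should do.
QueueErrorAction ClassifyQueueError(int status) {
  if (g_in_hang_recovery.load(std::memory_order_acquire)) {
    std::fprintf(stderr, "queue error %d suppressed during hang recovery\n", status);
    return QueueErrorAction::Suppressed;
  }
  return QueueErrorAction::Fatal;
}

// Callback body: abort only when no recovery is in progress.
void OnQueueError(int status) {
  if (ClassifyQueueError(status) == QueueErrorAction::Fatal) {
    std::fprintf(stderr, "queue error %d, aborting\n", status);
    std::abort();
  }
}
```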

L4 - SIGABRT handler:
hangRecoveryAbortHandler intercepts SIGABRT raised by the ROCr VM fault handler.
When recovery is active, it re-registers itself (defeating abort's
handler reset) and freezes the calling thread with pause(). The process
survives even if ROCr calls abort().
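
The L4 handler can be approximated with POSIX `sigaction` and `pause()`, both of which are async-signal-safe. This is a sketch under assumptions: whether the PR uses `sigaction` or `signal` is not stated, and `g_recovery_active`, `InstallAbortHandler`, and `HangRecoveryAbortSketch` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <signal.h>
#include <unistd.h>

// Stand-in recovery flag (assumption); the PR's own state tracking is not shown.
std::atomic<bool> g_recovery_active{false};

extern "C" void HangRecoveryAbortSketch(int signo);

// Install (or re-install) the SIGABRT handler.
bool InstallAbortHandler() {
  struct sigaction sa {};
  sa.sa_handler = HangRecoveryAbortSketch;
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = 0;
  return sigaction(SIGABRT, &sa, nullptr) == 0;
}

extern "C" void HangRecoveryAbortSketch(int signo) {
  if (g_recovery_active.load()) {
    InstallAbortHandler();  // re-arm so a later reset to SIG_DFL does not stick
    for (;;) {
      pause();  // park the calling thread; never return into abort()
    }
  }
  // Not recovering: restore the default action and re-raise so the process dies as usual.
  signal(signo, SIG_DFL);
  raise(signo);
}
```

Because the handler never returns while recovery is active, the thread that called abort() stays parked for the lifetime of the process.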

Optional debug logging via HIP_DEBUG_LOG env var (rocdebuglog.hpp). WaitActiveStreams cascade detection warns at 10K/100K idle iterations.

Background: Multi-process VRAM oversubscription can trigger KFD BO eviction during in-flight SDMA D2H copies, causing HSA signals to never complete. Without recovery, all threads hang permanently.

Env vars:
HIP_HANG_RECOVERY_ENABLE=0|1 (master switch, default 0)
HIP_MAX_SIGNAL_WAIT=N (seconds, default 60, 0=infinite)
HIP_DEBUG_LOG=0|1|path (optional logging, default off)
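
For illustration, the three variables could be read as follows. This is a sketch, not the PR's parsing code: `HangRecoveryConfig` and `ReadHangRecoveryConfig` are hypothetical, and falling back to the default on malformed input is an assumption.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

struct HangRecoveryConfig {
  bool enabled = false;             // HIP_HANG_RECOVERY_ENABLE, default 0
  unsigned long max_wait_sec = 60;  // HIP_MAX_SIGNAL_WAIT, default 60, 0 = infinite
  bool log_enabled = false;         // HIP_DEBUG_LOG, default off
  std::string log_path;             // set when HIP_DEBUG_LOG names a file
};

// Hypothetical reader for the environment variables listed above.
HangRecoveryConfig ReadHangRecoveryConfig() {
  HangRecoveryConfig cfg;
  if (const char* v = std::getenv("HIP_HANG_RECOVERY_ENABLE")) {
    cfg.enabled = std::string(v) == "1";
  }
  if (const char* v = std::getenv("HIP_MAX_SIGNAL_WAIT")) {
    char* end = nullptr;
    const unsigned long n = std::strtoul(v, &end, 10);
    // Accept only plain decimal digits; keep the default otherwise (assumption).
    if (v[0] >= '0' && v[0] <= '9' && *end == '\0') {
      cfg.max_wait_sec = n;
    }
  }
  if (const char* v = std::getenv("HIP_DEBUG_LOG")) {
    const std::string s(v);
    if (s == "1") {
      cfg.log_enabled = true;
    } else if (!s.empty() && s != "0") {
      cfg.log_enabled = true;
      cfg.log_path = s;  // any other value is treated as a log file path
    }
  }
  return cfg;
}
```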

Tested with a KFD eviction reproducer: 120s stress test with HANG_HOGS=6 and HIP_MAX_SIGNAL_WAIT=4 gave 26 hang recoveries, 0 permanent hangs, and 0 core dumps; the process survives to EXIT=0.

Made-with: Cursor

Associated JIRA ticket number/Github issue number

What type of PR is this? (check all applicable)

  • Refactor
  • Feature
  • Bug Fix
  • Optimization
  • Documentation Update
  • Continuous Integration

What were the changes?

Why are these changes needed?

Updated CHANGELOG?

  • Yes
  • No, Does not apply to this PR.

Added/Updated documentation?

  • Yes
  • No, Does not apply to this PR.

Additional Checks

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally.
  • Any dependent changes have been merged.

@chun-wan chun-wan force-pushed the hang-recovery-feature branch from d1ac31a to dc9693b Compare April 2, 2026 14:20
Co-authored-by: Clement Lin <clement.lin@amd.com>
Made-with: Cursor
@chun-wan chun-wan force-pushed the hang-recovery-feature branch from dc9693b to d473d6a Compare April 4, 2026 05:22