rocclr: Add optional hang recovery for SDMA D2H hang (all OFF by defa… by chun-wan · Pull Request #270 · ROCm/clr

chun-wan · 2026-04-02T02:56:16Z

…ult)

Add a 4-layer hang recovery mechanism controlled by HIP_HANG_RECOVERY_ENABLE (default=0, disabled). When disabled, there is no behavioral change from stock develop: all new code paths are gated by the master switch.

When HIP_HANG_RECOVERY_ENABLE=1:

L1 - Signal timeout abort (HIP_MAX_SIGNAL_WAIT, default 60s):
WaitForSignal's existing 4-sec loop is extended with a configurable
max wait. On timeout, hsa_signal_silent_store_relaxed(signal, 0) is
used to force-complete the signal (bypassing roctracer interception)
and the thread resumes. An 'aborted' flag propagates to CpuWaitForSignal.
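
A minimal sketch of the L1 timeout logic, assuming a stand-in signal type rather than a real hsa_signal_t. FakeSignal, WaitWithTimeout, and the poll interval parameter are hypothetical names for illustration only; the PR's actual change lives inside WaitForSignal and force-completes with hsa_signal_silent_store_relaxed(signal, 0).

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>

// Stand-in for an HSA signal: nonzero means "pending", 0 means "complete".
// (Hypothetical type; the real code operates on hsa_signal_t.)
struct FakeSignal {
  std::atomic<std::int64_t> value{1};
};

// Waits until the signal reaches 0 or max_wait elapses. A max_wait of zero
// means "wait forever", mirroring HIP_MAX_SIGNAL_WAIT=0. On timeout the
// signal is force-completed and *aborted is set, so callers can tell a real
// completion from a recovery.
inline bool WaitWithTimeout(FakeSignal& sig,
                            std::chrono::milliseconds max_wait,
                            std::chrono::milliseconds poll_interval,
                            bool* aborted) {
  assert(aborted != nullptr && poll_interval.count() > 0);
  *aborted = false;
  const auto start = std::chrono::steady_clock::now();
  while (sig.value.load(std::memory_order_acquire) != 0) {
    const auto waited = std::chrono::steady_clock::now() - start;
    if (max_wait.count() > 0 && waited >= max_wait) {
      // The PR describes hsa_signal_silent_store_relaxed(signal, 0) here;
      // this sketch just stores 0 into the stand-in.
      sig.value.store(0, std::memory_order_relaxed);
      *aborted = true;
      return false;
    }
    std::this_thread::sleep_for(poll_interval);
  }
  return true;
}
```

Returning the aborted flag separately from the return value is one way to let an outer caller (CpuWaitForSignal in the PR) react to a forced completion instead of treating it as success.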

L2 - Permanent SDMA bypass:
After the first signal abort, SdmaHealthTracker::ForcePermanentBypass()
is called. KernelBlitManager::copyBuffer then forces the shader blit
path for all subsequent copies, preventing further submissions to the
faulted SDMA engine.
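
The bypass can be pictured as a one-way latch consulted on every copy. The sketch below is an assumption about the shape of that logic, not the PR's code: HealthTracker, CopyPath, and SelectCopyPath are hypothetical stand-ins for SdmaHealthTracker and the decision made in KernelBlitManager::copyBuffer.

```cpp
#include <atomic>
#include <cassert>

enum class CopyPath { kSdma, kShaderBlit };

// Hypothetical stand-in for the PR's SdmaHealthTracker.
class HealthTracker {
 public:
  // One-way latch: once set, nothing in this sketch ever clears it.
  void ForcePermanentBypass() { bypass_.store(true, std::memory_order_release); }
  bool IsBypassed() const { return bypass_.load(std::memory_order_acquire); }

 private:
  std::atomic<bool> bypass_{false};
};

// Picks the engine for one copy. prefer_sdma models whatever the normal
// policy would have chosen; the latch overrides it unconditionally.
inline CopyPath SelectCopyPath(const HealthTracker& tracker, bool prefer_sdma) {
  if (tracker.IsBypassed()) return CopyPath::kShaderBlit;
  return prefer_sdma ? CopyPath::kSdma : CopyPath::kShaderBlit;
}
```

An atomic flag keeps the check cheap on the copy path and safe to set from whichever thread observes the first timeout.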

L3 - callbackQueue abort suppression:
When hang recovery is active, callbackQueue checks IsInHangRecovery()
and suppresses abort(), logging the error instead. This prevents the
process from being killed by GPU queue errors during recovery.
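
As a rough illustration of the L3 gate, assuming a global recovery flag: g_in_hang_recovery, DecideQueueErrorAction, and OnQueueError are hypothetical names, and only IsInHangRecovery() is taken from the description above. The real callbackQueue signature and its logging are not reproduced here.

```cpp
#include <atomic>
#include <cassert>
#include <cstdio>
#include <cstdlib>

// Hypothetical process-wide flag set once a signal wait has been aborted.
std::atomic<bool> g_in_hang_recovery{false};

inline bool IsInHangRecovery() {
  return g_in_hang_recovery.load(std::memory_order_acquire);
}

enum class QueueErrorAction { kAbort, kLogOnly };

// Keeps the decision separate from the side effect so it can be checked
// without killing the process.
inline QueueErrorAction DecideQueueErrorAction() {
  return IsInHangRecovery() ? QueueErrorAction::kLogOnly
                            : QueueErrorAction::kAbort;
}

// Sketch of a queue error callback: always log, abort only outside recovery.
inline void OnQueueError(int status) {
  std::fprintf(stderr, "GPU queue error, status=%d\n", status);
  if (DecideQueueErrorAction() == QueueErrorAction::kAbort) std::abort();
}
```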

L4 - SIGABRT handler:
hangRecoveryAbortHandler intercepts SIGABRT raised by the ROCr VM fault
handler. When recovery is active, it re-registers itself (defeating
abort's handler reset) and freezes the calling thread with pause(). The
process survives even if ROCr calls abort().
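
A sketch of what such a handler might look like on POSIX, using only async-signal-safe calls (sigaction, signal, raise, pause). The names g_recovery_active, g_aborts_intercepted, InstallAbortHandler, and HangRecoveryAbortHandler are assumptions for this example; the behavior when recovery is inactive (restore the default action and re-raise) is also an assumption, since the description does not say what the PR does in that case.

```cpp
#include <atomic>
#include <cassert>
#include <csignal>
#include <signal.h>
#include <unistd.h>

std::atomic<bool> g_recovery_active{false};  // hypothetical recovery flag
std::atomic<int> g_aborts_intercepted{0};    // hypothetical counter

extern "C" void HangRecoveryAbortHandler(int signo);

// Installs (or re-installs) the handler for SIGABRT. Returns true on success.
inline bool InstallAbortHandler() {
  struct sigaction sa {};
  sa.sa_handler = HangRecoveryAbortHandler;
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = 0;
  return sigaction(SIGABRT, &sa, nullptr) == 0;
}

extern "C" void HangRecoveryAbortHandler(int signo) {
  if (!g_recovery_active.load()) {
    // Assumed fallback: not recovering, so let the process die as usual.
    signal(signo, SIG_DFL);
    raise(signo);
    return;
  }
  // Recovery path: make sure the handler stays installed, then park this
  // thread forever so control never returns to abort().
  InstallAbortHandler();
  g_aborts_intercepted.fetch_add(1);
  for (;;) pause();
}
```

If a SIGABRT handler returns, abort() still terminates the process, which is why the recovery path in this sketch never returns. The cost is that the thread that called abort() stays parked for the rest of the process lifetime.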

Optional debug logging via HIP_DEBUG_LOG env var (rocdebuglog.hpp). WaitActiveStreams cascade detection warns at 10K/100K idle iterations.

Background: Multi-process VRAM oversubscription can trigger KFD BO eviction during in-flight SDMA D2H copies, causing HSA signals to never complete. Without recovery, all threads hang permanently.

Env vars:
HIP_HANG_RECOVERY_ENABLE=0|1 (master switch, default 0)
HIP_MAX_SIGNAL_WAIT=N (seconds, default 60, 0=infinite)
HIP_DEBUG_LOG=0|1|path (optional logging, default off)
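
For illustration, the three variables could be read as follows. This is a sketch under assumptions: HangRecoveryConfig, ParseSeconds, and LoadConfigFromEnv are hypothetical names, and falling back to the default on malformed input is a choice made for this example, not something the PR states.

```cpp
#include <cassert>
#include <cctype>
#include <cstdlib>
#include <string>

struct HangRecoveryConfig {
  bool enabled = false;           // HIP_HANG_RECOVERY_ENABLE, default 0
  unsigned long max_wait_s = 60;  // HIP_MAX_SIGNAL_WAIT, 0 = infinite
  std::string debug_log;          // HIP_DEBUG_LOG: "" = off, "1" or a path
};

// Parses a non-negative decimal number of seconds; any malformed input
// (empty, signed, trailing characters) yields the fallback.
inline unsigned long ParseSeconds(const char* text, unsigned long fallback) {
  if (text == nullptr || !std::isdigit(static_cast<unsigned char>(*text))) {
    return fallback;
  }
  char* end = nullptr;
  const unsigned long value = std::strtoul(text, &end, 10);
  return (*end == '\0') ? value : fallback;
}

inline HangRecoveryConfig LoadConfigFromEnv() {
  HangRecoveryConfig cfg;
  const char* enable = std::getenv("HIP_HANG_RECOVERY_ENABLE");
  cfg.enabled = (enable != nullptr && std::string(enable) == "1");
  cfg.max_wait_s = ParseSeconds(std::getenv("HIP_MAX_SIGNAL_WAIT"), 60);
  const char* log = std::getenv("HIP_DEBUG_LOG");
  // Unset or "0" leaves logging off; "1" or a path is stored verbatim.
  if (log != nullptr && std::string(log) != "0") cfg.debug_log = log;
  return cfg;
}
```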

Tested with a KFD eviction reproducer: a 120s stress test with HANG_HOGS=6 and HIP_MAX_SIGNAL_WAIT=4 produced 26 hang recoveries, 0 permanent hangs, and 0 coredumps, and the process survived to EXIT=0.

Made-with: Cursor

Associated JIRA ticket number/Github issue number

What type of PR is this? (check all applicable)

What were the changes?

Why are these changes needed?

Updated CHANGELOG?

Yes
No, Does not apply to this PR.

Added/Updated documentation?

Yes
No, Does not apply to this PR.

Additional Checks

I have added tests relevant to the introduced functionality, and the unit tests are passing locally.
Any dependent changes have been merged.

Co-authored-by: Clement Lin <clement.lin@amd.com>

chun-wan force-pushed the hang-recovery-feature branch from d1ac31a to dc9693b Compare April 2, 2026 14:20

chun-wan force-pushed the hang-recovery-feature branch from dc9693b to d473d6a Compare April 4, 2026 05:22

chun-wan wants to merge 1 commit into ROCm:develop from chun-wan:hang-recovery-feature
