rocclr: Add optional hang recovery for SDMA D2H hang (all OFF by defa…#270
Open
chun-wan wants to merge 1 commit into
Co-authored-by: Clement Lin <clement.lin@amd.com>
…ult)
Add a 4-layer hang recovery mechanism controlled by HIP_HANG_RECOVERY_ENABLE (default=0, disabled). When disabled, there is zero behavioral change from stock develop: all new code paths are gated by the master switch.
When HIP_HANG_RECOVERY_ENABLE=1:
L1 - Signal timeout abort (HIP_MAX_SIGNAL_WAIT, default 60s):
WaitForSignal's existing 4-sec loop is extended with a configurable
max wait. On timeout, hsa_signal_silent_store_relaxed(signal, 0) is
used to force-complete the signal (bypassing roctracer interception)
and the thread resumes. An 'aborted' flag propagates to CpuWaitForSignal.
L2 - Permanent SDMA bypass:
After first signal abort, SdmaHealthTracker::ForcePermanentBypass()
is called. KernelBlitManager::copyBuffer then forces shader blit path
for all subsequent copies, preventing further submissions to the
faulted SDMA engine.
L3 - callbackQueue abort suppression:
When hang recovery is active, callbackQueue checks IsInHangRecovery()
and suppresses abort(), logging the error instead. This prevents the
process from being killed by GPU queue errors during recovery.
L4 - SIGABRT handler:
hangRecoveryAbortHandler intercepts the SIGABRT raised by the ROCr VM fault handler.
When recovery is active, it re-registers itself (defeating abort's
handler reset) and freezes the caller thread with pause(). The process
survives even if ROCr calls abort().
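The L4 handler can be illustrated with a minimal POSIX sketch. It is not the PR's code: `g_inHangRecovery` stands in for whatever state `IsInHangRecovery()` reads, and `InstallAbortHandlerSketch` and `HangRecoveryAbortHandlerSketch` are hypothetical names. It relies on the documented behavior of abort(): if a SIGABRT handler returns, the default disposition is restored and the signal is raised again, so the handler has to avoid returning while recovery is active.

```cpp
#include <atomic>
#include <csignal>
#include <signal.h>
#include <unistd.h>

// Hypothetical stand-in for the recovery state the PR exposes via IsInHangRecovery().
static std::atomic<bool> g_inHangRecovery{false};

extern "C" void HangRecoveryAbortHandlerSketch(int signo);

// Installs (or re-installs) the handler for SIGABRT. Returns true on success.
inline bool InstallAbortHandlerSketch() {
  struct sigaction sa{};
  sa.sa_handler = HangRecoveryAbortHandlerSketch;
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = 0;
  return sigaction(SIGABRT, &sa, nullptr) == 0;
}

extern "C" void HangRecoveryAbortHandlerSketch(int /*signo*/) {
  if (g_inHangRecovery.load()) {
    // Re-register so the disposition stays on this handler, as the PR describes.
    InstallAbortHandlerSketch();
    // Never return: the calling thread is parked so abort() cannot finish.
    for (;;) {
      pause();
    }
  }
  // Recovery is not active: keep the normal abort behavior.
  signal(SIGABRT, SIG_DFL);
  raise(SIGABRT);
}
```

In this sketch only the thread that called abort() is frozen; other threads keep running, which matches the description's claim that the process survives. The calls used inside the handler (sigaction, pause, signal, raise) are async-signal-safe, and the flag is assumed to be a lock-free atomic.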
Optional debug logging via HIP_DEBUG_LOG env var (rocdebuglog.hpp). WaitActiveStreams cascade detection warns at 10K/100K idle iterations.
Background: Multi-process VRAM oversubscription can trigger KFD BO eviction during in-flight SDMA D2H copies, causing HSA signals to never complete. Without recovery, all threads hang permanently.
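To make the interaction between L1 and L2 concrete, here is a rough sketch of a bounded wait that force-completes a stuck signal and then flips a sticky bypass flag. It does not use the HSA runtime: `FakeSignal` is a plain atomic standing in for an HSA signal, and `BypassFlag`, `WaitWithTimeout`, and `ChooseCopyPath` are hypothetical names, loosely modeled on the PR's `SdmaHealthTracker::ForcePermanentBypass()` and `KernelBlitManager::copyBuffer`. The polling interval is also an assumption.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <thread>

// Stand-in for an HSA signal: nonzero means pending, 0 means complete.
using FakeSignal = std::atomic<std::int64_t>;

// Sticky process-wide flag; once set it is never cleared.
class BypassFlag {
 public:
  static void ForcePermanentBypass() { Flag().store(true, std::memory_order_release); }
  static bool IsBypassed() { return Flag().load(std::memory_order_acquire); }

 private:
  static std::atomic<bool>& Flag() {
    static std::atomic<bool> flag{false};
    return flag;
  }
};

enum class WaitResult { Completed, Aborted };

// Polls the signal until it reaches 0 or maxWait elapses (0 means wait forever).
// On timeout the signal is force-completed and the bypass flag is set.
inline WaitResult WaitWithTimeout(
    FakeSignal& signal, std::chrono::milliseconds maxWait,
    std::chrono::milliseconds pollInterval = std::chrono::milliseconds(1)) {
  const bool infinite = (maxWait.count() == 0);
  const auto deadline = std::chrono::steady_clock::now() + maxWait;
  while (signal.load(std::memory_order_acquire) != 0) {
    if (!infinite && std::chrono::steady_clock::now() >= deadline) {
      // The PR uses hsa_signal_silent_store_relaxed(signal, 0) at this point.
      signal.store(0, std::memory_order_relaxed);
      BypassFlag::ForcePermanentBypass();
      return WaitResult::Aborted;
    }
    std::this_thread::sleep_for(pollInterval);
  }
  return WaitResult::Completed;
}

enum class CopyPath { Sdma, ShaderBlit };

// After the first aborted wait, every later copy avoids the SDMA engine.
inline CopyPath ChooseCopyPath() {
  return BypassFlag::IsBypassed() ? CopyPath::ShaderBlit : CopyPath::Sdma;
}
```

The `Aborted` result plays the role of the 'aborted' flag that the description says propagates to CpuWaitForSignal; callers would be expected to treat the data from an aborted copy as invalid.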
Env vars:
HIP_HANG_RECOVERY_ENABLE=0|1 (master switch, default 0)
HIP_MAX_SIGNAL_WAIT=N (seconds, default 60, 0=infinite)
HIP_DEBUG_LOG=0|1|path (optional logging, default off)
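As an illustration of how these three variables might be interpreted, the sketch below parses them into a small struct using the documented defaults. The struct, its field names, and both functions are hypothetical, and the handling of malformed values (falling back to the default) is an assumption that the description does not state.

```cpp
#include <cctype>
#include <cerrno>
#include <cstdlib>
#include <cstring>
#include <string>

// Hypothetical settings holder; names are illustrative, not rocclr's.
struct HangRecoveryConfig {
  bool enabled = false;                 // HIP_HANG_RECOVERY_ENABLE, default 0
  unsigned long maxSignalWaitSec = 60;  // HIP_MAX_SIGNAL_WAIT, 0 = infinite
  bool debugLog = false;                // HIP_DEBUG_LOG, default off
  std::string debugLogPath;             // set only when HIP_DEBUG_LOG is a path
};

// Parses raw values (nullptr means "variable not set").
inline HangRecoveryConfig ParseHangRecoveryConfig(const char* enable,
                                                  const char* maxWait,
                                                  const char* debugLog) {
  HangRecoveryConfig cfg;
  // Assumption: only the exact string "1" turns the master switch on.
  cfg.enabled = (enable != nullptr && std::strcmp(enable, "1") == 0);

  // Accept only plain non-negative decimal integers; otherwise keep the default.
  if (maxWait != nullptr && std::isdigit(static_cast<unsigned char>(maxWait[0]))) {
    char* end = nullptr;
    errno = 0;
    const unsigned long value = std::strtoul(maxWait, &end, 10);
    if (errno == 0 && *end == '\0') {
      cfg.maxSignalWaitSec = value;
    }
  }

  // "0" or unset: off. "1": on with no explicit path. Anything else: a path.
  if (debugLog != nullptr && *debugLog != '\0' && std::strcmp(debugLog, "0") != 0) {
    cfg.debugLog = true;
    if (std::strcmp(debugLog, "1") != 0) {
      cfg.debugLogPath = debugLog;
    }
  }
  return cfg;
}

// Convenience wrapper that reads the process environment.
inline HangRecoveryConfig LoadHangRecoveryConfigFromEnv() {
  return ParseHangRecoveryConfig(std::getenv("HIP_HANG_RECOVERY_ENABLE"),
                                 std::getenv("HIP_MAX_SIGNAL_WAIT"),
                                 std::getenv("HIP_DEBUG_LOG"));
}
```

Separating the parser from the getenv wrapper keeps the parsing logic easy to exercise without modifying the process environment.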
Tested with a KFD eviction reproducer: a 120s stress test with HANG_HOGS=6 and HIP_MAX_SIGNAL_WAIT=4 produced 26 hang recoveries, 0 permanent hangs, and 0 core dumps, and the process survived to EXIT=0.
Made-with: Cursor