Skip to content

refactor(core): drain GPU workers on unregister via shared-ref shutdown#329

Open
xiaguan wants to merge 2 commits into
masterfrom
refactor/instance-shutdown-drain-join
Open

refactor(core): drain GPU workers on unregister via shared-ref shutdown#329
xiaguan wants to merge 2 commits into
masterfrom
refactor/instance-shutdown-drain-join

Conversation

@xiaguan

@xiaguan xiaguan commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator

What

Instance teardown now drains and joins each GPU worker thread before returning, so any in-flight save/load finishes touching GPU memory before the caller releases CUDA tensors. This closes the use-after-unmap window on the unregister / cleanup path (drain-then-unmap).

Why the rewrite

The earlier approach inferred "in-flight" from Arc::strong_count and polled with a 20s/50ms retry loop, refusing to unregister a "busy" instance (InstanceBusy). That is fragile (any incidental Arc clone breaks it; it doesn't even cover fire-and-forget loads — only the thread join() does) and it only existed to work around one constraint: shutdown needed owned access (Arc::try_unwrap) to join() the worker.

The fix makes worker shutdown callable through a shared reference: tx/handle live behind Mutex<Option<…>>, so shutdown(&self) closes the channel (draining queued tasks) and joins the thread without owning the GpuContext. The real safety guarantee — the join() plus "release tensors after unregister" ordering — is unchanged; the strong_count scaffolding is just deleted.

Changes

  • gpu_worker.rs: Worker<T> shutdown via interior mutability; GpuWorkerPool::shutdown(&self). In-flight tasks drain before the join (crossbeam channel semantics).
  • instance.rs: InstanceContext::shutdown(&self) / GpuContext::shutdown(&self); registry unregister removes-then-shuts-down. Removed strong_count gating, active_gpu_refs, InstanceBusy*, take_idle* retry loops, timeout constants. Registry now uses parking_lot::RwLock (no poison ceremony, no .expect() panic in list_ids).
  • lib.rs: dropped EngineError::{InstanceBusy, Poisoned}; unregister_all_instances is best-effort and returns Vec<String> again (one stuck instance no longer blocks teardown of the others). Documented that unregister_* block on join → must run via spawn_blocking.
  • service.rs / http_server.rs: error-mapping arms for the removed variants dropped; cleanup keeps the "engine unregister, then release CUDA tensors" order.

Sync-lock-in-async note

All locks are parking_lot (sync). Audited: no guard is held across an .await or across the blocking join (save submit drops its guard before awaiting the reply; registry write guard is a temporary dropped before shutdown; metadata is cloned-and-released before joining). Sync is also requiredDrop must be able to call shutdown and cannot .await, and join is blocking (hence spawn_blocking). parking_lot guards are !Send, so holding one across an await wouldn't compile anyway.

Tests

New GPU integration tests pegaflow-core/tests/unregister_lifecycle.rs (real engine, real CUDA, public API, no mocks):

  • unregister_while_saving_is_safe — 16 concurrent saves race unregister_instance. Verified non-vacuous via temporary instrumentation: every run drains some in-flight saves to completion and rejects others cleanly; 10× stable.
  • unregister_all_drains_every_instance_under_load — 3 instances under concurrent save load, unregister_all tears all down and empties the registry.

Every interleaving satisfies the invariant (a save either completes or fails cleanly); a panic / CUDA fault is the regression.

Full cargo test --release --no-default-features --features cuda-13,rdma workspace suite passes, including http_cleanup_hang_repro and mock_vllm_rpc_e2e.

🤖 Generated with Claude Code

xiaguan and others added 2 commits June 8, 2026 13:29
Instance unregister now closes each GPU worker's channel and joins the
thread before returning, so any in-flight save/load finishes touching GPU
memory before the caller releases CUDA tensors — no use-after-unmap.

Worker shutdown runs through a shared reference (tx/handle behind
Mutex<Option<...>>), so an instance can be torn down while other tasks
still hold an Arc to its GpuContext. This removes the ownership handoff
that previously forced the whole gating edifice: Arc::strong_count
polling, active_gpu_refs, InstanceBusy, the take_idle retry loop, and the
20s/50ms timeout constants are all gone.

- unregister_all is best-effort: one instance can no longer block teardown
  of the others (was all-or-nothing)
- registry moved to parking_lot::RwLock, dropping the Poisoned error path
  and the inconsistent .expect() panic in list_ids
- document that unregister_* block on join and must run via spawn_blocking
- add GPU integration tests for unregister / unregister_all racing with
  concurrent saves (every interleaving valid; panic/CUDA fault = regression)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The task queue already uses crossbeam (its Sender must be Sync to live behind Arc<GpuContext>). Move the one-shot worker-ready handshake onto crossbeam too and drop the std::sync::mpsc import — one channel abstraction in the file, no 'as std_mpsc' alias.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@GentleCold

Copy link
Copy Markdown
Collaborator

I found two lifecycle issues that should be addressed before merging:

  1. High: cleanup can drop tensors for a newly registered instance with the same instance_id.

InstanceRegistry::unregister() removes the instance from the engine map before shutdown() drains and joins the workers. While that join is still running, a new register_context_batch using the same instance_id can create a fresh engine instance and register fresh CUDA IPC tensors. After the old unregister finishes, cleanup_instance() / HTTP cleanup calls registry.drop_instance(instance_id) by prefix, or clear() for cleanup-all. That can release the new instance's tensors while the new engine instance still holds their data_ptrs, leaving later save/load operations pointing at unmapped CUDA IPC memory.

The fix probably needs either a per-instance cleanup barrier that prevents same-id registration while teardown is in progress, or an engine/registry generation token so cleanup only releases tensors belonging to the old generation captured at cleanup start.

  1. Medium: if worker shutdown returns an error, the instance is already removed but tensors and leases are not cleaned up.

PegaEngine::unregister_instance() propagates the error from instances.unregister(instance_id) before calling query_leases.release_instance(instance_id). The gRPC cleanup path also returns immediately for non-InstanceMissing errors, before registry.drop_instance(). But InstanceRegistry::unregister() has already removed the instance from the map at that point. For example, a worker panic during join() would leave the engine instance gone while CUDA tensors and query leases remain registered.

Please consider returning a richer unregister result such as "removed + shutdown_result", or otherwise ensuring that once an instance has been removed and no worker is still running, query leases and registry tensors are released even if shutdown reports an error.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants