fix(shm): prevent watchdog invalidation of in-transit SHM chunks#2632
fix(shm): prevent watchdog invalidation of in-transit SHM chunks#2632YuanYuYuan wants to merge 11 commits into
Conversation
Add `transport_ref_count: AtomicU32` to `ChunkHeaderType`. The TX path increments this counter when a chunk is promoted to ShmPtr for transport; the watchdog validator skips invalidation while the count is above zero; the RX path decrements it in `read_shmbuf()` after installing its own `ConfirmedDescriptor`. This closes the race window between the sender releasing its `ConfirmedDescriptor` (after serialisation) and the receiver mounting the buffer, which previously caused silent query drops when the RX thread was descheduled for longer than the watchdog TTL (~100 ms). The approach mirrors iceoryx2's reference-counted loan model: TX and RX communicate through an atomic field in the shared-memory chunk header itself, requiring no new protocol messages or wire round-trips. `internal_schedule` in the universal transport is split into a wrapper that owns SHM cleanup and an inner `do_push`, so transport references are released whenever a message does not reach the wire (no link, congestion drop). Fixes eclipse-zenoh#2628.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2632 +/- ##
==========================================
+ Coverage 74.11% 74.12% +0.01%
==========================================
Files 400 400
Lines 61030 61207 +177
==========================================
+ Hits 45234 45372 +138
- Misses 15796 15835 +39 ☔ View full report in Codecov by Harness. |
Implement the Gray & Cheriton lease model to close the race between TX
releasing its ShmBufInner and RX acquiring its own ConfirmedDescriptor.
TX clones each in-flight ShmBufInner into a per-connection VecDeque
(shm_pending) after a successful push. The clone holds a ConfirmedDescriptor
that keeps the watchdog kicking. Leases are released on:
1. TTL expiry (SHM_PENDING_TTL = 500 ms, 5x the validator period) —
swept lazily on each insert; bounds memory on long-lived connections.
2. Connection close (transport delete()) — covers peer crash (TCP RST)
and clean shutdown; in-process memory, not shared memory, so a
crashed RX process cannot leave entries stuck.
The pending set lives in process heap (not the POSIX SHM segment), so
a crashed RX process cannot leave it stuck — the fundamental flaw of the
transport_ref_count approach that this commit replaces.
Two unit tests in shm::tests validate the key invariants:
A. A pending clone keeps the chunk alive through multiple validator ticks;
clearing pending lets the validator fire within one period.
B. TTL sweep removes expired front entries (VecDeque front-drain, O(1)).
Two integration tests in unicast_shm.rs validate end-to-end delivery
with the lease mechanism active.
Note: SHM pool must be sized to hold all in-flight messages for the TTL
window. BlockOn<GarbageCollect> blocks when the pool is exhausted because
the lease keeps refcount > 0, preventing GC reclamation. Existing tests
updated to use (MSG_COUNT + 10) * MSG_SIZE pool.
Fixes eclipse-zenoh#2628
The lease model keeps ShmBufInner clones alive in shm_pending for up to SHM_PENDING_TTL (500 ms), preventing GC from reclaiming slots while refcount > 0. BlockOn<GarbageCollect> deadlocks when the pool is smaller than (MSG_COUNT * msg_size). Raise pool from size * MSG_COUNT / 10 to size * (MSG_COUNT + 10).
|
As far as I understand, there is no signalling from RX to TX that RX attached to SHM buffer's watchdog. It means TX just always delays normal buffer reclamation by 500ms, holding buffer's refcount and watchdog. This is very close to the situation as if we changed watchdog period from 100ms to 500ms. |
|
(FYI) WIP on counter-based imlementation |
Add rx_ack: AtomicBool to ChunkHeaderType (adjacent to watchdog_invalidated for stabby layout). RX sets it after GLOBAL_CONFIRMATOR.add() returns; TX sweep drops the pending lease immediately instead of waiting for TTL. Migrate shm_pending from VecDeque<PendingShmBuf> to HashMap<MetadataDescriptor, PendingShmBuf> for O(1) keyed removal. Key: shmb.info.metadata.clone(). Sweep uses retain(!rx_ack && !expired). GC resets rx_ack=false in reclaim() alongside generation increment. Add is_rx_acked() and mark_rx_acked() to ShmBufInner public API. Add unit test for rx_ack early-release invariant (invariant 4).
|
Thanks for both comments, @yellowhatter — your observation was exactly right, and it pointed directly to the fix. The initial version had no RX→TX signal, so TX always held the clone for the full 500 ms regardless of whether RX had mounted the buffer. Your description — "close to changing the watchdog period from 100 ms to 500 ms" — is an accurate characterisation of what that version did. The latest commit (
The Regarding your WIP counter-based implementation: happy to coordinate if it makes sense. The |
|
Any progress there ? |
What's broken
When SHM transport optimization is enabled, a message can be silently dropped if the receiver's RX thread is descheduled for longer than 100 ms after the sender writes the wire frame but before the receiver calls
read_shmbuf. For aPushthis is a best-effort drop; for aRequest(query) it's worse — the querier hangs for the fullqueries_default_timeout(default 600 s) with no error or feedback.Fixes #2628. Also fixes the production symptom reported in ros2/rmw_zenoh#978.
Why it happens
Every SHM chunk is protected by a watchdog: the process holding the chunk sets a bit in the chunk header every 50 ms (the confirmator); the validator (100 ms period) clears that bit and marks the chunk
watchdog_invalidatedif it was not refreshed.The race:
The window is anything from one scheduler preemption to a full GC pause.
What we tried first — and why it was wrong
The first attempt (
0a14a75f) added atransport_ref_count: AtomicU32field to the SHM chunk header. The validator would skip invalidation while the count was above zero.Fatal flaw: the chunk header lives in a POSIX shared memory segment, which survives process crashes. If the receiver crashes while a message is in flight,
transport_ref_countstays above zero forever — the validator never reclaims the chunk, and the slot leaks permanently. This is the same "crashed process prevents reclamation" problem that epoch-based reclamation (EBR) runs into, formally described in Reclaiming Memory for Lock-Free Data Structures (T. Brown, PODC 2015). That approach has been reverted in full.The key constraint: any state used to protect a chunk must live in the survivor's process heap, not in shared memory.
The fix — lease model + rx_ack back-signal
Lease model (TTL + connection-close)
The sender keeps a clone of each in-flight
ShmBufInnerin a per-connectionHashMap<MetadataDescriptor, PendingShmBuf>(shm_pending). BecauseShmBufInnerholds aConfirmedDescriptor, the watchdog keeps being kicked until the clone is released.The clone is released on whichever comes first:
1.
rx_acksignal (typical case, see next section)2. TTL expiry (
SHM_PENDING_TTL = 500 ms, 5× the validator period)Fallback for cases where RX is stalled long enough that the rx_ack hasn't arrived by the time of the next TX sweep. Entries are swept lazily via
HashMap::retainon every insert.3. Connection close (transport
delete())shm_pendingis in process heap — not shared memory. A crashed peer triggers TCP RST → transport teardown on the sender →shm_pending.clear()→ allConfirmedDescriptors drop → validator reclaims within ≤ 100 ms. A crashed receiver cannot leave anything stuck.This is the lease model (Gray & Cheriton, SOSP 1989): grant exclusive use of a resource for a TTL, auto-reclaim on expiry regardless of the holder's state.
rx_ack back-signal (no wire-protocol change)
TX and RX share the SHM segment, so
ChunkHeaderTypeis readable by both without any network round-trip. A singlerx_ack: AtomicBoolfield is added toChunkHeaderType:RX path (
read_shmbuf): afterGLOBAL_CONFIRMATOR.add()returns (RX'sConfirmedDescriptoris installed and kicking), RX storesrx_ack = truewithReleaseordering. The ordering constraint is critical:rx_ackmust be set after the confirmator is active — otherwise TX could drop its clone in the gap before RX's confirmator takes over, leaving the chunk unprotected.TX sweep: on each insert,
pending.retain(|_, v| !v.buf.is_rx_acked() && v.deadline > now). Entries where RX has signalled are removed immediately — no 500 ms wait in the common (no-stall) case. In practice the hold time drops to ~50 ms (one confirmator tick after RX mounts).GC path:
reclaim()resetsrx_ack = falsealongside thegenerationincrement, so recycled chunks start clean for their next use.Crash safety — same guarantees as the base lease model:
rx_ack: RX'sConfirmedDescriptordrops with the process; confirmator stops kicking; validator fires within ≤ 100 ms. GC proceeds normally.rx_ack, then RX crashes: RX held the onlyConfirmedDescriptor; it drops on crash → validator fires → GC. Normal path.Correctness under all failure modes
rx_ack; TX releases lease on next sweep (~50 ms after push) ✓rx_ack; TX releases ✓delete()→ pending cleared → chunk reclaimed ✓HashMapdestroyed → chunk reclaimed ✓Pool sizing
The lease model keeps
refcount > 0while chunks are inshm_pending. With rx_ack the expected hold time in the no-stall case drops from 500 ms to ~50 ms, significantly reducing pool pressure vs the TTL-only approach. Applications usingBlockOn<GarbageCollect>still need a pool large enough for the TTL window as a worst-case bound. Existing SHM tests updated accordingly.Impact on rmw_zenoh
The rmw_zenoh issue is a direct consequence of this bug. Composable container loads issue a burst of service requests at startup under heavy CPU contention. With this fix the chunk is held alive through the scheduling gap, the query reaches the queryable, the service responds, and the component loads. No changes required in rmw_zenoh.
Changes
commons/zenoh-shm/src/header/chunk_header.rsrx_ack: AtomicBooltoChunkHeaderType(adjacent towatchdog_invalidatedfor stabby layout)commons/zenoh-shm/src/lib.rsis_valid(),is_rx_acked(),mark_rx_acked()madepubonShmBufInnercommons/zenoh-shm/src/reader.rsrx_ack = trueafterGLOBAL_CONFIRMATOR.add()returnscommons/zenoh-shm/src/metadata/storage.rsrx_ack = falseinreclaim()alongsidegenerationincrementio/zenoh-transport/src/shm.rsSHM_PENDING_TTL,PendingShmBuf,collect_shm_bufs; unit tests A, B, Cio/zenoh-transport/src/unicast/universal/transport.rsshm_pending: HashMap<MetadataDescriptor, PendingShmBuf>; cleared indelete()io/zenoh-transport/src/unicast/universal/tx.rsretain(!rx_ack && !expired)sweep;BlockFirstpath coveredio/zenoh-transport/src/unicast/lowlatency/transport.rsshm_pendingfield; cleared indelete()io/zenoh-transport/src/unicast/lowlatency/tx.rsio/zenoh-transport/tests/unicast_shm.rstransport_tcp_shm_lease_modelregression testszenoh/tests/shm.rsTesting
Unit tests verify the three core lease invariants directly, without running a full transport:
HashMap.rx_ackearly release: settingrx_ack = trueon a pending entry causesretainto remove it immediately, well before TTL.Integration tests (
transport_tcp_shm_lease_model,_lowlatency) verify end-to-end delivery over TCP with the lease mechanism active, for both universal and low-latency transports. All existing SHM transport tests continue to pass.🏷️ Label-Based Checklist
Based on the labels applied to this PR, please complete these additional requirements:
Labels:
enhancement✨ Enhancement Requirements
Since this PR enhances existing functionality:
Remember: Enhancements should not introduce new APIs or breaking changes.
Instructions:
- [ ]to- [x])This checklist updates automatically when labels change, but preserves your checked boxes.