Skip to content

fix(shm): prevent watchdog invalidation of in-transit SHM chunks#2632

Draft
YuanYuYuan wants to merge 11 commits into
eclipse-zenoh:mainfrom
ZettaScaleLabs:dev/fix-shm-silent-drop
Draft

fix(shm): prevent watchdog invalidation of in-transit SHM chunks#2632
YuanYuYuan wants to merge 11 commits into
eclipse-zenoh:mainfrom
ZettaScaleLabs:dev/fix-shm-silent-drop

Conversation

@YuanYuYuan

@YuanYuYuan YuanYuYuan commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

What's broken

When SHM transport optimization is enabled, a message can be silently dropped if the receiver's RX thread is descheduled for longer than 100 ms after the sender writes the wire frame but before the receiver calls read_shmbuf. For a Push this is a best-effort drop; for a Request (query) it's worse — the querier hangs for the full queries_default_timeout (default 600 s) with no error or feedback.

Fixes #2628. Also fixes the production symptom reported in ros2/rmw_zenoh#978.

Why it happens

Every SHM chunk is protected by a watchdog: the process holding the chunk sets a bit in the chunk header every 50 ms (the confirmator); the validator (100 ms period) clears that bit and marks the chunk watchdog_invalidated if it was not refreshed.

The race:

t=0 ms   TX writes SHM descriptor to wire, drops its ShmBufInner
         └─ ConfirmedDescriptor drops → watchdog bit stops being set

t=50 ms  confirmator tick — no entry for this chunk → bit NOT refreshed

t=100 ms validator tick — bit not set → watchdog_invalidated = true  ← BUG

t=???    RX thread finally runs, calls read_shmbuf
         └─ is_valid() = false → error → message silently dropped

The window is anything from one scheduler preemption to a full GC pause.

What we tried first — and why it was wrong

The first attempt (0a14a75f) added a transport_ref_count: AtomicU32 field to the SHM chunk header. The validator would skip invalidation while the count was above zero.

Fatal flaw: the chunk header lives in a POSIX shared memory segment, which survives process crashes. If the receiver crashes while a message is in flight, transport_ref_count stays above zero forever — the validator never reclaims the chunk, and the slot leaks permanently. This is the same "crashed process prevents reclamation" problem that epoch-based reclamation (EBR) runs into, formally described in Reclaiming Memory for Lock-Free Data Structures (T. Brown, PODC 2015). That approach has been reverted in full.

The key constraint: any state used to protect a chunk must live in the survivor's process heap, not in shared memory.

The fix — lease model + rx_ack back-signal

Lease model (TTL + connection-close)

The sender keeps a clone of each in-flight ShmBufInner in a per-connection HashMap<MetadataDescriptor, PendingShmBuf> (shm_pending). Because ShmBufInner holds a ConfirmedDescriptor, the watchdog keeps being kicked until the clone is released.

The clone is released on whichever comes first:

1. rx_ack signal (typical case, see next section)

2. TTL expiry (SHM_PENDING_TTL = 500 ms, 5× the validator period)
Fallback for cases where RX is stalled long enough that the rx_ack hasn't arrived by the time of the next TX sweep. Entries are swept lazily via HashMap::retain on every insert.

3. Connection close (transport delete())
shm_pending is in process heap — not shared memory. A crashed peer triggers TCP RST → transport teardown on the sender → shm_pending.clear() → all ConfirmedDescriptors drop → validator reclaims within ≤ 100 ms. A crashed receiver cannot leave anything stuck.

This is the lease model (Gray & Cheriton, SOSP 1989): grant exclusive use of a resource for a TTL, auto-reclaim on expiry regardless of the holder's state.

rx_ack back-signal (no wire-protocol change)

TX and RX share the SHM segment, so ChunkHeaderType is readable by both without any network round-trip. A single rx_ack: AtomicBool field is added to ChunkHeaderType:

pub struct ChunkHeaderType {
    pub refcount: AtomicU32,
    pub watchdog_invalidated: AtomicBool,
    pub rx_ack: AtomicBool,   // ← RX signals watchdog handoff complete
    pub generation: AtomicU32,
    pub protocol: AtomicU32,
    // ...
}

RX path (read_shmbuf): after GLOBAL_CONFIRMATOR.add() returns (RX's ConfirmedDescriptor is installed and kicking), RX stores rx_ack = true with Release ordering. The ordering constraint is critical: rx_ack must be set after the confirmator is active — otherwise TX could drop its clone in the gap before RX's confirmator takes over, leaving the chunk unprotected.

TX sweep: on each insert, pending.retain(|_, v| !v.buf.is_rx_acked() && v.deadline > now). Entries where RX has signalled are removed immediately — no 500 ms wait in the common (no-stall) case. In practice the hold time drops to ~50 ms (one confirmator tick after RX mounts).

GC path: reclaim() resets rx_ack = false alongside the generation increment, so recycled chunks start clean for their next use.

Crash safety — same guarantees as the base lease model:

  • RX crash after setting rx_ack: RX's ConfirmedDescriptor drops with the process; confirmator stops kicking; validator fires within ≤ 100 ms. GC proceeds normally.
  • TX drops lease on rx_ack, then RX crashes: RX held the only ConfirmedDescriptor; it drops on crash → validator fires → GC. Normal path.
  • No counter in shared memory → no stuck-counter failure mode.

Correctness under all failure modes

Scenario Outcome
RX mounts promptly (common case) RX sets rx_ack; TX releases lease on next sweep (~50 ms after push) ✓
RX stall < 500 ms Chunk kept alive by pending clone; RX mounts, sets rx_ack; TX releases ✓
RX stall > 500 ms TTL sweeps clone; validator fires; message dropped (same as before fix; rare)
RX crash TCP RST → delete() → pending cleared → chunk reclaimed ✓
TX crash Process heap freed by OS → HashMap destroyed → chunk reclaimed ✓

Pool sizing

The lease model keeps refcount > 0 while chunks are in shm_pending. With rx_ack the expected hold time in the no-stall case drops from 500 ms to ~50 ms, significantly reducing pool pressure vs the TTL-only approach. Applications using BlockOn<GarbageCollect> still need a pool large enough for the TTL window as a worst-case bound. Existing SHM tests updated accordingly.

Impact on rmw_zenoh

The rmw_zenoh issue is a direct consequence of this bug. Composable container loads issue a burst of service requests at startup under heavy CPU contention. With this fix the chunk is held alive through the scheduling gap, the query reaches the queryable, the service responds, and the component loads. No changes required in rmw_zenoh.

Changes

File Change
commons/zenoh-shm/src/header/chunk_header.rs Add rx_ack: AtomicBool to ChunkHeaderType (adjacent to watchdog_invalidated for stabby layout)
commons/zenoh-shm/src/lib.rs is_valid(), is_rx_acked(), mark_rx_acked() made pub on ShmBufInner
commons/zenoh-shm/src/reader.rs Set rx_ack = true after GLOBAL_CONFIRMATOR.add() returns
commons/zenoh-shm/src/metadata/storage.rs Reset rx_ack = false in reclaim() alongside generation increment
io/zenoh-transport/src/shm.rs SHM_PENDING_TTL, PendingShmBuf, collect_shm_bufs; unit tests A, B, C
io/zenoh-transport/src/unicast/universal/transport.rs shm_pending: HashMap<MetadataDescriptor, PendingShmBuf>; cleared in delete()
io/zenoh-transport/src/unicast/universal/tx.rs Collect clones after push; retain(!rx_ack && !expired) sweep; BlockFirst path covered
io/zenoh-transport/src/unicast/lowlatency/transport.rs Same shm_pending field; cleared in delete()
io/zenoh-transport/src/unicast/lowlatency/tx.rs Same collect/sweep pattern
io/zenoh-transport/tests/unicast_shm.rs Pool size fix; transport_tcp_shm_lease_model regression tests
zenoh/tests/shm.rs Pool size fix (same reason)

Testing

Unit tests verify the three core lease invariants directly, without running a full transport:

  • A — a pending clone keeps the chunk alive through multiple validator ticks; clearing the pending set lets the validator fire.
  • B — TTL sweep removes expired entries from the HashMap.
  • Crx_ack early release: setting rx_ack = true on a pending entry causes retain to remove it immediately, well before TTL.

Integration tests (transport_tcp_shm_lease_model, _lowlatency) verify end-to-end delivery over TCP with the lease mechanism active, for both universal and low-latency transports. All existing SHM transport tests continue to pass.

# Unit tests
cargo test -p zenoh-transport -p zenoh --features shared-memory --lib -- shm::tests

# Integration tests (all SHM)
cargo test -p zenoh-transport -p zenoh \
  --features shared-memory,transport_tcp \
  --test unicast_shm

# Clippy (SHM)
cargo clippy -p zenoh-shm -p zenoh-transport \
  --lib \
  --features shared-memory,transport_tcp -- --deny warnings

🏷️ Label-Based Checklist

Based on the labels applied to this PR, please complete these additional requirements:

Labels: enhancement

✨ Enhancement Requirements

Since this PR enhances existing functionality:

  • Enhancement scope documented - Clear description of what is being improved
  • Minimum necessary code - Implementation is as simple as possible, doesn't overcomplicate the system
  • Backwards compatible - Existing code/APIs still work unchanged
  • No new APIs added - Only improving existing functionality
  • Tests updated - Existing tests pass, new test cases added if needed
  • Performance improvement measured - If applicable, before/after metrics provided
  • Documentation updated - Existing docs updated to reflect improvements
  • User impact documented - How users benefit from this enhancement

Remember: Enhancements should not introduce new APIs or breaking changes.

Instructions:

  1. Check off items as you complete them (change - [ ] to - [x])
  2. The PR checklist CI will verify these are completed

This checklist updates automatically when labels change, but preserves your checked boxes.

Add `transport_ref_count: AtomicU32` to `ChunkHeaderType`. The TX path
increments this counter when a chunk is promoted to ShmPtr for transport;
the watchdog validator skips invalidation while the count is above zero;
the RX path decrements it in `read_shmbuf()` after installing its own
`ConfirmedDescriptor`. This closes the race window between the sender
releasing its `ConfirmedDescriptor` (after serialisation) and the receiver
mounting the buffer, which previously caused silent query drops when the
RX thread was descheduled for longer than the watchdog TTL (~100 ms).

The approach mirrors iceoryx2's reference-counted loan model: TX and RX
communicate through an atomic field in the shared-memory chunk header
itself, requiring no new protocol messages or wire round-trips.

`internal_schedule` in the universal transport is split into a wrapper
that owns SHM cleanup and an inner `do_push`, so transport references are
released whenever a message does not reach the wire (no link, congestion
drop).

Fixes eclipse-zenoh#2628.
@YuanYuYuan YuanYuYuan added the enhancement Existing things could work better label Jun 3, 2026
@codecov

codecov Bot commented Jun 3, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 94.44444% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.12%. Comparing base (e03b0c0) to head (c6c3106).
⚠️ Report is 1 commits behind head on main.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
io/zenoh-transport/src/unicast/universal/tx.rs 52.38% 10 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2632      +/-   ##
==========================================
+ Coverage   74.11%   74.12%   +0.01%     
==========================================
  Files         400      400              
  Lines       61030    61207     +177     
==========================================
+ Hits        45234    45372     +138     
- Misses      15796    15835      +39     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

Implement the Gray & Cheriton lease model to close the race between TX
releasing its ShmBufInner and RX acquiring its own ConfirmedDescriptor.

TX clones each in-flight ShmBufInner into a per-connection VecDeque
(shm_pending) after a successful push. The clone holds a ConfirmedDescriptor
that keeps the watchdog kicking. Leases are released on:

  1. TTL expiry (SHM_PENDING_TTL = 500 ms, 5x the validator period) —
     swept lazily on each insert; bounds memory on long-lived connections.
  2. Connection close (transport delete()) — covers peer crash (TCP RST)
     and clean shutdown; in-process memory, not shared memory, so a
     crashed RX process cannot leave entries stuck.

The pending set lives in process heap (not the POSIX SHM segment), so
a crashed RX process cannot leave it stuck — the fundamental flaw of the
transport_ref_count approach that this commit replaces.

Two unit tests in shm::tests validate the key invariants:
  A. A pending clone keeps the chunk alive through multiple validator ticks;
     clearing pending lets the validator fire within one period.
  B. TTL sweep removes expired front entries (VecDeque front-drain, O(1)).

Two integration tests in unicast_shm.rs validate end-to-end delivery
with the lease mechanism active.

Note: SHM pool must be sized to hold all in-flight messages for the TTL
window. BlockOn<GarbageCollect> blocks when the pool is exhausted because
the lease keeps refcount > 0, preventing GC reclamation. Existing tests
updated to use (MSG_COUNT + 10) * MSG_SIZE pool.

Fixes eclipse-zenoh#2628
The lease model keeps ShmBufInner clones alive in shm_pending for up
to SHM_PENDING_TTL (500 ms), preventing GC from reclaiming slots
while refcount > 0. BlockOn<GarbageCollect> deadlocks when the pool
is smaller than (MSG_COUNT * msg_size).

Raise pool from size * MSG_COUNT / 10 to size * (MSG_COUNT + 10).
@yellowhatter

yellowhatter commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

As far as I understand, there is no signalling from RX to TX that RX attached to SHM buffer's watchdog. It means TX just always delays normal buffer reclamation by 500ms, holding buffer's refcount and watchdog. This is very close to the situation as if we changed watchdog period from 100ms to 500ms.

@yellowhatter

Copy link
Copy Markdown
Contributor

(FYI) WIP on counter-based imlementation

Add rx_ack: AtomicBool to ChunkHeaderType (adjacent to
watchdog_invalidated for stabby layout). RX sets it after
GLOBAL_CONFIRMATOR.add() returns; TX sweep drops the pending lease
immediately instead of waiting for TTL.

Migrate shm_pending from VecDeque<PendingShmBuf> to
HashMap<MetadataDescriptor, PendingShmBuf> for O(1) keyed removal.
Key: shmb.info.metadata.clone(). Sweep uses retain(!rx_ack && !expired).
GC resets rx_ack=false in reclaim() alongside generation increment.

Add is_rx_acked() and mark_rx_acked() to ShmBufInner public API.
Add unit test for rx_ack early-release invariant (invariant 4).
@YuanYuYuan

Copy link
Copy Markdown
Contributor Author

Thanks for both comments, @yellowhatter — your observation was exactly right, and it pointed directly to the fix.

The initial version had no RX→TX signal, so TX always held the clone for the full 500 ms regardless of whether RX had mounted the buffer. Your description — "close to changing the watchdog period from 100 ms to 500 ms" — is an accurate characterisation of what that version did.

The latest commit (20a18b65) adds the back-signal via rx_ack: AtomicBool in ChunkHeaderType. Since TX and RX share the SHM segment, no wire-protocol message is needed — the ack goes through the same header both sides already map:

  • RX sets rx_ack = true (Release) in read_shmbuf, immediately after GLOBAL_CONFIRMATOR.add() returns. The ordering is critical: the ConfirmedDescriptor must be installed and kicking before the ack is visible to TX, so TX can safely drop its clone without leaving the chunk unprotected.
  • TX sweep uses pending.retain(|_, v| !v.buf.is_rx_acked() && v.deadline > now) on every insert. Acked entries are removed immediately — in the no-stall case the lease hold time drops from 500 ms to ~50 ms (one confirmator tick after RX mounts).
  • GC resets rx_ack = false in reclaim() alongside the generation increment.
  • The 500 ms TTL and connection-close paths remain as fallbacks, covering the stall case and crash recovery respectively.

The VecDeque has been replaced with HashMap<MetadataDescriptor, PendingShmBuf> (key: shmb.info.metadata), which makes early keyed removal O(1) and the TTL sweep a simple retain.

Regarding your WIP counter-based implementation: happy to coordinate if it makes sense. The rx_ack approach changes the SHM ABI (one AtomicBool in ChunkHeaderType) but not the wire protocol. If your design also modifies ChunkHeaderType we should make sure the two don't conflict — let me know what you have in mind.

@YuanYuYuan YuanYuYuan marked this pull request as draft June 9, 2026 08:08
@doisyg

doisyg commented Jun 10, 2026

Copy link
Copy Markdown

Any progress there ?
How can we help ? Testing this PR with a real world setup ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement Existing things could work better

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Query requests are silently dropped when using shared memory when receiver's RX thread stalls

3 participants