fix(shm): prevent watchdog invalidation of in-transit SHM chunks by YuanYuYuan · Pull Request #2632 · eclipse-zenoh/zenoh

YuanYuYuan · 2026-06-03T10:48:55Z

What's broken

When SHM transport optimization is enabled, a message can be silently dropped if the receiver's RX thread is descheduled for longer than 100 ms after the sender writes the wire frame but before the receiver calls read_shmbuf. For a Push this is a best-effort drop; for a Request (query) it's worse — the querier hangs for the full queries_default_timeout (default 600 s) with no error or feedback.

Fixes #2628. Also fixes the production symptom reported in ros2/rmw_zenoh#978.

Why it happens

Every SHM chunk is protected by a watchdog: the process holding the chunk sets a bit in the chunk header every 50 ms (the confirmator); the validator (100 ms period) clears that bit and marks the chunk watchdog_invalidated if it was not refreshed.

The race:

t=0 ms   TX writes SHM descriptor to wire, drops its ShmBufInner
         └─ ConfirmedDescriptor drops → watchdog bit stops being set

t=50 ms  confirmator tick — no entry for this chunk → bit NOT refreshed

t=100 ms validator tick — bit not set → watchdog_invalidated = true  ← BUG

t=???    RX thread finally runs, calls read_shmbuf
         └─ is_valid() = false → error → message silently dropped

The window is anything from one scheduler preemption to a full GC pause.

What we tried first — and why it was wrong

The first attempt (0a14a75f) added a transport_ref_count: AtomicU32 field to the SHM chunk header. The validator would skip invalidation while the count was above zero.

Fatal flaw: the chunk header lives in a POSIX shared memory segment, which survives process crashes. If the receiver crashes while a message is in flight, transport_ref_count stays above zero forever — the validator never reclaims the chunk, and the slot leaks permanently. This is the same "crashed process prevents reclamation" problem that epoch-based reclamation (EBR) runs into, formally described in Reclaiming Memory for Lock-Free Data Structures (T. Brown, PODC 2015). That approach has been reverted in full.

The key constraint: any state used to protect a chunk must live in the survivor's process heap, not in shared memory.

The fix — lease model + rx_ack back-signal

Lease model (TTL + connection-close)

The sender keeps a clone of each in-flight ShmBufInner in a per-connection HashMap<MetadataDescriptor, PendingShmBuf> (shm_pending). Because ShmBufInner holds a ConfirmedDescriptor, the watchdog keeps being kicked until the clone is released.

The clone is released on whichever comes first:

1. rx_ack signal (typical case, see next section)

2. TTL expiry (SHM_PENDING_TTL = 500 ms, 5× the validator period)
Fallback for cases where RX is stalled long enough that the rx_ack hasn't arrived by the time of the next TX sweep. Entries are swept lazily via HashMap::retain on every insert.

3. Connection close (transport delete())
shm_pending is in process heap — not shared memory. A crashed peer triggers TCP RST → transport teardown on the sender → shm_pending.clear() → all ConfirmedDescriptors drop → validator reclaims within ≤ 100 ms. A crashed receiver cannot leave anything stuck.

This is the lease model (Gray & Cheriton, SOSP 1989): grant exclusive use of a resource for a TTL, auto-reclaim on expiry regardless of the holder's state.

rx_ack back-signal (no wire-protocol change)

TX and RX share the SHM segment, so ChunkHeaderType is readable by both without any network round-trip. A single rx_ack: AtomicBool field is added to ChunkHeaderType:

pub struct ChunkHeaderType {
    pub refcount: AtomicU32,
    pub watchdog_invalidated: AtomicBool,
    pub rx_ack: AtomicBool,   // ← RX signals watchdog handoff complete
    pub generation: AtomicU32,
    pub protocol: AtomicU32,
    // ...
}

RX path (read_shmbuf): after GLOBAL_CONFIRMATOR.add() returns (RX's ConfirmedDescriptor is installed and kicking), RX stores rx_ack = true with Release ordering. The ordering constraint is critical: rx_ack must be set after the confirmator is active — otherwise TX could drop its clone in the gap before RX's confirmator takes over, leaving the chunk unprotected.

TX sweep: on each insert, pending.retain(|_, v| !v.buf.is_rx_acked() && v.deadline > now). Entries where RX has signalled are removed immediately — no 500 ms wait in the common (no-stall) case. In practice the hold time drops to ~50 ms (one confirmator tick after RX mounts).

GC path: reclaim() resets rx_ack = false alongside the generation increment, so recycled chunks start clean for their next use.

Crash safety — same guarantees as the base lease model:

RX crash after setting rx_ack: RX's ConfirmedDescriptor drops with the process; confirmator stops kicking; validator fires within ≤ 100 ms. GC proceeds normally.
TX drops lease on rx_ack, then RX crashes: RX held the only ConfirmedDescriptor; it drops on crash → validator fires → GC. Normal path.
No counter in shared memory → no stuck-counter failure mode.

Correctness under all failure modes

Scenario	Outcome
RX mounts promptly (common case)	RX sets `rx_ack`; TX releases lease on next sweep (~50 ms after push) ✓
RX stall < 500 ms	Chunk kept alive by pending clone; RX mounts, sets `rx_ack`; TX releases ✓
RX stall > 500 ms	TTL sweeps clone; validator fires; message dropped (same as before fix; rare)
RX crash	TCP RST → `delete()` → pending cleared → chunk reclaimed ✓
TX crash	Process heap freed by OS → `HashMap` destroyed → chunk reclaimed ✓

Pool sizing

The lease model keeps refcount > 0 while chunks are in shm_pending. With rx_ack the expected hold time in the no-stall case drops from 500 ms to ~50 ms, significantly reducing pool pressure vs the TTL-only approach. Applications using BlockOn<GarbageCollect> still need a pool large enough for the TTL window as a worst-case bound. Existing SHM tests updated accordingly.

Impact on rmw_zenoh

The rmw_zenoh issue is a direct consequence of this bug. Composable container loads issue a burst of service requests at startup under heavy CPU contention. With this fix the chunk is held alive through the scheduling gap, the query reaches the queryable, the service responds, and the component loads. No changes required in rmw_zenoh.

Changes

File	Change
`commons/zenoh-shm/src/header/chunk_header.rs`	Add `rx_ack: AtomicBool` to `ChunkHeaderType` (adjacent to `watchdog_invalidated` for stabby layout)
`commons/zenoh-shm/src/lib.rs`	`is_valid()`, `is_rx_acked()`, `mark_rx_acked()` made `pub` on `ShmBufInner`
`commons/zenoh-shm/src/reader.rs`	Set `rx_ack = true` after `GLOBAL_CONFIRMATOR.add()` returns
`commons/zenoh-shm/src/metadata/storage.rs`	Reset `rx_ack = false` in `reclaim()` alongside `generation` increment
`io/zenoh-transport/src/shm.rs`	`SHM_PENDING_TTL`, `PendingShmBuf`, `collect_shm_bufs`; unit tests A, B, C
`io/zenoh-transport/src/unicast/universal/transport.rs`	`shm_pending: HashMap<MetadataDescriptor, PendingShmBuf>`; cleared in `delete()`
`io/zenoh-transport/src/unicast/universal/tx.rs`	Collect clones after push; `retain(!rx_ack && !expired)` sweep; `BlockFirst` path covered
`io/zenoh-transport/src/unicast/lowlatency/transport.rs`	Same `shm_pending` field; cleared in `delete()`
`io/zenoh-transport/src/unicast/lowlatency/tx.rs`	Same collect/sweep pattern
`io/zenoh-transport/tests/unicast_shm.rs`	Pool size fix; `transport_tcp_shm_lease_model` regression tests
`zenoh/tests/shm.rs`	Pool size fix (same reason)

Testing

Unit tests verify the three core lease invariants directly, without running a full transport:

A — a pending clone keeps the chunk alive through multiple validator ticks; clearing the pending set lets the validator fire.
B — TTL sweep removes expired entries from the HashMap.
C — rx_ack early release: setting rx_ack = true on a pending entry causes retain to remove it immediately, well before TTL.

Integration tests (transport_tcp_shm_lease_model, _lowlatency) verify end-to-end delivery over TCP with the lease mechanism active, for both universal and low-latency transports. All existing SHM transport tests continue to pass.

# Unit tests
cargo test -p zenoh-transport -p zenoh --features shared-memory --lib -- shm::tests

# Integration tests (all SHM)
cargo test -p zenoh-transport -p zenoh \
  --features shared-memory,transport_tcp \
  --test unicast_shm

# Clippy (SHM)
cargo clippy -p zenoh-shm -p zenoh-transport \
  --lib \
  --features shared-memory,transport_tcp -- --deny warnings

🏷️ Label-Based Checklist

Based on the labels applied to this PR, please complete these additional requirements:

Labels: enhancement

✨ Enhancement Requirements

Since this PR enhances existing functionality:

Enhancement scope documented - Clear description of what is being improved
Minimum necessary code - Implementation is as simple as possible, doesn't overcomplicate the system
Backwards compatible - Existing code/APIs still work unchanged
No new APIs added - Only improving existing functionality
Tests updated - Existing tests pass, new test cases added if needed
Performance improvement measured - If applicable, before/after metrics provided
Documentation updated - Existing docs updated to reflect improvements
User impact documented - How users benefit from this enhancement

Remember: Enhancements should not introduce new APIs or breaking changes.

Instructions:

Check off items as you complete them (change - [ ] to - [x])
The PR checklist CI will verify these are completed

This checklist updates automatically when labels change, but preserves your checked boxes.

Add `transport_ref_count: AtomicU32` to `ChunkHeaderType`. The TX path increments this counter when a chunk is promoted to ShmPtr for transport; the watchdog validator skips invalidation while the count is above zero; the RX path decrements it in `read_shmbuf()` after installing its own `ConfirmedDescriptor`. This closes the race window between the sender releasing its `ConfirmedDescriptor` (after serialisation) and the receiver mounting the buffer, which previously caused silent query drops when the RX thread was descheduled for longer than the watchdog TTL (~100 ms). The approach mirrors iceoryx2's reference-counted loan model: TX and RX communicate through an atomic field in the shared-memory chunk header itself, requiring no new protocol messages or wire round-trips. `internal_schedule` in the universal transport is split into a wrapper that owns SHM cleanup and an inner `do_push`, so transport references are released whenever a message does not reach the wire (no link, congestion drop). Fixes eclipse-zenoh#2628.

codecov · 2026-06-03T11:25:10Z

Codecov Report

❌ Patch coverage is 94.44444% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.12%. Comparing base (e03b0c0) to head (c6c3106).
⚠️ Report is 1 commits behind head on main.
✅ All tests successful. No failed tests found.

Files with missing lines	Patch %	Lines
io/zenoh-transport/src/unicast/universal/tx.rs	52.38%	10 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2632      +/-   ##
==========================================
+ Coverage   74.11%   74.12%   +0.01%     
==========================================
  Files         400      400              
  Lines       61030    61207     +177     
==========================================
+ Hits        45234    45372     +138     
- Misses      15796    15835      +39

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

Implement the Gray & Cheriton lease model to close the race between TX releasing its ShmBufInner and RX acquiring its own ConfirmedDescriptor. TX clones each in-flight ShmBufInner into a per-connection VecDeque (shm_pending) after a successful push. The clone holds a ConfirmedDescriptor that keeps the watchdog kicking. Leases are released on: 1. TTL expiry (SHM_PENDING_TTL = 500 ms, 5x the validator period) — swept lazily on each insert; bounds memory on long-lived connections. 2. Connection close (transport delete()) — covers peer crash (TCP RST) and clean shutdown; in-process memory, not shared memory, so a crashed RX process cannot leave entries stuck. The pending set lives in process heap (not the POSIX SHM segment), so a crashed RX process cannot leave it stuck — the fundamental flaw of the transport_ref_count approach that this commit replaces. Two unit tests in shm::tests validate the key invariants: A. A pending clone keeps the chunk alive through multiple validator ticks; clearing pending lets the validator fire within one period. B. TTL sweep removes expired front entries (VecDeque front-drain, O(1)). Two integration tests in unicast_shm.rs validate end-to-end delivery with the lease mechanism active. Note: SHM pool must be sized to hold all in-flight messages for the TTL window. BlockOn<GarbageCollect> blocks when the pool is exhausted because the lease keeps refcount > 0, preventing GC reclamation. Existing tests updated to use (MSG_COUNT + 10) * MSG_SIZE pool. Fixes eclipse-zenoh#2628

The lease model keeps ShmBufInner clones alive in shm_pending for up to SHM_PENDING_TTL (500 ms), preventing GC from reclaiming slots while refcount > 0. BlockOn<GarbageCollect> deadlocks when the pool is smaller than (MSG_COUNT * msg_size). Raise pool from size * MSG_COUNT / 10 to size * (MSG_COUNT + 10).

yellowhatter · 2026-06-04T08:47:40Z

As far as I understand, there is no signalling from RX to TX that RX attached to SHM buffer's watchdog. It means TX just always delays normal buffer reclamation by 500ms, holding buffer's refcount and watchdog. This is very close to the situation as if we changed watchdog period from 100ms to 500ms.

yellowhatter · 2026-06-04T16:51:30Z

(FYI) WIP on counter-based imlementation

Add rx_ack: AtomicBool to ChunkHeaderType (adjacent to watchdog_invalidated for stabby layout). RX sets it after GLOBAL_CONFIRMATOR.add() returns; TX sweep drops the pending lease immediately instead of waiting for TTL. Migrate shm_pending from VecDeque<PendingShmBuf> to HashMap<MetadataDescriptor, PendingShmBuf> for O(1) keyed removal. Key: shmb.info.metadata.clone(). Sweep uses retain(!rx_ack && !expired). GC resets rx_ack=false in reclaim() alongside generation increment. Add is_rx_acked() and mark_rx_acked() to ShmBufInner public API. Add unit test for rx_ack early-release invariant (invariant 4).

YuanYuYuan · 2026-06-05T05:57:33Z

Thanks for both comments, @yellowhatter — your observation was exactly right, and it pointed directly to the fix.

The initial version had no RX→TX signal, so TX always held the clone for the full 500 ms regardless of whether RX had mounted the buffer. Your description — "close to changing the watchdog period from 100 ms to 500 ms" — is an accurate characterisation of what that version did.

The latest commit (20a18b65) adds the back-signal via rx_ack: AtomicBool in ChunkHeaderType. Since TX and RX share the SHM segment, no wire-protocol message is needed — the ack goes through the same header both sides already map:

RX sets rx_ack = true (Release) in read_shmbuf, immediately after GLOBAL_CONFIRMATOR.add() returns. The ordering is critical: the ConfirmedDescriptor must be installed and kicking before the ack is visible to TX, so TX can safely drop its clone without leaving the chunk unprotected.
TX sweep uses pending.retain(|_, v| !v.buf.is_rx_acked() && v.deadline > now) on every insert. Acked entries are removed immediately — in the no-stall case the lease hold time drops from 500 ms to ~50 ms (one confirmator tick after RX mounts).
GC resets rx_ack = false in reclaim() alongside the generation increment.
The 500 ms TTL and connection-close paths remain as fallbacks, covering the stall case and crash recovery respectively.

The VecDeque has been replaced with HashMap<MetadataDescriptor, PendingShmBuf> (key: shmb.info.metadata), which makes early keyed removal O(1) and the TTL sweep a simple retain.

Regarding your WIP counter-based implementation: happy to coordinate if it makes sense. The rx_ack approach changes the SHM ABI (one AtomicBool in ChunkHeaderType) but not the wire protocol. If your design also modifies ChunkHeaderType we should make sure the two don't conflict — let me know what you have in mind.

doisyg · 2026-06-10T10:07:35Z

Any progress there ?
How can we help ? Testing this PR with a real world setup ?

YuanYuYuan added the enhancement Existing things could work better label Jun 3, 2026

style: apply rustfmt to test files

4bd223b

YuanYuYuan added 3 commits June 3, 2026 17:40

revert: remove transport_ref_count — replace with lease model

37da949

YuanYuYuan added 5 commits June 5, 2026 05:59

style: apply rustfmt to rx_ack back-signal changes

4e25bc7

style: fix rustfmt in zenoh/src/api/builders/scouting.rs

090bcb3

style: fix import group ordering for nightly rustfmt CI

ea3ebfa

style: apply nightly rustfmt with CI import grouping config

efb170d

style: fix import order in scouting.rs per nightly rustfmt

c6c3106

YuanYuYuan marked this pull request as draft June 9, 2026 08:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(shm): prevent watchdog invalidation of in-transit SHM chunks#2632

fix(shm): prevent watchdog invalidation of in-transit SHM chunks#2632
YuanYuYuan wants to merge 11 commits into
eclipse-zenoh:mainfrom
ZettaScaleLabs:dev/fix-shm-silent-drop

YuanYuYuan commented Jun 3, 2026 •

edited

Loading

Uh oh!

codecov Bot commented Jun 3, 2026 •

edited

Loading

Uh oh!

yellowhatter commented Jun 4, 2026 •

edited

Loading

Uh oh!

yellowhatter commented Jun 4, 2026

Uh oh!

YuanYuYuan commented Jun 5, 2026

Uh oh!

doisyg commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

YuanYuYuan commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What's broken

Why it happens

What we tried first — and why it was wrong

The fix — lease model + rx_ack back-signal

Lease model (TTL + connection-close)

rx_ack back-signal (no wire-protocol change)

Correctness under all failure modes

Pool sizing

Impact on rmw_zenoh

Changes

Testing

🏷️ Label-Based Checklist

✨ Enhancement Requirements

Uh oh!

codecov Bot commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

yellowhatter commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

yellowhatter commented Jun 4, 2026

Uh oh!

YuanYuYuan commented Jun 5, 2026

Uh oh!

doisyg commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

YuanYuYuan commented Jun 3, 2026 •

edited

Loading

codecov Bot commented Jun 3, 2026 •

edited

Loading

yellowhatter commented Jun 4, 2026 •

edited

Loading