Skip to content

Fix re-entrant routing-tables RwLock deadlock in admin-space handle#2619

Open
lindskogen wants to merge 1 commit into
eclipse-zenoh:mainfrom
lindskogen:bugfix/routing-deadlock
Open

Fix re-entrant routing-tables RwLock deadlock in admin-space handle#2619
lindskogen wants to merge 1 commit into
eclipse-zenoh:mainfrom
lindskogen:bugfix/routing-deadlock

Conversation

@lindskogen

@lindskogen lindskogen commented May 25, 2026

Copy link
Copy Markdown

The admin-space handlers resources_data, linkstate_data, and route_successor hold zread!(tables.tables) across query.reply(...).wait(). .wait() routes the response synchronously via Face::send_responseroute_send_response, which re-acquires the same zread!(tables.tables) on the same thread.

std::sync::RwLock is writer-preferring: once a register_expr writer (any peer declare) queues between the two reads, the re-entrant second read blocks behind it, the writer blocks behind the first read, and the entire routing layer wedges. The transport acceptor task is independent and keeps running, so the node still accepts connections and looks alive while routing is dead.

The pattern has existed since at least 1.7.0 and is present on main.

Root cause

Three admin handlers in zenoh/src/net/runtime/adminspace.rs reply while holding the routing-tables read guard:

Handler Lock acquisition Reply call under lock
resources_data let rtables = zread!(tables.tables); query.reply(...).wait() inside the loop
linkstate_data let rtables = zread!(tables.tables); query.reply(...).wait() inside the loop
route_successor let rtables = zread!(tables.tables); reply(...) (calls query.reply().wait()) under the guard before the explicit drop(rtables)

query.reply(...).wait() ultimately calls
zenoh/src/net/routing/dispatcher/queries.rs::route_send_response:

pub fn route_send_response(..., tables_ref: &Arc<TablesLock>, ...) {
    let tables = zread!(tables_ref.tables);   // <-- same RwLock, second read
    ...
}

Concurrently, zenoh/src/net/routing/dispatcher/resource.rs::register_expr takes zwrite!(tables.tables) on every declare. Sequence that wedges:

  1. Thread A enters resources_data and takes read no 1.
  2. Thread B issues register_expr and queues a write.
  3. Thread A calls query.reply(...).wait() which re-enters route_send_response and tries to take read no 2.
  4. Writer-preferring RwLock blocks read no 2 behind the queued write.
  5. The write blocks behind read no 1, which Thread A still holds.
  6. Permanent deadlock; all subsequent routing piles up.
A live 8-thread `gdb thread apply all bt` showing all four states:
gdb -p <host-pid-of-zenohd> -batch -ex 'thread apply all bt'
Zenoh release/1.9.0 @ 81c6c933b6e41d72a05f04c4442ef57717ddc72b — zenohd, aarch64 Linux
Captured 2026-05-22 while the routing loop was deadlocked (frozen since 11:11:48).

Note: gdb run from the container HOST (the container itself blocks ptrace by default).
"Target and debugger are in different PID namespaces" warning is expected and harmless.

================================================================================
DEADLOCKED ROUTING THREADS — all contended on the same RwLock (router tables.tables)
================================================================================

Thread 8 (LWP 262662 "rx-30")  —  RE-ENTRANT READER (holds read #1, blocked on read #2)
#1  std::sys::sync::rwlock::futex::RwLock::read_contended ()        <-- blocked acquiring READ
#2  zenoh::net::routing::dispatcher::queries::route_send_response ()   queries.rs:538  zread!(tables.tables)
#3  <...Face as ...Primitives>::send_response ()
#4  zenoh::api::queryable::Query::_reply_sample ()
#5  <...ReplyBuilder<...> as zenoh_core::Wait>::wait ()
#6  zenoh::net::runtime::adminspace::resources_data ()               adminspace.rs:740 holds zread!(tables.tables)
#7  core::ops::function::Fn::call ()
#8  <...AdminSpace as ...Primitives>::send_request ()
#9  <...AdminSpace as ...EPrimitives>::send_request ()
#10 <...Face as ...Primitives>::send_request ()
#11 <...DeMux as ...TransportPeerEventHandler>::handle_message ()
#12 ...universal::rx::...::trigger_callback ()
#13 ...universal::rx::...::read_messages ()
   --> incoming request -> admin-space resources_data takes zread!(tables) [HELD]
       -> query.reply().wait() -> route_send_response takes zread!(tables) AGAIN [BLOCKED]

Thread 7 (LWP 262613 "rx-28")  —  WRITER (the trigger)
#1  std::sys::sync::rwlock::futex::RwLock::write_contended ()        <-- blocked acquiring WRITE
#2  zenoh::net::routing::dispatcher::resource::register_expr ()      resource.rs zwrite!(tables.tables)
#3  <...Face as ...Primitives>::send_declare ()
#4  <...DeMux as ...TransportPeerEventHandler>::handle_message ()
#5  ...universal::rx::...::trigger_callback ()
#6  ...universal::rx::...::read_messages ()
   --> incoming declare -> register_expr wants zwrite!(tables); queued behind readers,
       and (writer-priority) blocks all further readers including Thread 8's read #2.

Thread 5 (LWP 262549 "rx-26")  —  reader piled up behind the queued writer
#1  std::sys::sync::rwlock::futex::RwLock::read_contended ()
#2  zenoh::net::routing::dispatcher::queries::route_send_response ()
#3  <...Face as ...Primitives>::send_response ()
#4  <...DeMux as ...TransportPeerEventHandler>::handle_message ()
   (... rx_task / tokio worker ...)

Thread 6 (LWP 262550 "net-14")  —  reader piled up behind the queued writer
#1  std::sys::sync::rwlock::futex::RwLock::read_contended ()
#2  zenoh::net::routing::dispatcher::queries::route_send_response ()
#3  <...QueryCleanup as zenoh_util::timer::Timed>::run::{{closure}} ()
   (... tokio worker ...)

================================================================================
IDLE THREADS — unaffected (this is why the node still "looks up")
================================================================================

Thread 1 (LWP 255715 "zenohd")  — main thread, parked
#1  std::thread::functions::park ()
#2  zenohd::main ()

Thread 2 (LWP 255717 "app-0")   — tokio worker, idle in epoll_pwait
Thread 3 (LWP 255721 "acc-0")   — transport ACCEPTOR, idle in epoll_pwait
                                  (independent of routing — keeps accepting
                                   QUIC/TLS connections while routing is dead)
Thread 4 (LWP 255722 "tx-0")    — tokio worker, idle in epoll_pwait

================================================================================
DEADLOCK CYCLE
================================================================================

  Thread 8  holds  read #1   (adminspace.rs:740)
            wants  read #2   (queries.rs:538)   --> blocked: writer is queued
  Thread 7  wants  write     (resource.rs register_expr) --> blocked: read #1 still held
  Threads 5,6  want read     --> blocked behind the queued writer

  std::sync::RwLock is writer-preferring: once Thread 7's write is queued, no new
  reader may proceed. Thread 8's re-entrant read #2 is a "new reader" -> blocked
  -> read #1 never released -> Thread 7 never writes -> total routing deadlock.

Secondary bug fixed in the same change

resources_data had keyexpr::new(res.0.expr()).unwrap(). Under declare churn a resource can be caught transiently with an expr that is not a valid keyexpr (e.g. empty), causing a panic. zenohd's release profile is panic = "abort", so this took the entire process down once the primary deadlock was patched away. Changed to .ok()? (skip the resource).

Suggested fix

Collect the data that needs the lock, drop the guard, then reply:

let to_reply: Vec<_> = {
    let rtables = zread!(tables.tables);
    f(&rtables)
        .into_iter()
        .filter_map(|res| {
            let key = prefix / keyexpr::new(res.0.expr()).ok()?;
            query.key_expr().intersects(&key).then(|| {
                let payload = ZBytes::from(
                    serde_json::to_string(&res.1).unwrap_or_else(|_| "{}".to_string()),
                );
                (key, payload)
            })
        })
        .collect()
};
for (key, payload) in to_reply {
    if let Err(e) = query.reply(key, payload).encoding(Encoding::APPLICATION_JSON).wait() {
        tracing::error!("Error sending AdminSpace reply: {:?}", e);
    }
}

Applied identically in linkstate_data and route_successor (the latter also folds both the shortcut and full-successor paths into one collect-then-drop block — stock already had a drop(rtables) before the full-successor reply, but the shortcut reply() calls above it were still under the guard).

154-line patch, no behavior change beyond the lock-hold window. Replies still go out in the same order; the only externally visible difference is that responses are computed before they are streamed, which for these handlers (small admin queries) is negligible.

Reproduction

Python repro code
"""Local plaintext churn repro for the Zenoh routing-tables re-entrant deadlock.

  - admin-query churn:  `get @/**`  -> resources_data holds zread!(tables.tables)
                                       across query.reply().wait() -> route_send_response re-reads.
  - declare churn:      declare/undeclare subscribers -> register_expr zwrite!(tables.tables).

Usage:  churn.py <label>   (label is just printed, e.g. "patched" / "unpatched")
A health probe (`get @/*/router`) detects a wedge: 3 consecutive failures = deadlocked.
"""
import sys, time, threading
import zenoh

ENDPOINT = 'tcp/127.0.0.1:7447'
LABEL = sys.argv[1] if len(sys.argv) > 1 else 'run'
ADMIN_CHURN_THREADS   = 12
DECLARE_CHURN_THREADS = 12
RUN_SECONDS = 300
PROBE_EVERY = 8

stop = threading.Event()
stats = {'admin_gets': 0, 'declares': 0, 'errors': 0}
slock = threading.Lock()

def mk(mode):
    cfg = zenoh.Config()
    cfg.insert_json5('mode', '"%s"' % mode)
    cfg.insert_json5('connect/endpoints', '["%s"]' % ENDPOINT)
    cfg.insert_json5('scouting/multicast/enabled', 'false')
    return zenoh.open(cfg)

def admin_churn():
    s = None
    while not stop.is_set():
        try:
            if s is None:
                s = mk('client')
            for _ in s.get('@/**'):
                pass
            with slock: stats['admin_gets'] += 1
        except Exception:
            with slock: stats['errors'] += 1
            try: s and s.close()
            except Exception: pass
            s = None
            time.sleep(0.2)
    try: s and s.close()
    except Exception: pass

def declare_churn():
    # Reuse one session per thread (re-open only on error). Declaring/undeclaring
    # subscribers drives register_expr -> zwrite!(tables.tables) without the
    # client-side resource leak of churning whole sessions.
    s = None
    while not stop.is_set():
        try:
            if s is None:
                s = mk('peer')
            subs = [s.declare_subscriber('churn/%d/%d' % (id(s) % 9999, i),
                                         lambda sample: None)
                    for i in range(15)]
            with slock: stats['declares'] += len(subs)
            time.sleep(0.02)
            for x in subs:
                try: x.undeclare()
                except Exception: pass
        except Exception:
            with slock: stats['errors'] += 1
            try: s and s.close()
            except Exception: pass
            s = None
            time.sleep(0.2)
    try: s and s.close()
    except Exception: pass

def probe():
    t0 = time.time()
    s = None
    try:
        s = mk('client')
        n = sum(1 for _ in s.get('@/*/router', timeout=5.0))
        return (n > 0, 'routers=%d in %.1fs' % (n, time.time() - t0))
    except Exception as e:
        return (False, '%s: %s' % (type(e).__name__, str(e)[:80]))
    finally:
        try: s and s.close()
        except Exception: pass

def main():
    print('[%s] endpoint=%s admin=%d declare=%d cap=%ds'
          % (LABEL, ENDPOINT, ADMIN_CHURN_THREADS, DECLARE_CHURN_THREADS, RUN_SECONDS), flush=True)
    ok, detail = probe()
    print('[%s] baseline probe: %s (%s)' % (LABEL, 'OK' if ok else 'FAIL', detail), flush=True)
    if not ok:
        print('[%s] relay not healthy at start - aborting' % LABEL, flush=True)
        return 2

    threads = ([threading.Thread(target=admin_churn,   daemon=True) for _ in range(ADMIN_CHURN_THREADS)] +
               [threading.Thread(target=declare_churn, daemon=True) for _ in range(DECLARE_CHURN_THREADS)])
    for t in threads: t.start()

    start = time.time(); bad = 0; reproduced = False
    while time.time() - start < RUN_SECONDS:
        time.sleep(PROBE_EVERY)
        el = int(time.time() - start)
        ok, detail = probe()
        with slock:
            ag, dc, er = stats['admin_gets'], stats['declares'], stats['errors']
        print('[%s t+%4ds] probe %-4s (%s) | admin_gets=%d declares=%d errs=%d'
              % (LABEL, el, 'OK' if ok else 'BAD', detail, ag, dc, er), flush=True)
        if not ok:
            bad += 1
            if bad >= 3:
                print('[%s] *** DEADLOCK REPRODUCED at t+%ds ***' % (LABEL, el), flush=True)
                reproduced = True
                break
        else:
            bad = 0
    if not reproduced:
        print('[%s] survived full %ds under churn — no deadlock' % (LABEL, RUN_SECONDS), flush=True)
    stop.set(); time.sleep(1)
    return 1 if reproduced else 0

if __name__ == '__main__':
    sys.exit(main())
  • 12 threads looping get @/** (admin-query side → resources_data)
  • 12 threads declaring / undeclaring 15 subscribers in a loop (declare side → register_expr)
  • Health probe get @/*/router every 8s; three consecutive failures = wedged.

Run against a stock zenohd with no config file, no extra features, only -l tcp/127.0.0.1:7447 so the harness can reach it. Effective config is zenohd's built-in default: mode=router, gossip on, multicast on, no gateway, no region routing, default features (no transport_quic, no shared-memory, no stats, no unstable).

[stock-1.9.0-defaultcfg] baseline probe: OK (routers=1 in 0.0s)
[stock-1.9.0-defaultcfg t+   8s] probe BAD | admin_gets=52  declares=108795 errs=0
[stock-1.9.0-defaultcfg t+  26s] probe BAD | admin_gets=76  declares=218040 errs=0
[stock-1.9.0-defaultcfg t+  44s] probe BAD | admin_gets=100 declares=329385 errs=0
[stock-1.9.0-defaultcfg] *** DEADLOCK REPRODUCED at t+44s ***

Two independent runs (with and without the bundled repro config that disables multicast) wedged at exactly t+44s, after ~110k declares — race is very tight and very reliable.

Operational impact

Until this lands, any peer that can open a session and declare resources can wedge a zenohd router in seconds — remote DoS without any authentication-bypass needed (just a trusted peer that churns subscriptions fast enough; the original incident was triggered by accidental gossip churn, not an attack).

Files changed

  • zenoh/src/net/runtime/adminspace.rs — three handlers fixed, ~154 lines.

Related


🏷️ Label-Based Checklist

No specific label requirements detected.

Current labels: No labels

Add one of these labels to this PR to see relevant checklist items: api-sync, breaking-change, bug, ci, dependencies, documentation, enhancement, new feature, internal

This section updates automatically when labels change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant