
feat(routing): upstream (northbound) declaration aggregation#2631

Open
BOURBONCASK wants to merge 1 commit into
eclipse-zenoh:mainfrom
BOURBONCASK:feature/upstream-northbound-agg


@BOURBONCASK BOURBONCASK commented Jun 2, 2026


This follows up on the discussion in #2630.

Up front, so there's no confusion about what this is: it's an experiment I built to deal with a real
routing-table scaling problem at edge-to-cloud scale, and I'm sharing it mostly as a concrete
reference / conversation starter rather than something I expect to land as-is. I understand a broader
Zenoh 2.0 redesign is only just starting to be discussed, with the scope still taking shape — so rather
than assume where this area lands, please read this as "here's one way it could look, and here's what I
learned doing it," offered as possible input to that conversation. I'm very happy to reshape it, cut it
down, rebase it onto whatever makes sense, or just leave it as something others can borrow from.

What it's for

When many downstream sessions each declare K subscribers/queryables under a shared key-expression prefix
and a router forwards them up into a router mesh, every upstream router ends up holding ~N×K
routing-table Resources (N branches × K keys). The routing table tends to be the first limit you hit.
Zenoh's existing aggregation.subscribers / publishers only collapses a session's own declarations,
not what a router forwards upstream for the sessions below it — so this extends config aggregation to a
router's northbound forwarding, letting an upstream router keep one Resource per configured prefix
instead of one per forwarded key.

What it does

When a north-bound router HAT forwards a downstream subscriber/queryable whose key-expression is included
by a configured aggregation.upstream.{subscribers,queryables} prefix, it folds the per-key children
into a single ${prefix} declaration upstream and suppresses the children there — keeping them
registered in the source region so downward routing is unchanged. There's no new wire type (the aggregate
is an ordinary DeclareSubscriber / DeclareQueryable), and it's opt-in (an empty aggregation.upstream
takes the existing propagation path).

aggregation: { upstream: { subscribers: ["example/**"], queryables: ["example/**"] } }
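A minimal sketch of the fold decision, for readers who want something concrete. It is not the code in this PR: `prefix_includes`, `Upstream` and `fold_decision` are hypothetical names, and the inclusion check only handles prefixes ending in `/**` rather than full key-expression semantics.

```rust
/// Simplified stand-in for key-expression inclusion (an assumption of this
/// sketch): only a bare `**` and prefixes of the form `a/b/**` are understood.
fn prefix_includes(prefix: &str, key: &str) -> bool {
    if prefix == "**" {
        return true;
    }
    match prefix.strip_suffix("/**") {
        // `a/**` covers `a` itself and anything below `a/`.
        Some(stem) => {
            key == stem
                || key
                    .strip_prefix(stem)
                    .map_or(false, |rest| rest.starts_with('/'))
        }
        // No trailing wildcard: treat the prefix as an exact key.
        None => prefix == key,
    }
}

/// What the forwarding router would declare upstream for one downstream key.
#[derive(Debug, PartialEq, Eq)]
enum Upstream {
    /// A configured prefix covers the key: declare the prefix instead.
    Aggregate(String),
    /// No configured prefix covers the key: forward it unchanged.
    PerKey(String),
}

/// Picks the first configured prefix that includes `key`, if any.
/// An empty `prefixes` slice always yields `PerKey` (the opt-in fast path).
fn fold_decision(prefixes: &[&str], key: &str) -> Upstream {
    match prefixes.iter().find(|p| prefix_includes(p, key)) {
        Some(p) => Upstream::Aggregate((*p).to_string()),
        None => Upstream::PerKey(key.to_string()),
    }
}
```

With the config above, `fold_decision(&["example/**"], "example/cam/3")` yields the `example/**` aggregate, while a key outside the prefix is forwarded per-key.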

A few design notes (in case they're useful for the broader discussion)

The aggregate Resources are created once when the gateway is built and the fold path just looks them
up — creating a Resource needs the whole Tables, which isn't reachable from inside a single HAT, and
doing it once means each aggregate's match-set is wired exactly once, so reconnect/churn can't pile up
duplicate cross-links.

The aggregate queryable is advertised complete=false (presence, not authority), so BestMatching falls
through to the real per-key queryable — a genuinely-complete source is never shadowed and no complete
state can go stale across owner churn.

For target=AllComplete, a route entry that is non-complete, whose matched resource covers the query, and
points at a router, is treated as a transparent forwarder (a small in-process flag on the query route
entry, set only at the router-net fold site — no wire change). AllComplete passes through it so the next
router re-applies the filter against its real children; BestMatching is untouched.
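To illustrate the AllComplete behaviour described above, here is a rough sketch under simplifying assumptions. `RouteEntry`, its fields and `all_complete_targets` are hypothetical and do not mirror the actual route-entry types; the point is only that a flagged non-complete entry is kept so the next hop can apply the filter to its real children.

```rust
/// Hypothetical, simplified query route entry.
struct RouteEntry {
    /// Where the query would be sent.
    face: &'static str,
    /// Completeness as advertised by the queryable declaration.
    complete: bool,
    /// In-process flag: set only for the folded aggregate that points at a router.
    transparent_forwarder: bool,
}

/// Selection for `target=AllComplete`: keep complete entries, and also keep
/// transparent forwarders so the next router can re-apply the same filter.
fn all_complete_targets(entries: &[RouteEntry]) -> Vec<&'static str> {
    entries
        .iter()
        .filter(|e| e.complete || e.transparent_forwarder)
        .map(|e| e.face)
        .collect()
}
```

Applied hop by hop: the upstream router keeps the aggregate entry because of the flag, and the forwarding router then keeps only its complete child, so a non-complete child never receives the query.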

It's northbound-only, the fold goes through a refcounted ledger, and route caches are invalidated on fold
and teardown. Suspicious prefixes (a bare ** root, the @/ admin-space, duplicates, mutual inclusion)
get a startup warning.
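The refcounted ledger can be pictured roughly like this. This is a sketch of the idea, not the PR's implementation; `FoldLedger` and `UpstreamAction` are hypothetical names, and route-cache invalidation is left out.

```rust
use std::collections::HashMap;

/// What the forwarding router should do upstream after a ledger update.
#[derive(Debug, PartialEq, Eq)]
enum UpstreamAction {
    /// First child under the prefix: declare the aggregate once.
    Declare,
    /// Last child under the prefix went away: withdraw the aggregate.
    Undeclare,
    /// The aggregate's upstream state does not change.
    Nothing,
}

/// Counts how many downstream declarations are folded into each prefix.
#[derive(Default)]
struct FoldLedger {
    counts: HashMap<String, usize>,
}

impl FoldLedger {
    /// Records one more child folded into `prefix`.
    fn fold(&mut self, prefix: &str) -> UpstreamAction {
        let count = self.counts.entry(prefix.to_string()).or_insert(0);
        *count += 1;
        if *count == 1 {
            UpstreamAction::Declare
        } else {
            UpstreamAction::Nothing
        }
    }

    /// Records that one child folded into `prefix` was undeclared.
    fn unfold(&mut self, prefix: &str) -> UpstreamAction {
        let Some(count) = self.counts.get_mut(prefix) else {
            // Nothing was folded under this prefix: ignore the teardown.
            return UpstreamAction::Nothing;
        };
        *count -= 1;
        if *count == 0 {
            self.counts.remove(prefix);
            UpstreamAction::Undeclare
        } else {
            UpstreamAction::Nothing
        }
    }
}
```

Folding K children under one prefix produces exactly one `Declare`, and only the last teardown produces the single `Undeclare`, which is the property the deterministic test checks.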

Trade-offs

At the upstream node, per-key ACL / QoS / interceptors and admin-space enumeration see only the
${prefix} aggregate (so per-key policy belongs on the forwarding router); the wildcard aggregate can
forward data toward the forwarding router for unsubscribed keys; and liveliness tokens are intentionally
not folded (a liveliness sample's key is the token's key, so a folded ${prefix}/** token couldn't
enumerate per-key presence or signal a per-key loss). These are noted alongside the config in
DEFAULT_CONFIG.json5 and the schema docs.

Numbers

From an in-process loopback-TCP benchmark:

  • Upstream routing-table cardinality drops from O(N·K) to O(N) — e.g. 5000 → 100 entries at N=100, K=50.
    This is exact by construction (the fold produces one aggregate per branch), independent of RAM.
  • Roughly 6–9× less whole-process RSS at the upstream node (K=50) — a process-level measurement and a
    secondary signal; the A/B delta is dominated by the cardinality collapse, which is the load-bearing
    result.
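The cardinality figure can be reproduced with a toy count. This sketch assumes one configured prefix per branch and uses made-up key names (`branch{i}/key{j}`); it models the table as a set of distinct declarations, not the real routing structures.

```rust
use std::collections::HashSet;

/// Number of distinct declarations an upstream node would hold for
/// `n` branches declaring `k` keys each, with or without folding.
fn upstream_table_size(n: usize, k: usize, aggregate: bool) -> usize {
    let mut table: HashSet<String> = HashSet::new();
    for branch in 0..n {
        for key in 0..k {
            let entry = if aggregate {
                // Every key of a branch folds into that branch's prefix.
                format!("branch{branch}/**")
            } else {
                format!("branch{branch}/key{key}")
            };
            table.insert(entry);
        }
    }
    table.len()
}
```

For N=100 and K=50 this gives 5000 entries without folding and 100 with it, matching the first bullet.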

What changed

It's roughly 480 lines of routing code plus config and docs, with a similar amount of tests, all behind
the empty-config fast path. The bulk lives in the router HAT (the per-prefix fold/suppress/teardown for
subscribers and queryables). Supporting pieces add the config and its docs, the pre-created aggregates and
the prefix validation in the dispatcher tables, the transparent-forwarder flag on the query route entry
(threaded through the other HATs' query paths), and the tests.

Tests

  • Deterministic, in-process (MockFace, no sleeps): K downstream subscribers collapse to exactly one
    aggregate on the upstream face, and undeclaring all of them withdraws exactly one aggregate.
  • Real loopback-TCP integration: subscriber/queryable collapse K→1 plus delivery and teardown; wildcard
    get fans out to all children; a missing-key get returns empty without hanging; target=AllComplete
    reaches a complete child through the aggregate while a non-complete child stays empty (with a negative
    control); two branches under the same prefix don't shadow each other; cross-mesh propagation; plus an
    ignored scale bench.
  • The existing unit / regions / queryable / matching / acl / adminspace / qos suites stay
    green, clippy --deny warnings is clean, and there's no behaviour change when the config is unset.
    (Also built + run on aarch64 with the same results.)

Compatibility

No protocol/wire change; additive opt-in config (defaults empty); MSRV (1.75) clean; built on the existing
regions/gateway routing model.



Extend config-driven aggregation to a router's northbound forwarding. When a north-bound
router HAT forwards a downstream subscriber/queryable whose key-expression is included by a
configured `aggregation.upstream.{subscribers,queryables}` prefix, the per-key children are
folded into a single `${prefix}` declaration toward the upstream and suppressed there, while
staying registered in the source region so downward routing is unchanged. An upstream router
then holds one routing Resource per configured prefix instead of one per forwarded key.

The stock `aggregation.subscribers`/`publishers` only collapses a session's own declarations at
the session boundary; this collapses what a router forwards upstream on behalf of the sessions
below it -- the cost that grows as O(N*K) (N downstream branches, K keys each) on every upstream
router in a mesh.

Design:
- No new wire type: the aggregate is an ordinary DeclareSubscriber/DeclareQueryable, so mesh
  propagation, matching and admin are unchanged.
- Opt-in: an empty `aggregation.upstream` takes the existing propagation path (no-op fast path).
- Aggregate Resources are pre-created at gateway build; the fold path only looks them up (a
  Resource needs the full Tables, unreachable from inside a single HAT), so each aggregate's
  match-set is wired once and reconnect/churn cannot accumulate duplicate cross-links.
- The aggregate queryable is advertised complete=false, so BestMatching falls through to the
  real per-key queryable: a genuinely-complete source is never shadowed and no completeness
  state can go stale across owner churn.
- target=AllComplete reaches complete children behind the aggregate via a transparent-forwarder
  flag on the in-process query route entry (set only at the router-net fold site, no wire
  change); BestMatching is untouched.
- Liveliness tokens are intentionally not folded (a liveliness sample's key is the token's own
  key, so a folded wildcard token could neither enumerate the live set nor signal a per-key
  removal).
- A startup check warns on suspicious prefixes (a bare `**` root, the `@/` admin-space,
  duplicates, mutually-including prefixes).

Trade-offs (documented in DEFAULT_CONFIG.json5 and the config schema): at the upstream node,
per-key ACL/QoS-overwrite/interceptors and admin-space enumeration see only the `${prefix}`
aggregate, and the wildcard aggregate can forward data toward the forwarding router for
unsubscribed keys.

Tests: a deterministic in-process (MockFace) fold/teardown test, plus real loopback-TCP
integration tests covering subscriber/queryable collapse, delivery and teardown, wildcard get
fan-out, missing-key empty (no hang), target=AllComplete reaching a complete child through the
aggregate with a non-complete negative control, no cross-branch shadowing, and cross-mesh
propagation. MSRV 1.75; clippy --deny warnings clean; no behaviour change when the config is unset.

Signed-off-by: yifei.ma <yifeima98@gmail.com>

@fuzzypixelz fuzzypixelz left a comment

Member


A few comments:

  • Conceptually, any mode can sit in upstream regions. That this design focuses on router aggregation introduces debatable asymmetries in declaration propagation. The design philosophy of Zenoh doesn't require routers to necessarily be the "backbone" region.
  • Propagating aggregated queryables as non-complete and consequently making them indistinguishable on the wire from non-aggregated non-complete queryables feels like a hack and incurs the cost of unnecessary query propagation. Now, it is possible to remedy this using a protocol extension and I would argue that this is the proper way to implement this.
  • Given the first two points, this feature introduces significant complexity to the routing subsystem, and I'm not convinced that the utility gained here justifies that cost. Since entity aggregation can simply be enabled in all application nodes, I'd like to understand why that route (no pun intended) was not feasible for your use case (e.g. configuration nightmare?).

Overall, the present pull request and linked issue are well thought out and touch on a real problem. Thank you for working through the problem and exploring the solution space.

@BOURBONCASK

Author

Thanks so much for taking the time to go through this so carefully, @fuzzypixelz! Let me share some deployment context here.

Why not just enable entity aggregation on all application nodes:

My setup is a large robot fleet. Each robot runs many processes isolated by a per-robot namespace (with near-identical internal key-exprs across robots), and between each robot and the cloud sits a single edge-side zenohd that also enforces ACL. The cloud router is what hits the O(N·K) wall.

Enabling aggregation.subscribers/publishers on every application node does collapse each node's own declarations, but in this topology I expect to run into two issues (I haven't tried it yet; I will give it a try next week):

  1. Over-delivery lands inside the robot. Aggregating a node's own declaration also coarsens the downward routing toward it — a process that only wants robotN/cam/3 ends up advertising robotN/**, so the bridge pushes the whole namespace down over the most constrained (intra-robot) hop. Northbound-only aggregation keeps the downward path per-key and only folds what goes up to the cloud.
  2. It may defeat per-key ACL on the bridge. If application nodes pre-aggregate, the bridge's ingress only sees robotN/**, so per-key rules can't be enforced there anymore. Keeping aggregation northbound means the bridge still sees per-key declarations, and only the cloud-facing side is folded.

Aggregation also doesn't cover queryables today (at least not in config), which matters for my case. Config-at-scale is a factor too, but the two issues above are what really blocked me.

On routers being treated specially:
This is a fair point and I'd rather not bake "router == backbone" into it. The property I actually rely on is "a node forwarding declarations on behalf of a downstream region folds them northbound," which isn't router-specific. Would it make more sense to tie this to a region's north/south boundary rather than special-casing router mode? I'd really value your guidance on the right abstraction here.

On complete=false being a hack:
Agreed — that's a fair characterization, and your protocol-extension suggestion sounds like the proper way to do it. I'd be happy to explore that direction.

Lastly, the regions model is a beautiful foundation to build on — it's what made this feel tractable in the first place. I'd be glad to keep exploring directions toward making Zenoh exceptionally scalable.


2 participants