feat(routing): upstream (northbound) declaration aggregation by BOURBONCASK · Pull Request #2631 · eclipse-zenoh/zenoh

BOURBONCASK · 2026-06-02T09:18:03Z

This follows up on the discussion in #2630.

Up front, so there's no confusion about what this is: it's an experiment I built to deal with a real
routing-table scaling problem at edge-to-cloud scale, and I'm sharing it mostly as a concrete
reference / conversation starter rather than something I expect to land as-is. I understand a broader
Zenoh 2.0 redesign is only just starting to be discussed, with the scope still taking shape — so rather
than assume where this area lands, please read this as "here's one way it could look, and here's what I
learned doing it," offered as possible input to that conversation. I'm very happy to reshape it, cut it
down, rebase it onto whatever makes sense, or just leave it as something others can borrow from.

What it's for

When many downstream sessions each declare K subscribers/queryables under a shared key-expression prefix
and a router forwards them up into a router mesh, every upstream router ends up holding ~N×K
routing-table Resources (N branches × K keys). The routing table tends to be the first limit you hit.
Zenoh's existing aggregation.subscribers / publishers only collapses a session's own declarations,
not what a router forwards upstream for the sessions below it — so this extends config aggregation to a
router's northbound forwarding, letting an upstream router keep one Resource per configured prefix
instead of one per forwarded key.

What it does

When a north-bound router HAT forwards a downstream subscriber/queryable whose key-expression is included
by a configured aggregation.upstream.{subscribers,queryables} prefix, it folds the per-key children
into a single ${prefix} declaration upstream and suppresses the children there — keeping them
registered in the source region so downward routing is unchanged. There's no new wire type (the aggregate
is an ordinary DeclareSubscriber / DeclareQueryable), and it's opt-in (an empty aggregation.upstream
takes the existing propagation path).

aggregation: { upstream: { subscribers: ["example/**"], queryables: ["example/**"] } }
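To make the fold concrete, here is a minimal Python sketch of the idea (an illustration only, not the Rust code in this PR). The helpers `includes` and `fold_upstream` are hypothetical names, and the inclusion check is deliberately simplified: it only handles prefixes of the form `a/b/**` against concrete keys, not full Zenoh key-expression matching.

```python
def includes(prefix: str, key: str) -> bool:
    """Simplified inclusion test (assumption: prefix is 'a/b/**' or a literal key)."""
    if not prefix.endswith("/**"):
        return prefix == key
    base = prefix[:-3]  # strip the trailing '/**'
    # '**' may match zero chunks, so the bare base is included as well.
    return key == base or key.startswith(base + "/")


def fold_upstream(keys, prefixes):
    """Return the set of key expressions that would be declared upstream."""
    upstream = set()
    for key in keys:
        # First configured prefix that includes the key, if any.
        match = next((p for p in prefixes if includes(p, key)), None)
        # Folded keys are replaced by their prefix; others pass through unchanged.
        upstream.add(match if match is not None else key)
    return upstream


downstream = ["example/a", "example/b/c", "other/x"]
print(sorted(fold_upstream(downstream, ["example/**"])))
# prints ['example/**', 'other/x']
```

With an empty prefix list the function returns the keys unchanged, which mirrors the opt-in behaviour described above.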

A few design notes (in case they're useful for the broader discussion)

The aggregate Resources are created once when the gateway is built and the fold path just looks them
up — creating a Resource needs the whole Tables, which isn't reachable from inside a single HAT, and
doing it once means each aggregate's match-set is wired exactly once, so reconnect/churn can't pile up
duplicate cross-links.

The aggregate queryable is advertised complete=false (presence, not authority), so BestMatching falls
through to the real per-key queryable — a genuinely-complete source is never shadowed and no complete
state can go stale across owner churn.

For target=AllComplete, a route entry that is non-complete, whose matched resource covers the query, and
points at a router, is treated as a transparent forwarder (a small in-process flag on the query route
entry, set only at the router-net fold site — no wire change). AllComplete passes through it so the next
router re-applies the filter against its real children; BestMatching is untouched.
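The two notes above can be sketched as a small selection function. This is a Python illustration under stated assumptions, not the dispatcher code: `RouteEntry`, `select` and the string target names are hypothetical, the "covers the query and points at a router" condition is reduced to a single boolean flag, and BestMatching is modelled as "prefer a complete entry, otherwise fall through to every matching entry".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteEntry:
    name: str
    complete: bool
    # In-process flag only (assumption: set where the fold happens); never sent on the wire.
    transparent_forwarder: bool = False


def select(entries, target):
    """Pick the route entries a query is sent to, for a given query target."""
    if target == "All":
        return list(entries)
    if target == "AllComplete":
        # Complete entries, plus forwarders that let the next router re-apply the filter.
        return [e for e in entries if e.complete or e.transparent_forwarder]
    if target == "BestMatching":
        complete = [e for e in entries if e.complete]
        # Assumed semantics: one complete entry if available, else all matching entries.
        return complete[:1] if complete else list(entries)
    raise ValueError(f"unknown target: {target}")


aggregate = RouteEntry("example/** (aggregate)", complete=False, transparent_forwarder=True)
plain = RouteEntry("example/partial", complete=False)
real = RouteEntry("example/a (real, complete)", complete=True)

print([e.name for e in select([aggregate, plain], "AllComplete")])
print([e.name for e in select([aggregate, real], "BestMatching")])
```

In the first call only the aggregate is selected, so the query continues toward the forwarding router; in the second the complete source wins, so the non-complete aggregate does not shadow it.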

It's northbound-only, the fold goes through a refcounted ledger, and route caches are invalidated on fold
and teardown. Suspicious prefixes (a bare ** root, the @/ admin-space, duplicates, mutual inclusion)
get a startup warning.
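A refcounted ledger of this kind can be sketched as follows. This is a minimal Python illustration, assuming one counter per prefix; `FoldLedger`, `fold` and `unfold` are hypothetical names and do not correspond to identifiers in the PR.

```python
from collections import Counter


class FoldLedger:
    """Counts folded children per prefix so the aggregate is declared and withdrawn once."""

    def __init__(self):
        self._refs = Counter()

    def fold(self, prefix: str) -> bool:
        """Record one folded child. True means: declare the aggregate upstream now."""
        self._refs[prefix] += 1
        return self._refs[prefix] == 1

    def unfold(self, prefix: str) -> bool:
        """Drop one folded child. True means: undeclare the aggregate upstream now."""
        if self._refs[prefix] == 0:
            raise KeyError(f"no folded children under {prefix!r}")
        self._refs[prefix] -= 1
        if self._refs[prefix] == 0:
            del self._refs[prefix]
            return True
        return False


ledger = FoldLedger()
declared = sum(ledger.fold("example/**") for _ in range(50))
withdrawn = sum(ledger.unfold("example/**") for _ in range(50))
print(declared, withdrawn)  # prints 1 1
```

Only the first child triggers an upstream declaration and only the last one triggers the withdrawal, which is the behaviour the deterministic fold/teardown test checks.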

Trade-offs

At the upstream node, per-key ACL / QoS / interceptors and admin-space enumeration see only the
${prefix} aggregate (so per-key policy belongs on the forwarding router); the wildcard aggregate can
forward data toward the forwarding router for unsubscribed keys; and liveliness tokens are intentionally
not folded (a liveliness sample's key is the token's key, so a folded ${prefix}/** token couldn't
enumerate per-key presence or signal a per-key loss). These are noted alongside the config in
DEFAULT_CONFIG.json5 and the schema docs.

Numbers

From an in-process loopback-TCP benchmark:

- Upstream routing-table cardinality drops from O(N·K) to O(N), e.g. 5000 → 100 entries at N=100, K=50.
  This is exact by construction (the fold produces one aggregate per branch), independent of RAM.
- Roughly 6–9× less whole-process RSS at the upstream node (K=50). This is a process-level measurement
  and a secondary signal; the A/B delta is dominated by the cardinality collapse, which is the
  load-bearing result.
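The cardinality figure can be reproduced with a few lines of arithmetic. The Python sketch below is illustrative only: the `fleet/robot{b}/topic{k}` key layout and the function name `upstream_entries` are assumptions, and each branch is assumed to fold under its own prefix.

```python
def upstream_entries(n_branches: int, k_keys: int, fold: bool) -> int:
    """Count distinct key expressions an upstream router would hold."""
    keys = {
        f"fleet/robot{b}/topic{k}"
        for b in range(n_branches)
        for k in range(k_keys)
    }
    if not fold:
        return len(keys)  # one entry per forwarded key: N*K
    # One aggregate per branch: replace the last chunk with '**'.
    return len({key.rsplit("/", 1)[0] + "/**" for key in keys})


print(upstream_entries(100, 50, fold=False))  # prints 5000
print(upstream_entries(100, 50, fold=True))   # prints 100
```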

What changed

It's roughly 480 lines of routing code plus config and docs, with a similar amount of tests, all behind
the empty-config fast path. The bulk lives in the router HAT (the per-prefix fold/suppress/teardown for
subscribers and queryables). Supporting pieces add the config and its docs, the pre-created aggregates and
the prefix validation in the dispatcher tables, the transparent-forwarder flag on the query route entry
(threaded through the other HATs' query paths), and the tests.

Tests

- Deterministic, in-process (MockFace, no sleeps): K downstream subscribers collapse to exactly one
  aggregate on the upstream face, and undeclaring all of them withdraws exactly one aggregate.
- Real loopback-TCP integration: subscriber/queryable collapse K→1 plus delivery and teardown; wildcard
  get fans out to all children; a missing-key get returns empty without hanging; target=AllComplete
  reaches a complete child through the aggregate while a non-complete child stays empty (with a negative
  control); two branches under the same prefix don't shadow each other; cross-mesh propagation; plus an
  ignored scale bench.
- The existing unit / regions / queryable / matching / acl / adminspace / qos suites stay
  green, clippy --deny warnings is clean, and there's no behaviour change when the config is unset.
  (Also built + run on aarch64 with the same results.)

Compatibility

No protocol/wire change; additive opt-in config (defaults empty); MSRV (1.75) clean; built on the existing
regions/gateway routing model.


Extend config-driven aggregation to a router's northbound forwarding. When a north-bound router HAT forwards a downstream subscriber/queryable whose key-expression is included by a configured `aggregation.upstream.{subscribers,queryables}` prefix, the per-key children are folded into a single `${prefix}` declaration toward the upstream and suppressed there, while staying registered in the source region so downward routing is unchanged. An upstream router then holds one routing Resource per configured prefix instead of one per forwarded key.

The stock `aggregation.subscribers`/`publishers` only collapses a session's own declarations at the session boundary; this collapses what a router forwards upstream on behalf of the sessions below it -- the cost that grows as O(N*K) (N downstream branches, K keys each) on every upstream router in a mesh.

Design:
- No new wire type: the aggregate is an ordinary DeclareSubscriber/DeclareQueryable, so mesh propagation, matching and admin are unchanged.
- Opt-in: an empty `aggregation.upstream` takes the existing propagation path (no-op fast path).
- Aggregate Resources are pre-created at gateway build; the fold path only looks them up (a Resource needs the full Tables, unreachable from inside a single HAT), so each aggregate's match-set is wired once and reconnect/churn cannot accumulate duplicate cross-links.
- The aggregate queryable is advertised complete=false, so BestMatching falls through to the real per-key queryable: a genuinely-complete source is never shadowed and no completeness state can go stale across owner churn.
- target=AllComplete reaches complete children behind the aggregate via a transparent-forwarder flag on the in-process query route entry (set only at the router-net fold site, no wire change); BestMatching is untouched.
- Liveliness tokens are intentionally not folded (a liveliness sample's key is the token's own key, so a folded wildcard token could neither enumerate the live set nor signal a per-key removal).
- A startup check warns on suspicious prefixes (a bare `**` root, the `@/` admin-space, duplicates, mutually-including prefixes).

Trade-offs (documented in DEFAULT_CONFIG.json5 and the config schema): at the upstream node, per-key ACL/QoS-overwrite/interceptors and admin-space enumeration see only the `${prefix}` aggregate, and the wildcard aggregate can forward data toward the forwarding router for unsubscribed keys.

Tests: a deterministic in-process (MockFace) fold/teardown test, plus real loopback-TCP integration tests covering subscriber/queryable collapse, delivery and teardown, wildcard get fan-out, missing-key empty (no hang), target=AllComplete reaching a complete child through the aggregate with a non-complete negative control, no cross-branch shadowing, and cross-mesh propagation.

MSRV 1.75; clippy --deny warnings clean; no behaviour change when the config is unset.

Signed-off-by: yifei.ma <yifeima98@gmail.com>

fuzzypixelz

A few comments:

- Conceptually, any mode can sit in upstream regions. That this design focuses on router aggregation introduces debatable asymmetries in declaration propagation. The design philosophy of Zenoh doesn't require routers to necessarily be the "backbone" region.
- Propagating aggregated queryables as non-complete, and consequently making them indistinguishable on the wire from non-aggregated non-complete queryables, feels like a hack and incurs the cost of unnecessary query propagation. It is possible to remedy this using a protocol extension, and I would argue that this is the proper way to implement it.
- Given the first two points, this feature introduces significant complexity to the routing subsystem, and I'm not convinced that the utility gained here justifies that cost, especially since entity aggregation can simply be enabled in all application nodes. So I'd like to understand why that route (no pun intended) was not feasible for your use case (e.g. configuration nightmare?).

Overall, the present pull request and linked issue are well thought out and touch on a real problem. Thank you for working through the problem and exploring the solution space.

BOURBONCASK · 2026-06-18T07:51:05Z

@fuzzypixelz Thanks so much for taking the time to go through this so carefully! Let me share some deployment context here.

Why not just enable entity aggregation on all application nodes:

My setup is a large robot fleet. Each robot runs many processes isolated by a per-robot namespace (with near-identical internal key-exprs across robots), and between each robot and the cloud sits a single edge-side zenohd that also enforces ACL. The cloud router is what hits the O(N·K) wall.

Enabling aggregation.subscribers/publishers on every application node does collapse each node's own declarations, but in this topology I expect to run into two issues (I haven't tried it yet; I will give it a try next week):

- Over-delivery lands inside the robot. Aggregating a node's own declaration also coarsens the downward routing toward it: a process that only wants robotN/cam/3 ends up advertising robotN/**, so the bridge pushes the whole namespace down over the most constrained (intra-robot) hop. Northbound-only aggregation keeps the downward path per-key and only folds what goes up to the cloud.
- It may defeat per-key ACL on the bridge. If application nodes pre-aggregate, the bridge's ingress only sees robotN/**, so per-key rules can't be enforced there anymore. Keeping aggregation northbound means the bridge still sees per-key declarations, and only the cloud-facing side is folded.

Aggregation also doesn't cover queryables today (at least not in the config), which matters for my case. Config-at-scale is a factor too, but the two issues above are what really blocked me.

On routers being treated specially:
This is a fair point and I'd rather not bake "router == backbone" into it. The property I actually rely on is "a node forwarding declarations on behalf of a downstream region folds them northbound," which isn't router-specific. Would it make more sense to tie this to a region's north/south boundary rather than special-casing router mode? I'd really value your guidance on the right abstraction here.

On complete=false being a hack:
Agreed — that's a fair characterization, and your protocol-extension suggestion sounds like the proper way to do it. I'd be happy to explore that direction.

Lastly, the regions model is a beautiful foundation to build on — it's what made this feel tractable in the first place. I'd be glad to keep exploring directions toward making Zenoh exceptionally scalable.

fuzzypixelz reviewed Jun 17, 2026

BOURBONCASK wants to merge 1 commit into eclipse-zenoh:main from BOURBONCASK:feature/upstream-northbound-agg
