feat: stripe RDMA traffic across multiple Thunderbolt cables (multi-rail jaccl) #2160

Draft
jasonpaulso wants to merge 1 commit into exo-explore:main from jasonpaulso:feat/multi-rail-rdma

@jasonpaulso

Summary

With more than one Thunderbolt cable between two nodes, the jaccl devices matrix previously kept a single RDMA interface per node pair, picked independently for each direction. The two directions could name interfaces on different cables. Because jaccl pairs queue pairs by index across ranks via the side channel, the mismatched pairing silently deadlocked model load (we consistently stalled at 47/48 layers loading a TP instance with 2 cables connected; unplugging one cable "fixed" it). Extra cables were also invisible to the backend, so they added no bandwidth.

This PR:

  • Rewrites get_mlx_jaccl_devices_matrix to collect all RDMA links per node pair (from both edge directions) and emit list[list[list[str]]] — each cell carries every local interface for that peer, ordered so that rail k on rank i and rail k on rank j refer to the same physical cable (both sides are emitted from one enumeration of the shared link set). jaccl's matrix parser accepts string-or-array cells, and the mesh backend keeps using [0], so single-cable behavior is unchanged.
  • Sets MLX_JACCL_RING=1 when any pair has more than one link, so jaccl's ring backend stripes traffic across all cables.
  • Adds regression tests, including a crossed-interface case (cable1 = A:en6↔B:en7, cable2 = A:en7↔B:en6 inserted in opposite orders) asserting rails align to physical cables.
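
The rail-alignment idea in the first bullet can be sketched as follows. This is a minimal illustration, not the code in this PR: the `Edge` tuple shape and the name `build_devices_matrix` are hypothetical, and it assumes each directed topology edge reports both the local and the remote interface name. Both cells of a pair are filled from one sorted enumeration of the shared cable set, so rail k refers to the same physical cable on both ranks.

```python
from collections import defaultdict

# Hypothetical directed edge: (src_rank, src_iface, dst_rank, dst_iface).
Edge = tuple[int, str, int, str]


def build_devices_matrix(n: int, edges: list[Edge]) -> list[list[list[str]]]:
    """Return matrix[i][j] = local interfaces on rank i for peer j, rail-aligned."""
    # Record each physical cable once per unordered pair, keyed (lower, higher),
    # no matter which direction reported it.
    cables: dict[tuple[int, int], set[tuple[str, str]]] = defaultdict(set)
    for src, src_if, dst, dst_if in edges:
        if src == dst:
            continue  # ignore self-links
        if src < dst:
            cables[(src, dst)].add((src_if, dst_if))
        else:
            cables[(dst, src)].add((dst_if, src_if))

    # Diagonal cells stay empty in this sketch (an assumption).
    matrix: list[list[list[str]]] = [[[] for _ in range(n)] for _ in range(n)]
    for (lo, hi), links in cables.items():
        # One sorted enumeration feeds both cells, so index k matches on both sides.
        for lo_if, hi_if in sorted(links):
            matrix[lo][hi].append(lo_if)
            matrix[hi][lo].append(hi_if)
    return matrix


# Crossed-interface case: cable1 = A:en6<->B:en7, cable2 = A:en7<->B:en6,
# with the B->A edges inserted in the opposite order.
crossed = [
    (0, "en6", 1, "en7"),
    (0, "en7", 1, "en6"),
    (1, "en6", 0, "en7"),
    (1, "en7", 0, "en6"),
]
matrix = build_devices_matrix(2, crossed)
# matrix[0][1] == ["en6", "en7"] and matrix[1][0] == ["en7", "en6"]:
# rail 0 is cable1 on both ranks, rail 1 is cable2 on both ranks.
```

A consumer that only wants one link per peer, like the mesh backend described above, would read `matrix[i][j][0]`.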

Measured impact

2-node cluster (Mac Studio + MacBook Pro, macOS 26), 3× TB5 cables, mlx-community/Qwen3-Coder-Next-4bit, Tensor + RDMA: 19.4 → 42.0 tok/s generation (about 2.2×), TTFT 610 → ~430 ms. The 2-cable mesh deadlock described above is fixed independently of striping.

⚠️ Dependency — draft until the mlx pin updates

jaccl's ring backend has a recv-prefill bug that deadlocks multi-wire point-to-point recvs of small messages (pipeline-parallel activation transfers hang in warmup; tensor collectives are unaffected because all_reduce falls back to a single wire for messages ≤ 64 KiB). Fixes are submitted upstream:

This PR should land together with an mlx pin bump that includes that fix. We've been running the full stack (this PR + patched jaccl) on the cluster above: PP+RDMA and TP+RDMA both stripe across 3 cables and pass smoke tests.

🤖 Generated with Claude Code

The jaccl devices matrix previously kept a single interface name per node
pair, picked independently for each direction. With more than one cable
between two nodes the two directions could name interfaces on different
cables, and the resulting queue pairs silently deadlocked during model
load. Any additional cables were also invisible to the backend.

Each matrix cell now carries every physical link between a pair, with both
directions derived from one enumeration so that rail k of [i][j] and rail k
of [j][i] always name the two ends of the same cable (jaccl pairs queue
pairs across ranks by index).

When any pair has more than one link the worker sets MLX_JACCL_RING=1,
selecting the jaccl ring backend, which stripes collectives across all
links (the default mesh backend only uses the first). Single-link setups
keep the mesh backend and existing behaviour.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>