feat: stripe RDMA traffic across multiple Thunderbolt cables (multi-rail jaccl) #2160
Draft
jasonpaulso wants to merge 1 commit into
The jaccl devices matrix previously kept a single interface name per node pair, picked independently for each direction. With more than one cable between two nodes the two directions could name interfaces on different cables, and the resulting queue pairs silently deadlocked during model load. Any additional cables were also invisible to the backend.

Each matrix cell now carries every physical link between a pair, with both directions derived from one enumeration so that rail k of [i][j] and rail k of [j][i] always name the two ends of the same cable (jaccl pairs queue pairs across ranks by index).

When any pair has more than one link the worker sets MLX_JACCL_RING=1, selecting the jaccl ring backend, which stripes collectives across all links (the default mesh backend only uses the first). Single-link setups keep the mesh backend and existing behaviour.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Summary
The jaccl devices matrix previously kept a single RDMA interface per node pair, picked independently for each direction. With more than one Thunderbolt cable between two nodes, the two directions could name interfaces on different cables. Because jaccl pairs queue pairs by index across ranks via the side channel, the mismatched pairing silently deadlocked model load: we consistently stalled at 47/48 layers when loading a TP instance with 2 cables connected, and unplugging one cable "fixed" it. Extra cables were also invisible to the backend, so they gave no bandwidth benefit.
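To make the failure mode concrete, here is a minimal sketch of how two independent per-direction picks can land on different cables. The `Cable` model, the interface names, and the "alphabetically first" selection rule are all assumptions made for illustration; the description above does not say how the old code chose an interface, only that each direction chose on its own.

```python
from typing import NamedTuple


class Cable(NamedTuple):
    """One physical cable between rank 0 and rank 1 (hypothetical model)."""
    iface_on_0: str  # interface name at rank 0's end
    iface_on_1: str  # interface name at rank 1's end


# Two cables between the same pair of nodes; names are illustrative only.
cables = [
    Cable(iface_on_0="rdma_en5", iface_on_1="rdma_en3"),
    Cable(iface_on_0="rdma_en2", iface_on_1="rdma_en4"),
]

# Assumed stand-in for "picked independently for each direction":
# each side takes its own alphabetically first local interface.
pick_0 = min(c.iface_on_0 for c in cables)  # rank 0's interface toward rank 1
pick_1 = min(c.iface_on_1 for c in cables)  # rank 1's interface toward rank 0

# Map each pick back to the cable it belongs to and compare.
cable_of_0 = next(c for c in cables if c.iface_on_0 == pick_0)
cable_of_1 = next(c for c in cables if c.iface_on_1 == pick_1)
same_cable = cable_of_0 == cable_of_1

print(pick_0, pick_1, same_cable)
```

With these example names the two sides pick ends of different cables, which is the mismatch that index-based queue-pair pairing cannot tolerate.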
This PR:
- Changes `get_mlx_jaccl_devices_matrix` to collect all RDMA links per node pair (from both edge directions) and emit `list[list[list[str]]]`: each cell carries every local interface for that peer, ordered so that rail k on rank i and rail k on rank j refer to the same physical cable (both sides are emitted from one enumeration of the shared link set). jaccl's matrix parser accepts string-or-array cells, and the mesh backend keeps using `[0]`, so single-cable behavior is unchanged.
- Sets `MLX_JACCL_RING=1` when any pair has more than one link, so jaccl's ring backend stripes traffic across all cables.

Measured impact
2-node cluster (Mac Studio + MacBook Pro, macOS 26), 3× TB5 cables, `mlx-community/Qwen3-Coder-Next-4bit`, Tensor + RDMA: generation 19.4 → 42.0 tok/s, TTFT 610 → ~430 ms. The 2-cable mesh deadlock described above is also fixed, regardless of striping.

jaccl's ring backend has a recv-prefill bug that deadlocks multi-wire point-to-point recvs of small messages (pipeline-parallel activation transfers hang in warmup; tensor collectives are unaffected because `all_reduce` falls back to one wire for messages ≤ 64 KiB). Fixes are submitted upstream.

This PR should land together with an mlx pin bump that includes that fix. We've been running the full stack (this PR + patched jaccl) on the cluster above: PP+RDMA and TP+RDMA both stripe across 3 cables and pass smoke tests.
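For reviewers who want the rail-ordering idea from the summary in code form, the following is a rough sketch, not the code in this PR. The `Link` type, `build_matrix`, `needs_ring`, the interface names, and the empty diagonal cells are hypothetical; only the `list[list[list[str]]]` shape, the "one enumeration for both directions" rule, and the `MLX_JACCL_RING=1` condition come from the description above.

```python
from collections import defaultdict
from typing import NamedTuple


class Link(NamedTuple):
    """One reported RDMA edge (hypothetical model of the topology input)."""
    rank_a: int
    iface_a: str
    rank_b: int
    iface_b: str


def build_matrix(n: int, links: list[Link]) -> list[list[list[str]]]:
    """Build an n x n matrix whose cell [i][j] lists i's interfaces toward j."""
    per_pair: dict[tuple[int, int], set[tuple[str, str]]] = defaultdict(set)
    for link in links:
        if link.rank_a == link.rank_b:
            continue  # ignore self-links
        # Normalise so the lower rank comes first; an edge reported from
        # either direction then collapses to the same (lo_iface, hi_iface).
        if link.rank_a < link.rank_b:
            key, ends = (link.rank_a, link.rank_b), (link.iface_a, link.iface_b)
        else:
            key, ends = (link.rank_b, link.rank_a), (link.iface_b, link.iface_a)
        per_pair[key].add(ends)

    matrix: list[list[list[str]]] = [[[] for _ in range(n)] for _ in range(n)]
    for (lo, hi), ends in per_pair.items():
        # One sorted enumeration feeds both directions, so rail k of
        # [lo][hi] and rail k of [hi][lo] are the two ends of one cable.
        for lo_iface, hi_iface in sorted(ends):
            matrix[lo][hi].append(lo_iface)
            matrix[hi][lo].append(hi_iface)
    return matrix


def needs_ring(matrix: list[list[list[str]]]) -> bool:
    """True when any node pair has more than one link."""
    return any(len(cell) > 1 for row in matrix for cell in row)


# Two cables between ranks 0 and 1; the first is reported from both sides.
links = [
    Link(0, "rdma_en5", 1, "rdma_en3"),
    Link(1, "rdma_en3", 0, "rdma_en5"),
    Link(1, "rdma_en4", 0, "rdma_en2"),
]
matrix = build_matrix(2, links)
extra_env = {"MLX_JACCL_RING": "1"} if needs_ring(matrix) else {}
print(matrix, extra_env)
```

Sorting the normalised pairs is just one way to get a deterministic order; any ordering works as long as both directions are written from the same pass over the shared link set.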
🤖 Generated with Claude Code