mnnvl: normal mode supports mnnvl #565

Open

xuefeng-d wants to merge 1 commit into deepseek-ai:main from xuefeng-d:normal-mnnvl
Conversation

@xuefeng-d
Allow more than 8 GPUs to work via MNNVL in normal intra-node mode.

Motivation:

To enable DeepEP to run on machines that support MNNVL.

For example, on these machines, each has 4 GPUs, and the GPUs can be interconnected across machines via MNNVL.

Current status:

Some work has already been completed: specifically, the `use_fabric` parameter can now be used for MNNVL. However, several issues remain:

  1. It currently only supports two machines (each with 4 GPUs), meaning it is still constrained by the 8-GPU limitation.

  2. This behavior is controlled by an additional parameter, which is inconvenient: higher-level frameworks typically don't know when to set it. Moreover, when MNNVL is not in use, RDMA is usually available as a fallback, so MNNVL should be treated as an optimization rather than a mandatory configuration.

Implementation:

  1. I introduced an environment variable `DEEP_EP_NORMAL_MNNVL` to control whether the *normal* mode should use MNNVL. To clarify: for *low-latency* mode, if the user enables the 'allow_MNNVL' option, MNNVL is already usable. This provides a smooth user experience, and frameworks like SGLang have already adopted this approach. However, for *normal* mode, deciding when to enable MNNVL is more nuanced. I believe using an environment variable is a clean and flexible solution. When this variable is set, the system enters *intra* mode and bypasses the *inter*-node logic entirely, using MNNVL for communication across all GPUs.

  2. I removed the buffer parameter `use_fabric`. The name seemed to refer to a specific handle-exchange mechanism under MNNVL, which is an implementation detail we shouldn't expose at this layer: it's not relevant to our high-level orchestration logic.

  3. Addresses previously constrained by `NUM_MAX_NVL_PEERS` now use `num_nvl_ranks` for resource allocation.

  4. Finally, it's worth noting that for configurations exceeding 8 GPUs, the current implementation only supports 12 and 24 GPUs. This limitation stems from the current intra-node design: due to shared memory size constraints, we cannot set `kNumThreads` too large, and `num_ranks` must be a divisor of `kNumThreads`. As a result, only specific GPU counts (like 12 and 24) are currently feasible. To support 16 or 32 GPUs, we would likely need to decrease `kNumThreads` to 512 to satisfy both shared memory capacity and warp occupancy requirements.
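The environment-variable gating in item 1 can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `DEEP_EP_NORMAL_MNNVL` is the variable introduced here, but the helper names and the value-parsing rule are hypothetical.

```python
import os

def normal_mode_uses_mnnvl() -> bool:
    # Hypothetical parsing rule: treat any non-empty, non-"0" value as
    # enabled. DeepEP's actual parsing may differ.
    return os.environ.get("DEEP_EP_NORMAL_MNNVL", "") not in ("", "0")

def select_normal_mode(num_ranks: int, num_local_ranks: int) -> str:
    # When the variable is set, treat every rank as part of one NVLink
    # domain (intra mode) and bypass the inter-node (RDMA) path entirely.
    if normal_mode_uses_mnnvl():
        return "intra"
    # Otherwise use the usual split: intra within a node, inter across nodes.
    return "intra" if num_ranks == num_local_ranks else "inter"
```

With the variable unset, a 12-rank job on 4-GPU nodes takes the inter-node path; with it set, all 12 ranks communicate through the intra-node (MNNVL) path.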
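Item 3 amounts to sizing per-peer buffer regions by the runtime `num_nvl_ranks` instead of the compile-time `NUM_MAX_NVL_PEERS` cap of 8. A toy offset calculation, with hypothetical names and sizes (not DeepEP's actual buffer layout):

```python
NUM_MAX_NVL_PEERS = 8  # old compile-time cap

def peer_offsets(num_nvl_ranks: int, bytes_per_peer: int) -> list[int]:
    # Each NVL peer gets a contiguous region; offsets now scale with the
    # actual rank count instead of stopping at NUM_MAX_NVL_PEERS.
    return [rank * bytes_per_peer for rank in range(num_nvl_ranks)]
```

With 12 ranks and 1 MiB per peer, the old cap would only ever address the first 8 regions; sizing by `num_nvl_ranks` covers all 12.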
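The feasibility constraint in item 4 (`num_ranks` must evenly divide `kNumThreads`) can be checked with simple arithmetic. The helper below is illustrative only and does not assert the kernel's actual current block size:

```python
def feasible_rank_counts(k_num_threads: int, max_ranks: int = 32) -> list[int]:
    # num_ranks must evenly divide the thread-block size.
    return [r for r in range(2, max_ranks + 1) if k_num_threads % r == 0]
```

For example, with `kNumThreads` lowered to 512, both 16 and 32 divide evenly (512 = 16 x 32 = 32 x 16), while 12 and 24 do not, which is consistent with the trade-off described above.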

By the way, I also removed the `--allow-mnnvl` flag from `test_intranode.py`, as it only affects NVSHMEM and has no practical effect in intra-node scenarios.
