Allow more than 8 GPUs to work via MNNVL in normal intra mode.

Motivation:

To enable DeepEP to run on machines that support MNNVL. For example, each of these machines has 4 GPUs, and the GPUs can be interconnected across machines via MNNVL.

Current status:

Some work has already been completed; specifically, the `use_fabric` parameter can now be used for MNNVL. However, there are several issues:

1. It currently only supports two machines (each with 4 GPUs), meaning it is still constrained by the 8-GPU limitation.
2. This behavior is controlled by an additional parameter, which is inconvenient. Higher-level frameworks typically don't know when to set it. Moreover, when MNNVL is not in use, RDMA is usually available as a fallback; thus, MNNVL should be treated as an optimization rather than a mandatory configuration.

Implementation:

1. I introduced an environment variable, `DEEP_EP_NORMAL_MNNVL`, to control whether *normal* mode should use MNNVL. To clarify: for *low-latency* mode, MNNVL is already usable if the user enables the `allow_MNNVL` option. This provides a smooth user experience, and frameworks like SGLang have already adopted this approach. For *normal* mode, however, deciding when to enable MNNVL is more nuanced, and I believe an environment variable is a clean and flexible solution. When this variable is set, the system enters *intra* mode and bypasses the *inter*-node logic entirely, using MNNVL for communication across all GPUs.
2. I removed the buffer parameter `use_fabric`. The name seemed to refer to a specific handle-exchange mechanism under MNNVL, which is an implementation detail we shouldn't expose at this layer; it's not relevant to our high-level orchestration logic.
3. Allocations previously sized by `NUM_MAX_NVL_PEERS` are now sized by `num_nvl_ranks`.
4. Finally, it's worth noting that for configurations exceeding 8 GPUs, the current implementation only supports 12 and 24 GPUs. This limitation stems from the current intra-node design: due to shared memory size constraints, we cannot set `kNumThreads` too large, and `num_ranks` must be a divisor of `kNumThreads`. As a result, only specific GPU counts (like 12 and 24) are currently feasible. To support 16 or 32 GPUs, we would likely need to decrease `kNumThreads` to 512 to satisfy both shared memory capacity and warp occupancy requirements.

By the way, I also removed the `--allow-mnnvl` flag from `test_intranode.py`, as it only affects NVSHMEM and has no practical effect in intra-node scenarios.
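As a rough sketch of how a framework would consume the environment-variable gate from point 1, no code changes are needed on the framework side; the deployment environment decides. Note the helper name and the convention of treating `"1"` as the enabling value are assumptions for illustration; the PR does not pin down the exact accepted values:

```python
import os

def normal_mode_uses_mnnvl() -> bool:
    """Hypothetical helper illustrating the proposed gate: when
    DEEP_EP_NORMAL_MNNVL is set, normal mode takes the intra-node
    (MNNVL) path across all GPUs and skips the inter-node logic.
    Treating "1" as the enabling value is an assumption here."""
    return os.environ.get("DEEP_EP_NORMAL_MNNVL", "") == "1"

# No Buffer parameter to thread through the framework; the environment
# controls the behavior:
os.environ["DEEP_EP_NORMAL_MNNVL"] = "1"
print(normal_mode_uses_mnnvl())  # True
```

This is also why the approach composes cleanly with the RDMA fallback: leaving the variable unset changes nothing for existing deployments.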
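The divisor constraint from point 4 can be illustrated with a small sketch. Both the concrete `kNumThreads` values and the extra whole-warps-per-rank condition below are assumptions for illustration (the actual kernel configuration may differ), but under this model exactly 12 and 24 are feasible above 8, and lowering the thread count to 512 makes 16 feasible, matching the behavior described above:

```python
WARP_SIZE = 32  # CUDA warp size

def feasible_rank_counts(k_num_threads: int, max_ranks: int = 32) -> list[int]:
    """Rank counts where num_ranks divides kNumThreads and each rank's
    thread slice is a whole number of warps (assumed condition)."""
    return [
        r for r in range(1, max_ranks + 1)
        if k_num_threads % r == 0                    # num_ranks divides kNumThreads
        and (k_num_threads // r) % WARP_SIZE == 0    # whole warps per rank
    ]

print(feasible_rank_counts(768))  # [1, 2, 3, 4, 6, 8, 12, 24]
print(feasible_rank_counts(512))  # [1, 2, 4, 8, 16] -- 16 becomes feasible
```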