mnnvl: normal mode supports mnnvl #565

Open

xuefeng-d wants to merge 1 commit into deepseek-ai:main from xuefeng-d:normal-mnnvl
Conversation

@xuefeng-d
Allow more than 8 GPUs to work via MNNVL in normal intra-node mode.

Motivation:

To enable DeepEP to run on machines that support MNNVL.

For example, on these machines, each has 4 GPUs, and the GPUs can be interconnected across machines via MNNVL.

Current status:

Some work has already been completed: specifically, the `use_fabric` parameter can now be used for MNNVL. However, several issues remain:

  1. It currently only supports two machines (each with 4 GPUs), meaning it is still constrained by the 8-GPU limitation.

  2. This behavior is controlled by an additional parameter, which is inconvenient: higher-level frameworks typically don't know when to set it. Moreover, when MNNVL is not in use, RDMA is usually available as a fallback, so MNNVL should be treated as an optimization rather than a mandatory configuration.

Implementation:

  1. I introduced an environment variable `DEEP_EP_NORMAL_MNNVL` to control whether the *normal* mode should use MNNVL. To clarify: for *low-latency* mode, if the user enables the 'allow_MNNVL' option, MNNVL is already usable. This provides a smooth user experience, and frameworks like SGLang have already adopted this approach. However, for *normal* mode, deciding when to enable MNNVL is more nuanced. I believe using an environment variable is a clean and flexible solution. When this variable is set, the system enters *intra* mode and bypasses the *inter*-node logic entirely, using MNNVL for communication across all GPUs.

  2. I removed the buffer parameter `use_fabric`. The name seemed to refer to a specific handle-exchange mechanism under MNNVL, which is an implementation detail we shouldn't expose at this layer: it's not relevant to our high-level orchestration logic.

  3. Addresses previously constrained by `NUM_MAX_NVL_PEERS` now use `num_nvl_ranks` for resource allocation.

  4. Finally, it's worth noting that for configurations exceeding 8 GPUs, the current implementation only supports 12 and 24 GPUs. This limitation stems from the current intra-node design: due to shared memory size constraints, we cannot set `kNumThreads` too large, and `num_ranks` must be a divisor of `kNumThreads`. As a result, only specific GPU counts (like 12 and 24) are currently feasible. To support 16 or 32 GPUs, we would likely need to decrease `kNumThreads` to 512 to satisfy both shared memory capacity and warp occupancy requirements.
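The environment-variable gating in item 1 can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `DEEP_EP_NORMAL_MNNVL` is the variable introduced here, but the helper names and the value-parsing rule are hypothetical.

```python
import os

def normal_mode_uses_mnnvl() -> bool:
    # Hypothetical parsing rule: treat any non-empty, non-"0" value as
    # enabled. DeepEP's actual parsing may differ.
    return os.environ.get("DEEP_EP_NORMAL_MNNVL", "") not in ("", "0")

def select_normal_mode(num_ranks: int, num_local_ranks: int) -> str:
    # When the variable is set, treat every rank as part of one NVLink
    # domain (intra mode) and bypass the inter-node (RDMA) path entirely.
    if normal_mode_uses_mnnvl():
        return "intra"
    # Otherwise use the usual split: intra within a node, inter across nodes.
    return "intra" if num_ranks == num_local_ranks else "inter"
```

With the variable unset, a 12-rank job on 4-GPU nodes takes the inter-node path; with it set, all 12 ranks communicate through the intra-node (MNNVL) path.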
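Item 3 amounts to sizing per-peer buffer regions by the runtime `num_nvl_ranks` instead of the compile-time `NUM_MAX_NVL_PEERS` cap of 8. A toy offset calculation, with hypothetical names and sizes (not DeepEP's actual buffer layout):

```python
NUM_MAX_NVL_PEERS = 8  # old compile-time cap

def peer_offsets(num_nvl_ranks: int, bytes_per_peer: int) -> list[int]:
    # Each NVL peer gets a contiguous region; offsets now scale with the
    # actual rank count instead of stopping at NUM_MAX_NVL_PEERS.
    return [rank * bytes_per_peer for rank in range(num_nvl_ranks)]
```

With 12 ranks and 1 MiB per peer, the old cap would only ever address the first 8 regions; sizing by `num_nvl_ranks` covers all 12.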
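The feasibility constraint in item 4 (`num_ranks` must evenly divide `kNumThreads`) can be checked with simple arithmetic. The helper below is illustrative only and does not assert the kernel's actual current block size:

```python
def feasible_rank_counts(k_num_threads: int, max_ranks: int = 32) -> list[int]:
    # num_ranks must evenly divide the thread-block size.
    return [r for r in range(2, max_ranks + 1) if k_num_threads % r == 0]
```

For example, with `kNumThreads` lowered to 512, both 16 and 32 divide evenly (512 = 16 x 32 = 32 x 16), while 12 and 24 do not, which is consistent with the trade-off described above.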

By the way, I also removed the `--allow-mnnvl` flag from `test_intranode.py`, as it only affects NVSHMEM and has no practical effect in intra-node scenarios.
