Skip to content

[chore] Remove weight-sync hacks now handled natively by vLLM 0.23.0#1861

Open
hao-aaron wants to merge 3 commits into
NovaSky-AI:mainfrom
hao-aaron:vllm-upgrade-cleanup
Open

[chore] Remove weight-sync hacks now handled natively by vLLM 0.23.0#1861
hao-aaron wants to merge 3 commits into
NovaSky-AI:mainfrom
hao-aaron:vllm-upgrade-cleanup

Conversation

@hao-aaron

@hao-aaron hao-aaron commented Jul 1, 2026

Copy link
Copy Markdown
Collaborator

Summary

vLLM 0.23.0 adds native support for the RL weight-sync lifecycle and for returning sampled token IDs over HTTP. This PR upgrades SkyRL's minimum vLLM to 0.23.0 and rips out the shims we were carrying to work around their absence on the new inference path (_SKYRL_USE_NEW_INFERENCE=1, remote inference servers — the default).

What changed

1. Weight sync now uses vLLM's native endpoints (new path)

Previously SkyRL routed every weight update through vLLM's /collective_rpc into custom worker methods on NewInferenceWorkerWrap (skyrl_start_weight_update, update_weights_ipc, update_weights_nccl, skyrl_finish_weight_update). Two reasons that shim existed, both now resolved upstream:

  • vLLM's native /update_weights didn't accept SkyRL's chunked, packed CUDA-IPC payload.
  • MoE process_weights_after_loading read get_current_vllm_config(), so the receive had to be wrapped in set_current_vllm_config(...) — which the native endpoint didn't do. vLLM 0.23.0 ([Bugfix][MoE] Snapshot max_cudagraph_capture_size into FusedMoEConfig vllm-project/vllm#44613) dropped that get_current_vllm_config() read from the FlashInfer MoE kernels, so the wrap is no longer needed.

RemoteInferenceClient now calls the native /start_weight_update, /update_weights (via the existing update_named_weights), and /finish_weight_update. The update_weights_ipc / update_weights_nccl client methods are deleted.

2. Senders delegate packing to vLLM's transfer engines

  • CUDA IPC (cuda_ipc_strategy.py): instead of hand-packing each chunk into a contiguous buffer, creating IPC handles, and all_gather_object-ing them across ranks, we delegate to vLLM's IPCWeightTransferEngine.trainer_send_weights (packed mode), which handles uint8 packing, the cross-rank IPC-handle all-gather, bounded per-chunk buffers, and barriers. trainer_send_weights is synchronous, so we run it in a worker thread and bridge its rank-0 send_mode callback back to the event loop via run_coroutine_threadsafe, fanning the native IPCWeightTransferUpdateInfo payload out to all servers through the async client. The receiver's weight_transfer_engine (backend="ipc") unpacks via packed_ipc_consumer.
  • Broadcast / NCCL (broadcast_strategy.py): rank 0 drives NCCLWeightTransferEngine.trainer_send_weights and calls update_named_weights (native /update_weights) instead of the removed update_weights_nccl skyrl wrap.

3. NewInferenceWorkerWrap reduced to a hook

It's no longer a LayerwiseReloadWorkerMixin subclass with receive/load methods — it's a near-empty class kept only as the injection point (--worker-extension-cls) for SkyRL-specific worker overrides. The one remaining behavior: it applies the #44814 layerwise-reload numel patch (patch_numel_loaded) at import time. That fix is not in 0.23.0 (lands in 0.23.1); without it the native finish_weight_update → finalize_layerwise_reload → get_numel_loaded over-counts elements for composed weight loaders and silently drops trailing params (e.g. Mamba mixer.D). The call is guarded in try/except so the module stays importable where vLLM isn't (CPU CI). Remove once we bump to vLLM ≥ 0.23.1.

4. Token-in-token-out generation (remote_inference_engine.py)

Set return_token_ids=True on completion requests (vllm-project/vllm#22587) and read choice["token_ids"] directly, replacing the lossy re-tokenization of the output text (tokenizer.encode(text)). Falls back to re-tokenizing only if the field is absent (older server).

5. Version bump + docs

  • ray_wrapped_inference_engine.py: minimum vLLM assertion 0.18.0 → 0.23.0.
  • .claude/docs/weight_sync.md: rewritten to describe the native endpoints and the new hook role of NewInferenceWorkerWrap.

6. Tests

  • test_weight_sync.py: the NCCL flow now calls update_named_weights; the colocated IPC test's create_ipc_update_info produces the native IPCWeightTransferUpdateInfo packed payload (uint8 buffer via vLLM's pack_tensors, byte tensor_sizes, packed=True, handle carrying only the rebuild_cuda_tensor args) that packed_ipc_consumer expects, replacing SkyRL's old element-offset format.
  • test_pause_and_continue_generation.py: sets return_token_ids=True and asserts choice["token_ids"] length equals max_tokens, replacing the prior TODO.

Follow-ups

Drop the patch_numel_loaded() import-time call once on vLLM ≥ 0.23.1.

Test plan

Passing tests:

  • test_logprobs_matching_roundtrip[nemotron3-nano_tp4_ep4_h100]
  • test_weight_sync.py
  • test_weight_sync_moe.py

full scale gsm8k run:
https://wandb.ai/sky-posttraining-uc-berkeley/gsm8k/runs/1q2du2kq?nw=nwuserahao

vLLM 0.23.0 performs RL weight sync natively, so the SkyRL worker-extension
methods that routed weight updates through /collective_rpc are no longer
needed:

- Drop skyrl_start_weight_update / update_weights_ipc / update_weights_nccl /
  skyrl_finish_weight_update. RemoteInferenceClient now drives vLLM's native
  /start_weight_update, /update_weights, and /finish_weight_update endpoints.
- CUDA IPC and broadcast (NCCL) senders delegate to vLLM's own
  IPCWeightTransferEngine / NCCLWeightTransferEngine.trainer_send_weights and
  call update_named_weights instead of the SkyRL wrap. The set_current_vllm_config
  wrap is gone: vLLM 0.23.0 (vllm-project/vllm#44613) dropped the
  get_current_vllm_config() read from the FlashInfer MoE kernels.
- NewInferenceWorkerWrap is now a near-empty hook that only applies the #44814
  layerwise-reload numel patch at import time (remove once on vLLM >= 0.23.1).
- Token-in-token-out: request return_token_ids=True (vllm-project/vllm#22587)
  and read choice["token_ids"] instead of re-tokenizing the output text.
- Bump the minimum vLLM assertion to 0.23.0 and update weight_sync.md.

Signed-off-by: haoaaron <ahao@anyscale.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request upgrades the vLLM integration to version 0.23.0, migrating from custom weight synchronization mechanisms (using /collective_rpc and NewInferenceWorkerWrap) to vLLM's native weight-sync endpoints (/start_weight_update, /update_weights, /finish_weight_update). It also leverages vLLM's new return_token_ids parameter to achieve true token-in-token-out generation. Feedback on the changes includes moving the top-level import of patch_numel_loaded inside the guarded try...except block in new_inference_worker_wrap.py to prevent import failures in environments without vLLM (such as CPU CI), and converting packed.shapes to lists of integers in test_weight_sync.py to avoid downstream JSON serialization issues.

Comment on lines 34 to 36
from skyrl.backends.skyrl_train.inference_servers.layerwise_reload import (
LayerwiseReloadWorkerMixin,
patch_numel_loaded,
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To ensure that new_inference_worker_wrap.py remains importable on environments where vllm or layerwise_reload cannot be imported (such as CPU CI), we should avoid importing patch_numel_loaded at the top level. We can remove this top-level import and instead import it lazily inside the guarded try...except block at the bottom of the file.

Comment on lines +60 to +63
try:
patch_numel_loaded()
except Exception: # pragma: no cover - vllm not importable in this process
pass

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Import patch_numel_loaded lazily inside this guarded try...except block to ensure the module remains importable on CPU CI or other environments where vllm is not available.

Suggested change
try:
patch_numel_loaded()
except Exception: # pragma: no cover - vllm not importable in this process
pass
try:
from skyrl.backends.skyrl_train.inference_servers.layerwise_reload import (
patch_numel_loaded,
)
patch_numel_loaded()
except Exception: # pragma: no cover - vllm or layerwise_reload not importable
pass

"sizes": sizes,
"names": packed.names,
"dtype_names": [str(dt).split(".")[-1] for dt in packed.dtypes],
"shapes": packed.shapes,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Convert packed.shapes (which contains torch.Size objects) to lists of lists of integers to prevent potential JSON serialization issues in downstream HTTP clients.

Suggested change
"shapes": packed.shapes,
"shapes": [list(s) for s in packed.shapes],

hao-aaron and others added 2 commits July 1, 2026 22:35
Reconciles our vLLM 0.23.0 native weight-sync cleanup with main's
NovaSky-AI#1835 "[refactor] Remove legacy inference codepath" (which removed the
local Ray-actor engine path and the _SKYRL_USE_NEW_INFERENCE flag).

Conflict resolutions:
- inference_engines/ray_wrapped_inference_engine.py, remote_inference_engine.py:
  deleted on main. Accepted deletion — our changes are superseded:
  the vLLM version is now hard-pinned to 0.23.0 in pyproject.toml (our
  runtime assertion is moot), and token-in-token-out is now native in
  vllm_server_actor.py's /skyrl/v1/generate (returns resp.token_ids
  directly), so our return_token_ids HTTP change is no longer needed.
- test_pause_and_continue_generation.py: deleted on main. Accepted deletion.
- new_inference_worker_wrap.py: kept our near-empty hook + #44814
  patch_numel_loaded (main had NOT done this cleanup; its version still
  carried the collective_rpc update_weights_ipc/nccl methods).
- broadcast_strategy.py / cuda_ipc_strategy.py / remote_inference_client.py:
  auto-merged — main dropped the legacy _SKYRL_USE_NEW_INFERENCE dispatch
  and _send_chunks_legacy, our native-endpoint sender edits apply on top.
  Verified coherent (no leftover refs to removed symbols).
- .claude/docs/weight_sync.md: dropped legacy WorkerWrap sections, kept
  our accurate native-endpoint lifecycle description.

Follow-up: LayerwiseReloadWorkerMixin in layerwise_reload.py is now dead
code (patch_numel_loaded/get_numel_loaded there are still used).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ght sync)

`_send_chunks_vllm_native` annotates the inner `weight_iterator()` with
`Iterator[...]`, but `Iterator` was never imported — so the annotation
raised `NameError: name 'Iterator' is not defined` the moment the CUDA
IPC (colocated) sender ran a weight sync.

Not caught earlier because: py_compile only checks syntax; the GPU
test_weight_sync.py IPC case drives the client endpoints directly and
bypasses this method; and both gsm8k smoke runs used NCCL broadcast
(broadcast_strategy.py, which does import Iterator). The colocated
megatron logprobs roundtrip test is the first to exercise this path.

Also drops two dead imports (reduce_tensor, str_to_torch_dtype) left
over from the native-weight-sync refactor (ruff --fix).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant