[chore] Remove weight-sync hacks now handled natively by vLLM 0.23.0 by hao-aaron · Pull Request #1861 · NovaSky-AI/SkyRL

hao-aaron · 2026-07-01T21:49:43Z

Summary

vLLM 0.23.0 adds native support for the RL weight-sync lifecycle and for returning sampled token IDs over HTTP. This PR upgrades SkyRL's minimum vLLM to 0.23.0 and rips out the shims we were carrying to work around their absence on the new inference path (_SKYRL_USE_NEW_INFERENCE=1, remote inference servers — the default).

What changed

1. Weight sync now uses vLLM's native endpoints (new path)

Previously SkyRL routed every weight update through vLLM's /collective_rpc into custom worker methods on NewInferenceWorkerWrap (skyrl_start_weight_update, update_weights_ipc, update_weights_nccl, skyrl_finish_weight_update). Two reasons that shim existed, both now resolved upstream:

vLLM's native /update_weights didn't accept SkyRL's chunked, packed CUDA-IPC payload.
MoE process_weights_after_loading read get_current_vllm_config(), so the receive had to be wrapped in set_current_vllm_config(...) — which the native endpoint didn't do. vLLM 0.23.0 ([Bugfix][MoE] Snapshot max_cudagraph_capture_size into FusedMoEConfig vllm-project/vllm#44613) dropped that get_current_vllm_config() read from the FlashInfer MoE kernels, so the wrap is no longer needed.

RemoteInferenceClient now calls the native /start_weight_update, /update_weights (via the existing update_named_weights), and /finish_weight_update. The update_weights_ipc / update_weights_nccl client methods are deleted.

2. Senders delegate packing to vLLM's transfer engines

CUDA IPC (cuda_ipc_strategy.py): instead of hand-packing each chunk into a contiguous buffer, creating IPC handles, and all_gather_object-ing them across ranks, we delegate to vLLM's IPCWeightTransferEngine.trainer_send_weights (packed mode), which handles uint8 packing, the cross-rank IPC-handle all-gather, bounded per-chunk buffers, and barriers. trainer_send_weights is synchronous, so we run it in a worker thread and bridge its rank-0 send_mode callback back to the event loop via run_coroutine_threadsafe, fanning the native IPCWeightTransferUpdateInfo payload out to all servers through the async client. The receiver's weight_transfer_engine (backend="ipc") unpacks via packed_ipc_consumer.
Broadcast / NCCL (broadcast_strategy.py): rank 0 drives NCCLWeightTransferEngine.trainer_send_weights and calls update_named_weights (native /update_weights) instead of the removed update_weights_nccl skyrl wrap.

3. `NewInferenceWorkerWrap` reduced to a hook

It's no longer a LayerwiseReloadWorkerMixin subclass with receive/load methods — it's a near-empty class kept only as the injection point (--worker-extension-cls) for SkyRL-specific worker overrides. The one remaining behavior: it applies the #44814 layerwise-reload numel patch (patch_numel_loaded) at import time. That fix is not in 0.23.0 (lands in 0.23.1); without it the native finish_weight_update → finalize_layerwise_reload → get_numel_loaded over-counts elements for composed weight loaders and silently drops trailing params (e.g. Mamba mixer.D). The call is guarded in try/except so the module stays importable where vLLM isn't (CPU CI). Remove once we bump to vLLM ≥ 0.23.1.

4. Token-in-token-out generation (`remote_inference_engine.py`)

Set return_token_ids=True on completion requests (vllm-project/vllm#22587) and read choice["token_ids"] directly, replacing the lossy re-tokenization of the output text (tokenizer.encode(text)). Falls back to re-tokenizing only if the field is absent (older server).

5. Version bump + docs

ray_wrapped_inference_engine.py: minimum vLLM assertion 0.18.0 → 0.23.0.
.claude/docs/weight_sync.md: rewritten to describe the native endpoints and the new hook role of NewInferenceWorkerWrap.

6. Tests

test_weight_sync.py: the NCCL flow now calls update_named_weights; the colocated IPC test's create_ipc_update_info produces the native IPCWeightTransferUpdateInfo packed payload (uint8 buffer via vLLM's pack_tensors, byte tensor_sizes, packed=True, handle carrying only the rebuild_cuda_tensor args) that packed_ipc_consumer expects, replacing SkyRL's old element-offset format.
test_pause_and_continue_generation.py: sets return_token_ids=True and asserts choice["token_ids"] length equals max_tokens, replacing the prior TODO.

Follow-ups

Drop the patch_numel_loaded() import-time call once on vLLM ≥ 0.23.1.

Test plan

Passing tests:

test_logprobs_matching_roundtrip[nemotron3-nano_tp4_ep4_h100]
test_weight_sync.py
test_weight_sync_moe.py

full scale gsm8k run:
https://wandb.ai/sky-posttraining-uc-berkeley/gsm8k/runs/1q2du2kq?nw=nwuserahao

vLLM 0.23.0 performs RL weight sync natively, so the SkyRL worker-extension methods that routed weight updates through /collective_rpc are no longer needed: - Drop skyrl_start_weight_update / update_weights_ipc / update_weights_nccl / skyrl_finish_weight_update. RemoteInferenceClient now drives vLLM's native /start_weight_update, /update_weights, and /finish_weight_update endpoints. - CUDA IPC and broadcast (NCCL) senders delegate to vLLM's own IPCWeightTransferEngine / NCCLWeightTransferEngine.trainer_send_weights and call update_named_weights instead of the SkyRL wrap. The set_current_vllm_config wrap is gone: vLLM 0.23.0 (vllm-project/vllm#44613) dropped the get_current_vllm_config() read from the FlashInfer MoE kernels. - NewInferenceWorkerWrap is now a near-empty hook that only applies the #44814 layerwise-reload numel patch at import time (remove once on vLLM >= 0.23.1). - Token-in-token-out: request return_token_ids=True (vllm-project/vllm#22587) and read choice["token_ids"] instead of re-tokenizing the output text. - Bump the minimum vLLM assertion to 0.23.0 and update weight_sync.md. Signed-off-by: haoaaron <ahao@anyscale.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request upgrades the vLLM integration to version 0.23.0, migrating from custom weight synchronization mechanisms (using /collective_rpc and NewInferenceWorkerWrap) to vLLM's native weight-sync endpoints (/start_weight_update, /update_weights, /finish_weight_update). It also leverages vLLM's new return_token_ids parameter to achieve true token-in-token-out generation. Feedback on the changes includes moving the top-level import of patch_numel_loaded inside the guarded try...except block in new_inference_worker_wrap.py to prevent import failures in environments without vLLM (such as CPU CI), and converting packed.shapes to lists of integers in test_weight_sync.py to avoid downstream JSON serialization issues.

gemini-code-assist · 2026-07-01T21:53:01Z

 from skyrl.backends.skyrl_train.inference_servers.layerwise_reload import (
-    LayerwiseReloadWorkerMixin,
+    patch_numel_loaded,
 )


To ensure that new_inference_worker_wrap.py remains importable on environments where vllm or layerwise_reload cannot be imported (such as CPU CI), we should avoid importing patch_numel_loaded at the top level. We can remove this top-level import and instead import it lazily inside the guarded try...except block at the bottom of the file.

gemini-code-assist · 2026-07-01T21:53:01Z

+try:
+    patch_numel_loaded()
+except Exception:  # pragma: no cover - vllm not importable in this process
+    pass


Import patch_numel_loaded lazily inside this guarded try...except block to ensure the module remains importable on CPU CI or other environments where vllm is not available.

Suggested change

try:

patch_numel_loaded()

except Exception: # pragma: no cover - vllm not importable in this process

pass

try:

from skyrl.backends.skyrl_train.inference_servers.layerwise_reload import (

patch_numel_loaded,

)

patch_numel_loaded()

except Exception: # pragma: no cover - vllm or layerwise_reload not importable

pass

gemini-code-assist · 2026-07-01T21:53:01Z

-            "sizes": sizes,
+            "names": packed.names,
+            "dtype_names": [str(dt).split(".")[-1] for dt in packed.dtypes],
+            "shapes": packed.shapes,


Convert packed.shapes (which contains torch.Size objects) to lists of lists of integers to prevent potential JSON serialization issues in downstream HTTP clients.

Suggested change

"shapes": packed.shapes,

"shapes": [list(s) for s in packed.shapes],

Reconciles our vLLM 0.23.0 native weight-sync cleanup with main's NovaSky-AI#1835 "[refactor] Remove legacy inference codepath" (which removed the local Ray-actor engine path and the _SKYRL_USE_NEW_INFERENCE flag). Conflict resolutions: - inference_engines/ray_wrapped_inference_engine.py, remote_inference_engine.py: deleted on main. Accepted deletion — our changes are superseded: the vLLM version is now hard-pinned to 0.23.0 in pyproject.toml (our runtime assertion is moot), and token-in-token-out is now native in vllm_server_actor.py's /skyrl/v1/generate (returns resp.token_ids directly), so our return_token_ids HTTP change is no longer needed. - test_pause_and_continue_generation.py: deleted on main. Accepted deletion. - new_inference_worker_wrap.py: kept our near-empty hook + #44814 patch_numel_loaded (main had NOT done this cleanup; its version still carried the collective_rpc update_weights_ipc/nccl methods). - broadcast_strategy.py / cuda_ipc_strategy.py / remote_inference_client.py: auto-merged — main dropped the legacy _SKYRL_USE_NEW_INFERENCE dispatch and _send_chunks_legacy, our native-endpoint sender edits apply on top. Verified coherent (no leftover refs to removed symbols). - .claude/docs/weight_sync.md: dropped legacy WorkerWrap sections, kept our accurate native-endpoint lifecycle description. Follow-up: LayerwiseReloadWorkerMixin in layerwise_reload.py is now dead code (patch_numel_loaded/get_numel_loaded there are still used). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ght sync) `_send_chunks_vllm_native` annotates the inner `weight_iterator()` with `Iterator[...]`, but `Iterator` was never imported — so the annotation raised `NameError: name 'Iterator' is not defined` the moment the CUDA IPC (colocated) sender ran a weight sync. Not caught earlier because: py_compile only checks syntax; the GPU test_weight_sync.py IPC case drives the client endpoints directly and bypasses this method; and both gsm8k smoke runs used NCCL broadcast (broadcast_strategy.py, which does import Iterator). The colocated megatron logprobs roundtrip test is the first to exercise this path. Also drops two dead imports (reduce_tensor, str_to_torch_dtype) left over from the native-weight-sync refactor (ruff --fix). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

gemini-code-assist Bot reviewed Jul 1, 2026

View reviewed changes

hao-aaron and others added 2 commits July 1, 2026 22:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[chore] Remove weight-sync hacks now handled natively by vLLM 0.23.0#1861

[chore] Remove weight-sync hacks now handled natively by vLLM 0.23.0#1861
hao-aaron wants to merge 3 commits into
NovaSky-AI:mainfrom
hao-aaron:vllm-upgrade-cleanup

hao-aaron commented Jul 1, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jul 1, 2026

Uh oh!

gemini-code-assist Bot Jul 1, 2026

Uh oh!

gemini-code-assist Bot Jul 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	"shapes": packed.shapes,
	"shapes": [list(s) for s in packed.shapes],

Uh oh!

Conversation

hao-aaron commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What changed

1. Weight sync now uses vLLM's native endpoints (new path)

2. Senders delegate packing to vLLM's transfer engines

3. NewInferenceWorkerWrap reduced to a hook

4. Token-in-token-out generation (remote_inference_engine.py)

5. Version bump + docs

6. Tests

Follow-ups

Test plan

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

hao-aaron commented Jul 1, 2026 •

edited

Loading

3. `NewInferenceWorkerWrap` reduced to a hook

4. Token-in-token-out generation (`remote_inference_engine.py`)