[chore] Remove weight-sync hacks now handled natively by vLLM 0.23.0#1861
[chore] Remove weight-sync hacks now handled natively by vLLM 0.23.0#1861hao-aaron wants to merge 3 commits into
Conversation
vLLM 0.23.0 performs RL weight sync natively, so the SkyRL worker-extension methods that routed weight updates through /collective_rpc are no longer needed: - Drop skyrl_start_weight_update / update_weights_ipc / update_weights_nccl / skyrl_finish_weight_update. RemoteInferenceClient now drives vLLM's native /start_weight_update, /update_weights, and /finish_weight_update endpoints. - CUDA IPC and broadcast (NCCL) senders delegate to vLLM's own IPCWeightTransferEngine / NCCLWeightTransferEngine.trainer_send_weights and call update_named_weights instead of the SkyRL wrap. The set_current_vllm_config wrap is gone: vLLM 0.23.0 (vllm-project/vllm#44613) dropped the get_current_vllm_config() read from the FlashInfer MoE kernels. - NewInferenceWorkerWrap is now a near-empty hook that only applies the #44814 layerwise-reload numel patch at import time (remove once on vLLM >= 0.23.1). - Token-in-token-out: request return_token_ids=True (vllm-project/vllm#22587) and read choice["token_ids"] instead of re-tokenizing the output text. - Bump the minimum vLLM assertion to 0.23.0 and update weight_sync.md. Signed-off-by: haoaaron <ahao@anyscale.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request upgrades the vLLM integration to version 0.23.0, migrating from custom weight synchronization mechanisms (using /collective_rpc and NewInferenceWorkerWrap) to vLLM's native weight-sync endpoints (/start_weight_update, /update_weights, /finish_weight_update). It also leverages vLLM's new return_token_ids parameter to achieve true token-in-token-out generation. Feedback on the changes includes moving the top-level import of patch_numel_loaded inside the guarded try...except block in new_inference_worker_wrap.py to prevent import failures in environments without vLLM (such as CPU CI), and converting packed.shapes to lists of integers in test_weight_sync.py to avoid downstream JSON serialization issues.
| from skyrl.backends.skyrl_train.inference_servers.layerwise_reload import ( | ||
| LayerwiseReloadWorkerMixin, | ||
| patch_numel_loaded, | ||
| ) |
There was a problem hiding this comment.
To ensure that new_inference_worker_wrap.py remains importable on environments where vllm or layerwise_reload cannot be imported (such as CPU CI), we should avoid importing patch_numel_loaded at the top level. We can remove this top-level import and instead import it lazily inside the guarded try...except block at the bottom of the file.
| try: | ||
| patch_numel_loaded() | ||
| except Exception: # pragma: no cover - vllm not importable in this process | ||
| pass |
There was a problem hiding this comment.
Import patch_numel_loaded lazily inside this guarded try...except block to ensure the module remains importable on CPU CI or other environments where vllm is not available.
| try: | |
| patch_numel_loaded() | |
| except Exception: # pragma: no cover - vllm not importable in this process | |
| pass | |
| try: | |
| from skyrl.backends.skyrl_train.inference_servers.layerwise_reload import ( | |
| patch_numel_loaded, | |
| ) | |
| patch_numel_loaded() | |
| except Exception: # pragma: no cover - vllm or layerwise_reload not importable | |
| pass |
| "sizes": sizes, | ||
| "names": packed.names, | ||
| "dtype_names": [str(dt).split(".")[-1] for dt in packed.dtypes], | ||
| "shapes": packed.shapes, |
Reconciles our vLLM 0.23.0 native weight-sync cleanup with main's NovaSky-AI#1835 "[refactor] Remove legacy inference codepath" (which removed the local Ray-actor engine path and the _SKYRL_USE_NEW_INFERENCE flag). Conflict resolutions: - inference_engines/ray_wrapped_inference_engine.py, remote_inference_engine.py: deleted on main. Accepted deletion — our changes are superseded: the vLLM version is now hard-pinned to 0.23.0 in pyproject.toml (our runtime assertion is moot), and token-in-token-out is now native in vllm_server_actor.py's /skyrl/v1/generate (returns resp.token_ids directly), so our return_token_ids HTTP change is no longer needed. - test_pause_and_continue_generation.py: deleted on main. Accepted deletion. - new_inference_worker_wrap.py: kept our near-empty hook + #44814 patch_numel_loaded (main had NOT done this cleanup; its version still carried the collective_rpc update_weights_ipc/nccl methods). - broadcast_strategy.py / cuda_ipc_strategy.py / remote_inference_client.py: auto-merged — main dropped the legacy _SKYRL_USE_NEW_INFERENCE dispatch and _send_chunks_legacy, our native-endpoint sender edits apply on top. Verified coherent (no leftover refs to removed symbols). - .claude/docs/weight_sync.md: dropped legacy WorkerWrap sections, kept our accurate native-endpoint lifecycle description. Follow-up: LayerwiseReloadWorkerMixin in layerwise_reload.py is now dead code (patch_numel_loaded/get_numel_loaded there are still used). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ght sync) `_send_chunks_vllm_native` annotates the inner `weight_iterator()` with `Iterator[...]`, but `Iterator` was never imported — so the annotation raised `NameError: name 'Iterator' is not defined` the moment the CUDA IPC (colocated) sender ran a weight sync. Not caught earlier because: py_compile only checks syntax; the GPU test_weight_sync.py IPC case drives the client endpoints directly and bypasses this method; and both gsm8k smoke runs used NCCL broadcast (broadcast_strategy.py, which does import Iterator). The colocated megatron logprobs roundtrip test is the first to exercise this path. Also drops two dead imports (reduce_tensor, str_to_torch_dtype) left over from the native-weight-sync refactor (ruff --fix). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
vLLM
0.23.0adds native support for the RL weight-sync lifecycle and for returning sampled token IDs over HTTP. This PR upgrades SkyRL's minimum vLLM to0.23.0and rips out the shims we were carrying to work around their absence on the new inference path (_SKYRL_USE_NEW_INFERENCE=1, remote inference servers — the default).What changed
1. Weight sync now uses vLLM's native endpoints (new path)
Previously SkyRL routed every weight update through vLLM's
/collective_rpcinto custom worker methods onNewInferenceWorkerWrap(skyrl_start_weight_update,update_weights_ipc,update_weights_nccl,skyrl_finish_weight_update). Two reasons that shim existed, both now resolved upstream:/update_weightsdidn't accept SkyRL's chunked, packed CUDA-IPC payload.process_weights_after_loadingreadget_current_vllm_config(), so the receive had to be wrapped inset_current_vllm_config(...)— which the native endpoint didn't do. vLLM0.23.0([Bugfix][MoE] Snapshot max_cudagraph_capture_size into FusedMoEConfig vllm-project/vllm#44613) dropped thatget_current_vllm_config()read from the FlashInfer MoE kernels, so the wrap is no longer needed.RemoteInferenceClientnow calls the native/start_weight_update,/update_weights(via the existingupdate_named_weights), and/finish_weight_update. Theupdate_weights_ipc/update_weights_ncclclient methods are deleted.2. Senders delegate packing to vLLM's transfer engines
cuda_ipc_strategy.py): instead of hand-packing each chunk into a contiguous buffer, creating IPC handles, andall_gather_object-ing them across ranks, we delegate to vLLM'sIPCWeightTransferEngine.trainer_send_weights(packed mode), which handles uint8 packing, the cross-rank IPC-handle all-gather, bounded per-chunk buffers, and barriers.trainer_send_weightsis synchronous, so we run it in a worker thread and bridge its rank-0send_modecallback back to the event loop viarun_coroutine_threadsafe, fanning the nativeIPCWeightTransferUpdateInfopayload out to all servers through the async client. The receiver'sweight_transfer_engine(backend="ipc") unpacks viapacked_ipc_consumer.broadcast_strategy.py): rank 0 drivesNCCLWeightTransferEngine.trainer_send_weightsand callsupdate_named_weights(native/update_weights) instead of the removedupdate_weights_ncclskyrl wrap.3.
NewInferenceWorkerWrapreduced to a hookIt's no longer a
LayerwiseReloadWorkerMixinsubclass with receive/load methods — it's a near-empty class kept only as the injection point (--worker-extension-cls) for SkyRL-specific worker overrides. The one remaining behavior: it applies the#44814layerwise-reload numel patch (patch_numel_loaded) at import time. That fix is not in0.23.0(lands in0.23.1); without it the nativefinish_weight_update → finalize_layerwise_reload → get_numel_loadedover-counts elements for composed weight loaders and silently drops trailing params (e.g. Mambamixer.D). The call is guarded intry/exceptso the module stays importable where vLLM isn't (CPU CI). Remove once we bump to vLLM ≥ 0.23.1.4. Token-in-token-out generation (
remote_inference_engine.py)Set
return_token_ids=Trueon completion requests (vllm-project/vllm#22587) and readchoice["token_ids"]directly, replacing the lossy re-tokenization of the output text (tokenizer.encode(text)). Falls back to re-tokenizing only if the field is absent (older server).5. Version bump + docs
ray_wrapped_inference_engine.py: minimum vLLM assertion0.18.0 → 0.23.0..claude/docs/weight_sync.md: rewritten to describe the native endpoints and the new hook role ofNewInferenceWorkerWrap.6. Tests
test_weight_sync.py: the NCCL flow now callsupdate_named_weights; the colocated IPC test'screate_ipc_update_infoproduces the nativeIPCWeightTransferUpdateInfopacked payload (uint8 buffer via vLLM'spack_tensors, bytetensor_sizes,packed=True, handle carrying only therebuild_cuda_tensorargs) thatpacked_ipc_consumerexpects, replacing SkyRL's old element-offset format.test_pause_and_continue_generation.py: setsreturn_token_ids=Trueand assertschoice["token_ids"]length equalsmax_tokens, replacing the prior TODO.Follow-ups
Drop the
patch_numel_loaded()import-time call once on vLLM ≥0.23.1.Test plan
Passing tests:
full scale gsm8k run:
https://wandb.ai/sky-posttraining-uc-berkeley/gsm8k/runs/1q2du2kq?nw=nwuserahao