Is your feature request related to a problem?
When spot instances are used for RL rollouts, worker nodes can fail without notice. To avoid regenerating a response from the very beginning, the scheduler needs to save the portion of each streaming response received so far and dispatch the request to a healthy node, which continues generation from that point.
Describe the Solution you'd like
- Add a mode for RL + spot instances.
- In this mode, all generations should use streaming, and the scheduler needs to save the streamed response as it arrives.
- If a worker fails, dispatch its in-flight requests to healthy nodes together with the output generated so far.
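
To make the request concrete, here is a minimal sketch of the behavior described in the list above. It is an illustration only, not an existing API: `ResumableScheduler`, `WorkerFailure`, and the worker signature `(prompt, generated_so_far) -> iterator of chunks` are all hypothetical names chosen for this example, and the two workers are in-process fakes standing in for real nodes.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterator, List


class WorkerFailure(Exception):
    """Hypothetical error raised when a worker node disappears mid-stream."""


# Assumed worker interface: takes the prompt and the text generated so far,
# and streams only the chunks that come after that text.
Worker = Callable[[str, str], Iterator[str]]


@dataclass
class ResumableScheduler:
    workers: List[Worker]
    # request_id -> streamed output saved so far
    partial: Dict[str, str] = field(default_factory=dict)

    def generate(self, request_id: str, prompt: str) -> str:
        self.partial.setdefault(request_id, "")
        for worker in self.workers:
            try:
                # Hand over the saved prefix so the worker continues
                # instead of generating from the beginning.
                for chunk in worker(prompt, self.partial[request_id]):
                    self.partial[request_id] += chunk
                return self.partial.pop(request_id)
            except WorkerFailure:
                continue  # keep the saved prefix and try the next node
        raise RuntimeError(f"no healthy worker left for request {request_id}")


FULL_TEXT = "Hello, world!"  # fake deterministic completion for the demo


def flaky_worker(prompt: str, generated: str) -> Iterator[str]:
    """Streams two chunks, then fails like a reclaimed spot instance."""
    for i, chunk in enumerate(["Hello", ",", " world", "!"]):
        if i == 2:
            raise WorkerFailure("spot instance reclaimed")
        yield chunk


def healthy_worker(prompt: str, generated: str) -> Iterator[str]:
    """Continues from the saved prefix rather than restarting."""
    yield FULL_TEXT[len(generated):]


if __name__ == "__main__":
    scheduler = ResumableScheduler([flaky_worker, healthy_worker])
    print(scheduler.generate("req-1", "Say hi"))  # prints "Hello, world!"
```

In this sketch the first worker fails after streaming `Hello,`, and the second worker receives that prefix and produces only the remainder. A real implementation would presumably also need to detect failures that surface as timeouts or dropped connections rather than exceptions, and to save whatever per-token data the RL pipeline needs alongside the text.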
Alternatives Considered (Optional)
No response
Additional Context (Optional)
No response