[Feature]: Save streaming response and continue generation if worker node fails for RL #411

@TianyiZhao1437

Is your feature request related to a problem?

When using spot instances for RL rollout, worker nodes can fail without notification. To avoid regenerating from the very beginning, the partial streaming response produced so far should be saved, and the request should be dispatched to a healthy node that continues generation from that point.

Describe the Solution you'd like

  1. Add a mode for RL + spot instances.
  2. In this mode, all generations should use streaming, and the scheduler needs to save the streaming response as it arrives.
  3. If a worker fails, dispatch its in-flight requests to healthy nodes together with the outputs generated so far, so that generation continues instead of restarting.
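
To make the three steps above concrete, here is a minimal sketch of the save-and-resume flow. Everything in it is hypothetical and for illustration only: `FakeWorker`, `WorkerFailure`, and `Scheduler` are not existing classes in this project, and the toy "model" just emits the current sequence length as the next token so that the result is deterministic.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List


class WorkerFailure(Exception):
    """Raised when a (simulated) worker node disappears mid-stream."""


@dataclass
class FakeWorker:
    """Hypothetical stand-in for a rollout worker; not a real API."""
    name: str
    fail_after: int = -1  # tokens to emit before failing; -1 means never fail

    def stream(self, prompt: List[int], prefix: List[int],
               max_new_tokens: int) -> Iterator[int]:
        # Resume from prompt + already generated prefix.
        seq = prompt + prefix
        emitted = 0
        while len(seq) - len(prompt) < max_new_tokens:
            if emitted == self.fail_after:
                raise WorkerFailure(self.name)
            token = len(seq)  # toy deterministic "model" (assumption)
            seq.append(token)
            emitted += 1
            yield token


@dataclass
class Scheduler:
    """Hypothetical scheduler that saves streamed tokens per request."""
    workers: List[FakeWorker]
    saved: Dict[str, List[int]] = field(default_factory=dict)

    def generate(self, request_id: str, prompt: List[int],
                 max_new_tokens: int) -> List[int]:
        generated = self.saved.setdefault(request_id, [])
        for worker in self.workers:
            try:
                stream = worker.stream(prompt, list(generated), max_new_tokens)
                for token in stream:
                    generated.append(token)  # save each token as it arrives
                return generated
            except WorkerFailure:
                continue  # keep the saved prefix, try the next worker
        raise RuntimeError("no healthy worker left")


if __name__ == "__main__":
    scheduler = Scheduler([FakeWorker("spot-a", fail_after=3),
                           FakeWorker("spot-b")])
    print(scheduler.generate("req-1", prompt=[10, 11], max_new_tokens=6))
```

In this sketch the second worker receives the prompt plus the three tokens saved before the failure, so it only has to produce the remaining three. A real implementation would likely also need to re-run prefill on the new node and carry over anything the RL trainer depends on (for example per-token log-probabilities or sampling state); those details are outside what this issue specifies.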

Alternatives Considered (Optional)

No response

Additional Context (Optional)

No response

Labels

enhancement (New feature or request)
