Skip to content

Balance burst requests by tracking pending master assignments#2143

Open
RWL-Dittrich wants to merge 1 commit into
exo-explore:mainfrom
RWL-Dittrich:feature/round-robin-instances
Open

Balance burst requests by tracking pending master assignments#2143
RWL-Dittrich wants to merge 1 commit into
exo-explore:mainfrom
RWL-Dittrich:feature/round-robin-instances

Conversation

@RWL-Dittrich

Copy link
Copy Markdown

Motivation

When a cluster runs multiple instances of the same model, the master should send new requests to the least busy instance. But with several requests arriving at the same time, most of them were sent to one instance while the other stayed idle.

Root cause:
The master checks self.state.tasks to see how many tasks each instance has, then emits a TaskCreated event. However, self.state is only updated later by _event_processor.
Because of that delay, several requests can all see the same old state: node1=0, node2=0. Since ties always pick the first instance, the whole burst can go to one node.
This happens inside the master scheduler, so it does not matter which node receives the HTTP request.

Changes

  • Added Master._pending_assignments to track tasks that were assigned but are not yet visible in self.state.
  • Added _in_flight_counts(model_id, exclude) to count both tasks in self.state and pending assignments.
  • Used this helper in the TextGeneration, ImageGeneration, and ImageEdits schedulers.
  • Remove pending assignments when tasks are cancelled or finished.

Why It Works

The master now counts tasks immediately after assigning them, instead of waiting for the event to update self.state.
Example:

  • Request 1 sees node1=0, node2=0 and picks node 1.
  • Request 2 now sees node1=1, node2=0 because request 1 is pending, so it picks node 2.
    This prevents bursts from all going to the same instance. Once the TaskCreated event is applied, the task is counted through self.state and the pending entry is removed.

Test Plan

Manual Testing

  • Hardware: 2 mac mini's, both running Gemma 4.
  • Started two instances of Gemma 4
  • Send a few concurrent /v1/chat/completions requests at the same time.
  • Before: most requests go to instance 1.
  • After: requests are split roughly 2/2.
  • Repeat with HTTP requests sent to both nodes. The distribution should stay balanced.

Automated Testing

  • Existing src/exo/master/tests/test_master.py still covers the single-instance TextGeneration path.

@RWL-Dittrich RWL-Dittrich force-pushed the feature/round-robin-instances branch from 20300c1 to df05dcc Compare June 8, 2026 06:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant