BatchScheduler: garbage output at high --concurrent with short/early-finishing generations (exposed by --no-think) #140

## Summary

With `--concurrent 8` (batch size B=8), **short / early-finishing** generations produce corrupted output: runs of a single repeated token, e.g. `!!!!!!!`. Discovered during v0.9.13 release validation on `mlx-community/Qwen3.6-35B-A3B-4bit`.

## Repro

```bash
# GARBAGE at B=8:
afm mlx -m mlx-community/Qwen3.6-35B-A3B-4bit --port 9999 --no-think --concurrent 8
python3 Scripts/feature-mlx-concurrent-batch/validate_responses.py
# -> 22/32, B=8: 3/8, e.g. "capital of France" -> '!!!!!!!!!!!!...'

# CLEAN with thinking on (long generations):
afm mlx -m mlx-community/Qwen3.6-35B-A3B-4bit --port 9999 --concurrent 8
python3 Scripts/feature-mlx-concurrent-batch/validate_responses.py
# -> 30/32, B=8: 8/8
```
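
For triage, the repeated-token signature can be flagged with a simple character-run heuristic. The sketch below is not the logic of `validate_responses.py`; the function names and the `min_run` threshold are assumptions chosen for illustration, and a character-level run is only a proxy for a repeated token.

```python
import re


def longest_char_run(text: str) -> int:
    """Return the length of the longest run of one repeated character."""
    longest = 0
    # "(.)\1*" matches a character followed by any number of copies of itself.
    for match in re.finditer(r"(.)\1*", text, flags=re.DOTALL):
        longest = max(longest, len(match.group(0)))
    return longest


def looks_degenerate(text: str, min_run: int = 8) -> bool:
    """Heuristic: flag a response that contains a long single-character run.

    min_run is an assumed threshold, not a value taken from the project.
    """
    return longest_char_run(text) >= min_run
```

Calling `looks_degenerate("!!!!!!!!!!!!")` flags the failure shown above, while an ordinary sentence such as `"The capital of France is Paris."` passes.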

## Analysis

- **Not caused by `--no-think`** per se. `--no-think` (fixed in this release so it actually disables thinking) makes generations short and varied-length, which **exposes** a latent BatchScheduler bug. With thinking on, all sequences are long and finish together, so the bug doesn't trigger.
- The `!!!!` repeated-token signature points at a **slot/KV lifecycle bug**: when one sequence in the batch finishes early, its slot's KV state appears to corrupt a still-running sequence (or the finished slot is re-decoded).
- Same family as the long-standing `Concurrent x8 shared-prefix` assertion failures and the #86 concurrent issues.
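
To make the suspected failure mode concrete, here is a toy model of one way a slot/KV lifecycle bug could arise: a finished request is dropped from the slot table but its KV row is left in place, so later requests read the wrong row. This is a hypothetical illustration only. `ToyBatch`, `retire_buggy`, and `retire_correct` are invented names, and nothing here is taken from the actual `BatchScheduler` code or confirmed as the root cause.

```python
from dataclasses import dataclass


@dataclass
class ToyBatch:
    """Toy stand-in for a decode batch: one KV 'row' per active request."""
    kv_rows: list[list[str]]  # kv_rows[i] holds the cached context for row i
    owners: list[str]         # owners[i] is the request that reads row i


def make_batch() -> ToyBatch:
    return ToyBatch(
        kv_rows=[["a0", "a1", "a2"], ["b0", "b1"], ["c0", "c1", "c2"]],
        owners=["A", "B", "C"],
    )


def retire_buggy(batch: ToyBatch, finished: str) -> None:
    """Remove the request but forget its KV row, shifting later owners."""
    idx = batch.owners.index(finished)
    del batch.owners[idx]


def retire_correct(batch: ToyBatch, finished: str) -> None:
    """Remove the request and its KV row together so indices stay aligned."""
    idx = batch.owners.index(finished)
    del batch.owners[idx]
    del batch.kv_rows[idx]


def contexts(batch: ToyBatch) -> dict[str, list[str]]:
    """Map each live request to the KV row it would attend over."""
    return {owner: batch.kv_rows[i] for i, owner in enumerate(batch.owners)}


batch = make_batch()
retire_buggy(batch, "B")
print(contexts(batch)["C"])  # C now reads B's stale row: ['b0', 'b1']
```

In this model the bug only appears when a request finishes before the others, which matches the observation that long, similar-length generations stay clean.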

## Impact

- **Default behavior is unaffected**: without `--no-think`, batched decode is clean (8/8 at B=8, 30/32 overall).
- Narrow combination: opt-in `--no-think` **and** high `--concurrent`. Lower concurrency or omitting `--no-think` avoids it.

## Suggested fix area

`BatchScheduler` slot retirement / KV-cache handling when a sequence hits EOS before others in the batch (eviction or masking of the finished slot's contribution).
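
As a rough picture of the "masking" option, the sketch below runs a batched decode loop with a per-slot active flag, so a slot that hits EOS is never sampled or appended to again. It is a minimal model under stated assumptions: `streams` is a scripted stand-in for the model's per-step output, and `decode_with_active_mask` and the `EOS` marker are hypothetical names, not part of the project.

```python
EOS = "<eos>"  # assumed end-of-sequence marker for this toy example


def decode_with_active_mask(
    streams: list[list[str]], max_steps: int
) -> list[list[str]]:
    """Decode all slots step by step, retiring each slot at its first EOS.

    streams[i][t] is the token slot i would emit at step t (scripted here
    in place of a real model forward pass).
    """
    active = [True] * len(streams)
    outputs: list[list[str]] = [[] for _ in streams]
    for step in range(max_steps):
        if not any(active):
            break  # every slot has finished
        for slot, stream in enumerate(streams):
            if not active[slot]:
                continue  # retired slot: contributes nothing further
            token = stream[step] if step < len(stream) else EOS
            if token == EOS:
                active[slot] = False
                continue
            outputs[slot].append(token)
    return outputs
```

With `streams = [["Paris", EOS, "!", "!"], ["The", "answer", "is", "4", EOS]]`, the first slot stops at `["Paris"]` and the trailing `"!"` entries are never emitted, while the second slot runs to completion.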

Found during v0.9.13 validation; documented as a known limitation for that release.
