Summary
Under --concurrent at high batch size (B=8), short / early-finishing generations produce corrupted output (repeated tokens, e.g. !!!!!!!). Discovered during v0.9.13 release validation on mlx-community/Qwen3.6-35B-A3B-4bit.
Repro
# GARBAGE at B=8:
afm mlx -m mlx-community/Qwen3.6-35B-A3B-4bit --port 9999 --no-think --concurrent 8
python3 Scripts/feature-mlx-concurrent-batch/validate_responses.py
# -> 22/32, B=8: 3/8, e.g. "capital of France" -> '!!!!!!!!!!!!...'

# CLEAN with thinking on (long generations):
afm mlx -m mlx-community/Qwen3.6-35B-A3B-4bit --port 9999 --concurrent 8
python3 Scripts/feature-mlx-concurrent-batch/validate_responses.py
# -> 30/32, B=8: 8/8
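The repeated-token signature is easy to check for mechanically. The following is a minimal sketch of such a check, not the logic of validate_responses.py; the function names, the `min_run` threshold of 8, and the choice to treat an empty response as a failure are all assumptions made for illustration.

```python
from itertools import groupby


def longest_run(text: str) -> int:
    """Return the length of the longest run of one repeated character."""
    return max((sum(1 for _ in group) for _, group in groupby(text)), default=0)


def looks_degenerate(text: str, min_run: int = 8) -> bool:
    """Flag responses that contain a long single-character run, e.g. '!!!!!!!!'.

    Hypothetical helper: the threshold and the empty-response rule are
    assumptions, not taken from the project's validation script.
    """
    stripped = text.strip()
    if not stripped:
        return True  # assumption: an empty response also counts as a failure
    return longest_run(stripped) >= min_run
```

A check like this only catches the single-character form of the corruption; repeated multi-character tokens would need a token-level comparison instead.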
Analysis
Not caused by --no-think per se. --no-think (fixed in this release so it actually disables thinking) makes generations short and varied-length, which exposes a latent BatchScheduler bug. With thinking on, all sequences are long and finish together, so the bug doesn't trigger.
The !!!! repeated-token signature points at a slot/KV lifecycle bug: when one sequence in the batch finishes early, its slot's KV state appears to corrupt a still-running sequence (or the finished slot is re-decoded).
Impact
Default behavior is unaffected: without --no-think, batched decode is clean (30/32 at B=8).
Narrow combination: opt-in --no-think and high --concurrent. Lower concurrency or omitting --no-think avoids it.
Suggested fix area
BatchScheduler slot retirement / KV-cache handling when a sequence hits EOS before others in the batch (eviction or masking of the finished slot's contribution).
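To make the intended invariant concrete, here is a toy sketch of slot retirement in a batched decode loop. It is not the BatchScheduler implementation: `Slot`, `decode_step`, `run_batch`, the `EOS` value, and the list used as a stand-in for the KV cache are all hypothetical. It only illustrates the property the fix would need to guarantee, namely that a slot which hits EOS is never decoded again and its state is released while the other slots keep running.

```python
from dataclasses import dataclass, field

EOS = -1  # hypothetical end-of-sequence token id


@dataclass
class Slot:
    request_id: int
    tokens: list = field(default_factory=list)
    kv: list = field(default_factory=list)  # stand-in for a per-slot KV cache
    done: bool = False


def decode_step(slots, next_token):
    """Advance every active slot by one token; retired slots are skipped."""
    for slot in slots:
        if slot.done:
            continue  # retired: never re-decoded, nothing appended to its KV
        tok = next_token(slot)
        if tok == EOS:
            slot.done = True
            slot.kv.clear()  # release state so it cannot leak into other slots
            continue
        slot.tokens.append(tok)
        slot.kv.append(tok)


def run_batch(slots, next_token, max_steps=64):
    """Decode until every slot has finished or max_steps is reached."""
    for _ in range(max_steps):
        if all(slot.done for slot in slots):
            break
        decode_step(slots, next_token)
    return {slot.request_id: slot.tokens for slot in slots}


# Usage: three requests that finish at different steps, as with --no-think.
lengths = {0: 2, 1: 5, 2: 3}


def fake_model(slot):
    n = len(slot.tokens)
    return EOS if n >= lengths[slot.request_id] else slot.request_id * 100 + n


result = run_batch([Slot(i) for i in lengths], fake_model)
```

In a real batched decoder the slots share tensors, so the equivalent of `continue` would be masking or evicting the finished row. A regression test with deliberately mixed generation lengths, as in the usage above, would exercise that path.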
Found during v0.9.13 validation; documented as a known limitation for that release.
See also: the Concurrent x8 shared-prefix assertion failures and the "Concurrent x8 prefix cache + grammar returns empty responses" (#86) concurrent issues.