Summary
The Claude Messages API (/v1/messages) mishandles stop_sequences in three distinct ways:
1. stop_reason is never "stop_sequence"; it is always reported as "end_turn".
2. The stop_sequence response field is never populated; it is always null.
3. Multi-token stop sequences leak their leading bytes into the output and are not fully stripped.
(1) and (2) affect every stop sequence, including single-token ones. (3) additionally corrupts the output text whenever a stop sequence spans more than one decoded token.
Reproduction
curl http://localhost:52415/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<any mlx model>",
    "max_tokens": 1024,
    "messages": [{"role": "user",
                  "content": [{"type": "text",
                               "text": "Output the whole alphabet from A to Z, without spaces, then immediately output END"}]}],
    "stop_sequences": ["END"],
    "thinking": {"type": "disabled"}
  }'
Observed response:
{
  "content": [{"type": "text", "text": "ABCDEFGHIJKLMNOPQRSTUVWXYZ"}],
  "stop_reason": "end_turn",
  "stop_sequence": null,
  ...
}
Expected (per the Anthropic Messages API):
{
  "content": [{"type": "text", "text": "ABCDEFGHIJKLMNOPQRSTUVWXYZ"}],
  "stop_reason": "stop_sequence",
  "stop_sequence": "END",
  ...
}
The single-token "END" is correctly stripped, but stop_reason/stop_sequence are wrong. With a stop sequence that tokenizes into multiple tokens, the leading bytes of the sequence additionally leak into content[0].text.
Root cause
Bugs 1 & 2 — reporting. When a stop sequence matches, the generator collapses it into the generic "stop" finish reason (src/exo/worker/engines/mlx/generator/generate.py, in the stop-sequence loop) and the matched string is discarded. The Claude adapter then maps it:
# src/exo/api/adapters/claude.py (finish_reason_to_claude_stop_reason)
mapping: dict[FinishReason, ClaudeStopReason] = {
    "stop": "end_turn",  # <- always end_turn; "stop_sequence" is never produced
    ...
}
"stop_sequence" is a valid ClaudeStopReason but is unreachable, and neither collect_claude_response nor generate_claude_stream ever sets the stop_sequence field (it defaults to None). There is no way to distinguish a stop-sequence stop from a natural EOS because both arrive as finish_reason == "stop".
Bug 3 — multi-token leak. Generated text is emitted one token at a time, and the stop check is a substring test over the accumulated text:
# generate.py (and the equivalent in batch_generate.py via potential_stop_sequence_text)
if stop_seq in accumulated_text:
...
chunk_start = len(accumulated_text) - len(out.text)
text = text_before_stop[chunk_start:] # trims only the CURRENT chunk
If "END" arrives as "E" then "ND", the "E" token is emitted as ordinary text before "END" is ever present in accumulated_text. When the final token completes the match, the trim only affects the current chunk and cannot retract the already-emitted "E". So the leading bytes of any multi-token stop sequence leak into output. The same incremental-emit flaw exists in the batch path.
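The leak can be reproduced without a model. The sketch below is a simplified stand-in for the per-token loop described above, not the actual generate.py code: naive_stream is a hypothetical name, and the token splits are chosen by hand to mimic a tokenizer that breaks "END" into "E" and "ND".

```python
from typing import List


def naive_stream(tokens: List[str], stop_seq: str) -> str:
    """Emit tokens one at a time, trimming only the chunk that completes a match."""
    accumulated_text = ""
    emitted: List[str] = []
    for token_text in tokens:
        accumulated_text += token_text
        if stop_seq in accumulated_text:
            text_before_stop = accumulated_text[: accumulated_text.index(stop_seq)]
            # Offset of the current chunk inside the accumulated text.
            chunk_start = len(accumulated_text) - len(token_text)
            # Only the current chunk is trimmed; earlier chunks are already out.
            emitted.append(text_before_stop[chunk_start:])
            break
        emitted.append(token_text)
    return "".join(emitted)


single = naive_stream(["XYZ", "END"], "END")     # stop sequence arrives as one token
split = naive_stream(["XYZ", "E", "ND"], "END")  # stop sequence spans two tokens
```

Here single is "XYZ", but split is "XYZE": the "E" was emitted before the match existed and is never retracted.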
Test coverage gap
There is currently no test that drives a stop sequence end-to-end. test_claude_api.py::test_stop_maps_to_end_turn asserts the mapping in isolation, and no test exercises the generator's stop-sequence matching or asserts the response stop_sequence field. As a result, none of the three bugs is caught by the existing tests.
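The assertions an end-to-end test would need can be sketched as a small checker over the response body. This is only an illustration: check_stop_sequence_response is a hypothetical helper, not part of the existing test suite, and its leak check is a heuristic that assumes the legitimate output does not itself end with a prefix of the stop sequence (true for the alphabet prompt above).

```python
from typing import Any, Dict, List


def check_stop_sequence_response(response: Dict[str, Any], stop: str) -> List[str]:
    """Return a list of problems found in a /v1/messages response body."""
    problems: List[str] = []
    # Bug 1: stop_reason should report the stop sequence, not end_turn.
    if response.get("stop_reason") != "stop_sequence":
        problems.append(f"stop_reason is {response.get('stop_reason')!r}")
    # Bug 2: the matched string should be echoed back.
    if response.get("stop_sequence") != stop:
        problems.append(f"stop_sequence is {response.get('stop_sequence')!r}")
    # Bug 3 (heuristic): the text should not end with a leading piece of the stop.
    text = "".join(
        block.get("text", "")
        for block in response.get("content", [])
        if block.get("type") == "text"
    )
    for k in range(len(stop), 0, -1):
        if text.endswith(stop[:k]):
            problems.append(f"text ends with leaked prefix {stop[:k]!r}")
            break
    return problems
```

Run against the observed response from the reproduction, the checker reports two problems (stop_reason and stop_sequence); against the expected response it reports none.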
Affected scope
- /v1/messages (Claude Messages API), both streaming and non-streaming.
- OpenAI (/v1/chat/completions) and Ollama endpoints are affected by Bug 3 only; for them "stop" is the correct finish reason, so their reporting is fine.
A fix is in progress (threading the matched sequence through, plus a streaming-safe hold-back scanner for the multi-token case) and will be linked as a PR.
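To illustrate the hold-back idea, here is a minimal sketch of a streaming-safe scanner. It is not the code from the pending PR: the class name HoldBackScanner and its feed/flush/matched interface are hypothetical. The scanner withholds any trailing text that could still grow into a stop sequence and releases it only once the match is ruled out.

```python
from typing import List, Optional


class HoldBackScanner:
    """Withhold text that might be the start of a stop sequence."""

    def __init__(self, stop_sequences: List[str]) -> None:
        self.stops = [s for s in stop_sequences if s]
        self.pending = ""                    # text received but not yet released
        self.matched: Optional[str] = None   # the stop sequence that fired, if any

    def feed(self, chunk: str) -> str:
        """Add a decoded chunk; return the text that is now safe to emit."""
        if self.matched is not None:
            return ""
        self.pending += chunk
        # Look for the earliest complete stop sequence in the pending text.
        best_index, best_stop = -1, None
        for stop in self.stops:
            index = self.pending.find(stop)
            if index != -1 and (best_stop is None or index < best_index):
                best_index, best_stop = index, stop
        if best_stop is not None:
            safe = self.pending[:best_index]
            self.pending, self.matched = "", best_stop
            return safe
        # No match yet: hold back the longest suffix that is a proper prefix
        # of some stop sequence, and release everything before it.
        hold = 0
        for stop in self.stops:
            for k in range(min(len(stop) - 1, len(self.pending)), 0, -1):
                if self.pending.endswith(stop[:k]):
                    hold = max(hold, k)
                    break
        cut = len(self.pending) - hold
        safe, self.pending = self.pending[:cut], self.pending[cut:]
        return safe

    def flush(self) -> str:
        """Release held text when generation ends without a match."""
        safe, self.pending = self.pending, ""
        return safe
```

Feeding "XYZ", "E", "ND" with stop sequence "END" releases only "XYZ" and sets matched to "END", which is also the value the Claude adapter would need for the stop_sequence field. The cost of this design is latency on the held suffix: at most len(stop) - 1 characters are delayed by one chunk.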