[BUG] Claude Messages API: stop_sequences report stop_reason=end_turn / stop_sequence=null, and multi-token stop sequences leak into output #2122

## Summary

The Claude Messages API (`/v1/messages`) mishandles `stop_sequences` in three distinct ways:

1. **`stop_reason` is never `"stop_sequence"`** — it is always reported as `"end_turn"`.
2. **The `stop_sequence` response field is never populated** — it is always `null`.
3. **Multi-token stop sequences leak their leading bytes** into the output, and are not fully stripped.

(1) and (2) affect every stop sequence, including single-token ones. (3) additionally corrupts the output text whenever a stop sequence spans more than one decoded token.

## Reproduction

```bash
curl http://localhost:52415/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<any mlx model>",
    "max_tokens": 1024,
    "messages": [{"role": "user",
      "content": [{"type": "text",
        "text": "Output the whole alphabet from A to Z, without spaces, then immediately output END"}]}],
    "stop_sequences": ["END"],
    "thinking": {"type": "disabled"}
  }'
```

Observed response:

```json
{
  "content": [{"type": "text", "text": "ABCDEFGHIJKLMNOPQRSTUVWXYZ"}],
  "stop_reason": "end_turn",
  "stop_sequence": null,
  ...
}
```

Expected (per the Anthropic Messages API):

```json
{
  "content": [{"type": "text", "text": "ABCDEFGHIJKLMNOPQRSTUVWXYZ"}],
  "stop_reason": "stop_sequence",
  "stop_sequence": "END",
  ...
}
```

The single-token `"END"` is correctly stripped, but `stop_reason`/`stop_sequence` are wrong. With a stop sequence that tokenizes into multiple tokens, the leading bytes of the sequence additionally leak into `content[0].text`.

## Root cause

**Bugs 1 & 2: reporting.** When a stop sequence matches, the generator collapses the match into the generic `"stop"` finish reason (`src/exo/worker/engines/mlx/generator/generate.py`, in the stop-sequence loop) and discards the matched string. The Claude adapter then maps `"stop"` unconditionally to `"end_turn"`:

```python
# src/exo/api/adapters/claude.py  (finish_reason_to_claude_stop_reason)
mapping: dict[FinishReason, ClaudeStopReason] = {
    "stop": "end_turn",   # <- always end_turn; "stop_sequence" is never produced
    ...
}
```

`"stop_sequence"` is a valid `ClaudeStopReason` but is unreachable, and neither `collect_claude_response` nor `generate_claude_stream` ever sets the `stop_sequence` field (it defaults to `None`). There is no way to distinguish a stop-sequence stop from a natural EOS because both arrive as `finish_reason == "stop"`.
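One way to make the two cases distinguishable is to carry the matched string alongside the finish reason and let the adapter branch on it. The following is a minimal sketch under assumptions, not the project's code: `to_claude_stop` and the `matched_stop_sequence` parameter are hypothetical names, and the `"length"` entry is assumed for illustration only.

```python
from typing import Optional, Tuple


def to_claude_stop(
    finish_reason: str,
    matched_stop_sequence: Optional[str] = None,  # hypothetical: threaded through from the generator
) -> Tuple[str, Optional[str]]:
    """Return (stop_reason, stop_sequence) for a Claude-style response."""
    # A "stop" that carries a matched string was caused by a stop sequence.
    if finish_reason == "stop" and matched_stop_sequence is not None:
        return "stop_sequence", matched_stop_sequence

    # Otherwise fall back to a plain mapping; a bare "stop" is a natural end of turn.
    mapping = {"stop": "end_turn", "length": "max_tokens"}  # assumed entries
    return mapping.get(finish_reason, "end_turn"), None


print(to_claude_stop("stop", "END"))  # stop caused by the "END" stop sequence
print(to_claude_stop("stop"))         # natural end of turn
```

With this shape, both `stop_reason` and the `stop_sequence` field come from the same source of truth, so they cannot disagree.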

**Bug 3: multi-token leak.** Generated text is emitted one token at a time, and the stop check is a substring test over the accumulated text, so a token can be released before the match it belongs to becomes detectable:

```python
# generate.py (and the equivalent in batch_generate.py via potential_stop_sequence_text)
if stop_seq in accumulated_text:
    ...
    chunk_start = len(accumulated_text) - len(out.text)
    text = text_before_stop[chunk_start:]   # trims only the CURRENT chunk
```

If `"END"` arrives as `"E"` then `"ND"`, the `"E"` token is emitted as ordinary text before `"END"` is ever present in `accumulated_text`. When the final token completes the match, the trim only affects the current chunk and cannot retract the already-emitted `"E"`. So the leading bytes of any multi-token stop sequence leak into output. The same incremental-emit flaw exists in the batch path.
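The leak can be reproduced without a model. The sketch below is a simplified stand-in for the emit loop described above, not the actual generator: `naive_stream` is a hypothetical name, and the token lists are hand-written to mimic a single-token and a two-token arrival of `"END"`.

```python
def naive_stream(tokens, stop_seq):
    """Emit each token as it arrives; on a match, trim only the current chunk."""
    accumulated = ""
    emitted = []
    for tok in tokens:
        accumulated += tok
        if stop_seq in accumulated:
            text_before_stop = accumulated[: accumulated.index(stop_seq)]
            chunk_start = len(accumulated) - len(tok)
            # Only the part of the CURRENT chunk before the stop is kept;
            # anything emitted by earlier iterations cannot be retracted.
            emitted.append(text_before_stop[chunk_start:])
            break
        emitted.append(tok)
    return "".join(emitted)


print(naive_stream(["XYZ", "END"], "END"))      # XYZ   (single-token stop: clean)
print(naive_stream(["XYZ", "E", "ND"], "END"))  # XYZE  (multi-token stop: "E" leaks)
```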

## Test coverage gap

There is currently no test that drives a stop sequence end-to-end. `test_claude_api.py::test_stop_maps_to_end_turn` asserts only the `"stop"`-to-`"end_turn"` mapping in isolation, and no test exercises the generator's stop-sequence matching or asserts the response `stop_sequence` field. As a result, all three bugs go undetected by the test suite.
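Because Bug 3 depends on how the tokenizer happens to split the stop sequence, a regression test is more robust if it does not rely on any particular tokenization. A possible building block, sketched here with hypothetical helper names (`all_chunkings`, `truncate_at_stop`), is to enumerate every way a short output can be cut into chunks and compare the streamed result against a non-streaming reference.

```python
from itertools import combinations


def all_chunkings(text):
    """Yield every way to split text into consecutive non-empty chunks."""
    n = len(text)
    for r in range(n):
        for cuts in combinations(range(1, n), r):
            bounds = (0, *cuts, n)
            yield [text[a:b] for a, b in zip(bounds, bounds[1:])]


def truncate_at_stop(text, stop_sequences):
    """Reference oracle: (text before the earliest stop, matched stop or None)."""
    hits = [(text.find(s), s) for s in stop_sequences if s and s in text]
    if not hits:
        return text, None
    idx, matched = min(hits, key=lambda h: h[0])
    return text[:idx], matched


expected = truncate_at_stop("XYZEND", ["END"])
for chunks in all_chunkings("XYZEND"):
    # A real test would feed `chunks` through the streaming path here and
    # assert that its (text, stop_sequence) result equals `expected`.
    assert "".join(chunks) == "XYZEND"
```

Keeping the test string short matters, since the number of chunkings doubles with each extra character.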

## Affected scope

- `/v1/messages` (Claude Messages API) — both streaming and non-streaming.
- OpenAI (`/v1/chat/completions`) and Ollama endpoints are affected by Bug 3 only; for them `"stop"` is the correct finish reason, so their reporting is fine.

A fix is in progress (threading the matched sequence through, plus a streaming-safe hold-back scanner for the multi-token case) and will be linked as a PR.
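For illustration, a hold-back scanner of the kind mentioned above could look like the following. This is a sketch under assumptions and not the pending PR: `StopScanner`, `feed`, and `flush` are hypothetical names. The idea is to withhold any trailing text that could still turn out to be the beginning of a stop sequence, and release it only once that possibility is ruled out.

```python
class StopScanner:
    """Streaming filter that never emits a prefix of a stop sequence early."""

    def __init__(self, stop_sequences):
        self.stops = [s for s in stop_sequences if s]
        self.buffer = ""     # text received but not yet released
        self.matched = None  # the stop sequence that fired, if any

    def _held_suffix_len(self):
        # Length of the longest buffer suffix that is a proper prefix of a stop.
        best = 0
        for s in self.stops:
            for k in range(min(len(s) - 1, len(self.buffer)), 0, -1):
                if self.buffer.endswith(s[:k]):
                    best = max(best, k)
                    break
        return best

    def feed(self, chunk):
        """Add a decoded chunk; return the text that is safe to emit now."""
        if self.matched is not None:
            return ""
        self.buffer += chunk
        hits = [(self.buffer.find(s), s) for s in self.stops if s in self.buffer]
        if hits:
            # Earliest match wins; everything from the match onward is dropped.
            idx, self.matched = min(hits, key=lambda h: h[0])
            out, self.buffer = self.buffer[:idx], ""
            return out
        cut = len(self.buffer) - self._held_suffix_len()
        out, self.buffer = self.buffer[:cut], self.buffer[cut:]
        return out

    def flush(self):
        """Release held-back text when generation ends without a match."""
        out, self.buffer = self.buffer, ""
        return out


scanner = StopScanner(["END"])
text = "".join(scanner.feed(t) for t in ["XYZ", "E", "ND"]) + scanner.flush()
print(text, scanner.matched)  # XYZ END
```

The cost of this approach is a small emission delay: at most `len(stop) - 1` characters are withheld at any time, and `scanner.matched` provides the string needed for the `stop_sequence` response field.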

