
fix(responses): drain streaming chunks in background task handler #2114

Open
glaziermag wants to merge 1 commit into EricLBuehler:master from glaziermag:fix/responses-stream-panic-clean

Conversation

@glaziermag glaziermag commented Apr 16, 2026

Contributor

Note

Agent 4 A100 validation update (2026-05-13 UTC): classification ACTUAL, feasibility FEASIBLE_NOW. On A100 base 2d4ba4f16f61e5e18be085d0dd137bc95cba038a, the exact /v1/responses request with background=true and stream=true failed when polled, returning unknown_error: Unexpected response type. On PR head 61219e17256b67920e56d630a8c63b89b1e81be2, the same request completed with output_text:"4" and background:true. Controls passed: background=true with stream=false; background=false with stream=true; and normal /v1/chat/completions. A server log scan for error|panic|channel_closed|failed had no matches. “Fixes #1945” is honest for this path.


Fix: Force Non-Streaming for Background Request Generation

This PR forces stream=false for background tasks inside the Responses API handler (/v1/responses with background=true) to prevent the generator from hanging indefinitely while waiting for an HTTP stream that does not exist.

Root Cause

When background=true is set in a Responses API request, the handler spawns a background task that expects a terminal Response::Done. If the upstream request includes stream=true, the engine can send chunk events instead of the non-streaming terminal shape expected by that background task.

Fix

  1. Force oairequest.stream = Some(false) before parse_openresponses_request in the if background { branch.
  2. Drain any stray Response::Chunk events and only complete the task on Response::Done.
  3. If the response channel closes before a terminal message, mark the background task failed instead of leaving it stuck in progress.
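
Steps 2 and 3 above can be sketched in isolation as a small receive loop. This is a minimal sketch under stated assumptions, not the code in this PR: `Response`, `TaskStatus`, and `drive_background_task` are hypothetical stand-ins for the real types, and the standard library's blocking channel is used here in place of whatever async channel the server uses.

```rust
use std::sync::mpsc::Receiver;

// Hypothetical stand-in for the engine's response messages. The real
// enum in the server has more variants and richer payloads.
#[derive(Debug)]
enum Response {
    Chunk(String), // partial streaming output
    Done(String),  // terminal, non-streaming result
}

// Hypothetical final state recorded for a background task.
#[derive(Debug, PartialEq)]
enum TaskStatus {
    Completed(String),
    Failed(String),
}

// Drain the channel: skip stray chunks, complete only on `Done`, and mark
// the task failed if every sender is dropped before a terminal message.
fn drive_background_task(rx: Receiver<Response>) -> TaskStatus {
    loop {
        match rx.recv() {
            // A chunk is not terminal, so keep waiting.
            Ok(Response::Chunk(_)) => continue,
            // The terminal message completes the task.
            Ok(Response::Done(text)) => return TaskStatus::Completed(text),
            // `recv` errors only once the channel is closed and empty.
            Err(_) => {
                return TaskStatus::Failed(
                    "channel closed before a terminal response".to_string(),
                )
            }
        }
    }
}
```

The point of the loop is that neither a chunk nor a closed channel can leave the task in a non-terminal state: each call returns either `Completed` or `Failed`.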

Runtime Validation (GCP g2-standard-8, L4 GPU, Qwen2.5-0.5B Q4_K_M GGUF)

Branch head tested: bd8ad18517e8c1e10f8fe4d6dcd6528d8afa73bf
Hardware: GCP g2-standard-8, 1x NVIDIA L4, driver 580.126.09, CUDA 12.9

Build:

cargo build --release --features cuda -p mistralrs-server

Result: build completed successfully on the L4 VM.

Server:

./target/release/mistralrs-server --port 18082 gguf \
  -m Qwen/Qwen2.5-0.5B-Instruct-GGUF \
  -f qwen2.5-0.5b-instruct-q4_k_m.gguf

Server log included:

Model loaded.
git revision: bd8ad18517e8c1e10f8fe4d6dcd6528d8afa73bf
Dummy run completed in 7.515720829s.
OpenAI-compatible server listening on http://0.0.0.0:18082.

Exact bug-trigger request: background=true + stream=true via /v1/responses:

curl -sS -X POST http://127.0.0.1:18082/v1/responses \
  -H "Content-Type: application/json" \
  -d '{"model":"default","input":"What is 2+2? Answer with just the number.","background":true,"stream":true}'

Initial response:

{"id":"resp_ffa42766-f0cc-4935-8c72-8c3c355ab22d","object":"response","created_at":1777337731,"model":"default","status":"queued","output":[],"metadata":null}

Poll command:

curl -sS http://127.0.0.1:18082/v1/responses/resp_ffa42766-f0cc-4935-8c72-8c3c355ab22d

Final poll response:

{"id":"resp_ffa42766-f0cc-4935-8c72-8c3c355ab22d","object":"response","created_at":1777337731,"completed_at":1777337731,"model":"Qwen/Qwen2.5-0.5B-Instruct-GGUF","status":"completed","output":[{"type":"message","id":"msg_d24e9970-4040-4fd9-b0b4-d2c16c986fa5","role":"assistant","content":[{"type":"output_text","text":"4"}],"status":"completed"}],"output_text":"4","usage":{"input_tokens":42,"output_tokens":2,"total_tokens":44},"background":true}
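
The poll step above was run by hand with curl. For a scripted check, a bounded polling loop along these lines may be useful. It is a sketch only: `poll_until_terminal` is a hypothetical helper, the `fetch_status` closure stands in for an HTTP GET of /v1/responses/{id} that returns the top-level "status" field, and treating "in_progress" as a non-terminal status alongside "queued" is an assumption (only "queued" and "completed" appear in the output recorded here).

```rust
use std::thread;
use std::time::Duration;

// Poll until the reported status is terminal or the attempt budget runs out.
// Returns Some(status) for a terminal status, or None if the task was still
// pending after `max_attempts` polls.
fn poll_until_terminal<F>(
    mut fetch_status: F,
    max_attempts: usize,
    delay: Duration,
) -> Option<String>
where
    F: FnMut() -> String,
{
    for _ in 0..max_attempts {
        let status = fetch_status();
        // Assumption: "queued" and "in_progress" are the only pending states.
        if status != "queued" && status != "in_progress" {
            return Some(status);
        }
        thread::sleep(delay);
    }
    None
}
```

Bounding the number of attempts matters for this bug in particular: a task that never leaves a pending state shows up as `None` rather than as a client that polls forever.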

Server log scan:

grep -Ei "error|panic|channel_closed|failed" /tmp/mistral_responses_1777337716.log

Result: no matches.

This validates the intended /v1/responses background + streaming request path on the PR head.

A100 validation (2026-05-01)

Additional validation on GCP Spot a2-highgpu-1g in [redacted-region] with 1x NVIDIA A100-SXM4-40GB, driver 580.126.20, branch head 61219e17256b67920e56d630a8c63b89b1e81be2.

cargo build --release --features cuda -p mistralrs-server
./target/release/mistralrs-server --port 18082 gguf \
  -m Qwen/Qwen2.5-0.5B-Instruct-GGUF \
  -f qwen2.5-0.5b-instruct-q4_k_m.gguf

Result: build passed. The exact /v1/responses bug shape with background=true and stream=true returned an initial queued response, then completed when polled with "output_text":"4" and "background":true. A normal /v1/chat/completions smoke test also returned OK. Server log scan for error|panic|channel_closed|failed had no matches.

@github-actions

github-actions Bot commented Apr 16, 2026

Code Metrics Report
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Language              Files        Lines         Code     Comments       Blanks
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 C Header                  5          305          210           52           43
 CSS                       3          281          252            5           24
 CUDA                     59        17661        13824         1637         2200
 Dockerfile                1           38           21            8            9
 HTML                      2           27           27            0            0
 JavaScript                3          392          387            2            3
 Jinja2                    7          694          656            5           33
 JSON                     25         9346         9343            0            3
 Makefile                  1            6            5            0            1
 MDX                       1          147            0          132           15
 Metal Shading Lan|       31        11647         9007         1064         1576
 PowerShell                1          300          227           30           43
 Python                  129         9969         8194          456         1319
 Shell                     2          489          331           96           62
 Plain Text                3         3723            0         2413         1310
 TOML                     27         1309         1145           36          128
 TypeScript               11         1607         1371           66          170
 YAML                      3           25           23            2            0
─────────────────────────────────────────────────────────────────────────────────
 Jupyter Notebooks         3          122           83           23           16
 |- Markdown               1           60           30           22            8
 |- Python                 1          122          113            1            8
 (Total)                              304          226           46           32
─────────────────────────────────────────────────────────────────────────────────
 Markdown                119         8232            0         5591         2641
 |- BASH                  52          491          432           34           25
 |- Dockerfile             2            5            5            0            0
 |- JSON                  16          582          582            0            0
 |- PowerShell             3            5            5            0            0
 |- Python                22          687          604            5           78
 |- Rust                  13          415          362            1           52
 |- TOML                   9          107           83            3           21
 |- YAML                   1            9            9            0            0
 (Total)                            10533         2082         5634         2817
─────────────────────────────────────────────────────────────────────────────────
 Rust                    571       245656       216375         6437        22844
 |- Markdown             379         9235          452         7653         1130
 (Total)                           254891       216827        14090        23974
─────────────────────────────────────────────────────────────────────────────────
 Svelte                   18         1831         1696           50           85
 |- CSS                    1            4            4            0            0
 |- JavaScript            18          876          727           24          125
 (Total)                             2711         2427           74          210
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Total                  1025       326405       266585        25848        33972
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

@glaziermag

Contributor Author

Housekeeping note: This branch currently bundles the CI fix from #2115 (.typos.toml, openapi_doc.rs, distributed/layers.rs). Once #2115 is merged, this branch will need a rebase onto updated master to drop the duplicate CI fix commit and resolve the resulting conflicts.

glaziermag force-pushed the fix/responses-stream-panic-clean branch from d6b3865 to 1c15df6 on April 18, 2026 02:30
glaziermag force-pushed the fix/responses-stream-panic-clean branch from 6e868fd to 763c42c on April 27, 2026 16:49

Contributor Author

Agent 6 follow-up on existing A100 validation: this remains valid. Classification: ACTUAL; feasibility: FEASIBLE_NOW. The recorded A100 before/after evidence reproduces the /v1/responses background=true + stream=true failure on base and shows current head completing successfully, so Fixes #1945 is honest for that path. Recommendation: keep open for review/merge.

@glaziermag

glaziermag commented May 18, 2026

Contributor Author

Wave 1 evidence bundle index:

PR: #2114
Linked issue: #1945
Base SHA: 2d4ba4f16f61e5e18be085d0dd137bc95cba038a
Current PR-head SHA: 61219e17256b67920e56d630a8c63b89b1e81be2
Fixed-head SHA, if changed: N/A

Exact server command recorded in the PR body:

cargo build --release --features cuda -p mistralrs-server
./target/release/mistralrs-server --port 18082 gguf \
  -m Qwen/Qwen2.5-0.5B-Instruct-GGUF \
  -f qwen2.5-0.5b-instruct-q4_k_m.gguf

The exact client path was /v1/responses with background=true and stream=true, then polling the queued response.

Environment: GCP Spot a2-highgpu-1g, zone [redacted-region], 1x NVIDIA A100-SXM4-40GB, driver 580.126.20, CUDA from nvidia-smi 13.0, Rust/Cargo from PR-body A100 run, model Qwen/Qwen2.5-0.5B-Instruct-GGUF / qwen2.5-0.5b-instruct-q4_k_m.gguf.
A100 category: A100_GPU_REQUIRED.

Base result: exact /v1/responses request with background=true and stream=true failed when polled with unknown_error: Unexpected response type.
Current PR-head result: same request completed with output_text: "4" and background: true.
Tests added/changed: Responses API background streaming handler behavior.
Tests passed: A100 server/client runtime path; controls for background=true with stream=false, background=false with stream=true, and normal /v1/chat/completions; server log scan had no error|panic|channel_closed|failed matches.
Side-effect controls: non-background streaming and background non-streaming controls passed; normal chat completions smoke passed.
Raw logs/artifacts: existing validation comment #2114 (comment) plus PR-body server/client excerpts. No separate raw log file is attached in this comment.
Remaining risks: scoped to /v1/responses with background=true and stream=true; broader Responses API behavior is not claimed.
Can say “Fixes #issue”: yes for #1945 on this path.
Safe wording: “Fixes #1945 for /v1/responses with background=true and stream=true.”
Readiness status: ready-now if existing A100 comments/PR-body excerpts are accepted as artifacts; otherwise standalone raw server/client logs remain to attach.

@glaziermag

Contributor Author

Ready for maintainer review. Evidence attached shows the original/base failure and current-head pass under the scoped issue conditions. The PR may use the Fixes #... wording already present in the body.

glaziermag force-pushed the fix/responses-stream-panic-clean branch 2 times, most recently from 4e61107 to c2545db on May 20, 2026 02:25
glaziermag force-pushed the fix/responses-stream-panic-clean branch from c2545db to e8cf506 on May 20, 2026 21:14

Development

Successfully merging this pull request may close these issues.

Responses API: background=true + stream=true fails with 'Unexpected response type'

1 participant