Summary
When wren-engine is unresponsive (e.g., due to CPU deadlock), wren-ai-service silently times out and returns NO_RELEVANT_SQL — the same error code used for legitimate LLM SQL generation failures. This makes it nearly impossible to distinguish an infrastructure problem from an LLM quality problem without significant manual investigation.
Impact
Users and operators are misled into debugging LLM configuration (model selection, prompts, parameters) when the actual root cause is a failed or frozen wren-engine instance. In our case, this led to multiple unnecessary model switches before the real issue was identified.
Steps to Reproduce
- Start the full WrenAI stack via Docker Compose.
- Allow `wren-engine` to enter a CPU-saturated deadlock state (observed at ~200% CPU, confirmed unresponsive via `curl`).
- Submit any natural language query from the UI.
- Observe the result: the UI shows "Clarification needed"; `wren-ai-service` logs `NO_RELEVANT_SQL`.
Debug Timeline
Symptom: Every query returns "Clarification needed" in the UI.
Step 1 — Checked wren-ai-service logs
E [wren-ai-service] ask pipeline - NO_RELEVANT_SQL: How many orders are there in total?
The log only shows the final outcome. There is no indication of why SQL generation failed.
Step 2 — Mistakenly suspected LLM quality
Switched models in order: gpt-4o-mini → gpt-4o → claude-3-haiku → back to gpt-4o-mini. All produced the same failure. Time spent: significant.
Step 3 — Checked wren-ui logs, found the actual SQL
[DEBUG] WrenEngineAdaptor - Dry run wren engine with body:
"SELECT COUNT(DISTINCT "olist_orders_dataset"."order_id") AS "total_orders" FROM "olist_orders_dataset""
The generated SQL was syntactically valid. The dry run was sent to wren-engine at 07:50:38 and the task failed at 07:51:09 — exactly 31 seconds later (matching the engine_timeout: 30 setting).
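The arithmetic behind that conclusion can be checked directly from the two log timestamps:

```python
from datetime import datetime

# Timestamps from the wren-ui log above (same day, HH:MM:SS)
sent = datetime.strptime("07:50:38", "%H:%M:%S")
failed = datetime.strptime("07:51:09", "%H:%M:%S")

delta = (failed - sent).total_seconds()
print(delta)  # 31.0 seconds: engine_timeout (30 s) plus ~1 s of overhead
```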
Step 4 — Tested wren-engine directly
docker exec wrenai-wren-ui-1 curl -s --max-time 5 http://wren-engine:8080/v1/mdl
# exit code 28 (timeout) — engine is not responding
Step 5 — Checked container resource usage
docker stats wrenai-wren-engine-1 --no-stream
# CPU: 203%, MEM: 674MB — engine is deadlocked
Resolution: Restarted wren-engine. CPU dropped to normal, the engine became responsive, and queries succeeded.
Root Cause
wren-engine entered a CPU deadlock and stopped responding to HTTP requests. wren-ai-service waited until engine_timeout (30s) and then classified the result as NO_RELEVANT_SQL with no logging of the actual cause.
Proposed Improvements
- Distinguish timeout from SQL failure: when a dry run times out or returns a connection error, log and surface a distinct error code (e.g., `ENGINE_TIMEOUT` or `ENGINE_UNAVAILABLE`) instead of `NO_RELEVANT_SQL`.
- Add explicit timeout logging in `wren-ai-service`: when the engine call fails due to a timeout, log the actual exception/reason, not just the pipeline outcome.
- Surface the error to the UI: show a different message when the engine is unreachable (e.g., "Unable to reach SQL engine") vs. when no SQL could be generated. This gives operators an actionable signal immediately.
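The first two improvements amount to a thin classification layer around the engine call. A minimal sketch, assuming a dry run made with Python's standard library: the actual service's HTTP client differs, and `ENGINE_TIMEOUT` / `ENGINE_UNAVAILABLE` are the proposed codes, not existing ones.

```python
import socket
import urllib.error
import urllib.request

# Proposed error codes from this issue; hypothetical, not in the codebase.
ENGINE_TIMEOUT = "ENGINE_TIMEOUT"
ENGINE_UNAVAILABLE = "ENGINE_UNAVAILABLE"
NO_RELEVANT_SQL = "NO_RELEVANT_SQL"

def classify_engine_failure(exc: Exception) -> str:
    """Map a transport-level exception to a distinct code instead of
    collapsing every failure into NO_RELEVANT_SQL."""
    if isinstance(exc, urllib.error.URLError):
        # urllib wraps the underlying socket error in URLError.reason
        exc = exc.reason if isinstance(exc.reason, Exception) else exc
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return ENGINE_TIMEOUT
    if isinstance(exc, (ConnectionError, OSError)):
        return ENGINE_UNAVAILABLE
    # Only genuine SQL-generation failures should fall through here.
    return NO_RELEVANT_SQL

def dry_run(sql: str, endpoint: str, timeout: float = 30.0) -> str:
    """Dry-run `sql` against the engine, logging the real cause on failure."""
    req = urllib.request.Request(endpoint, data=sql.encode(), method="POST")
    try:
        with urllib.request.urlopen(req, timeout=timeout):
            return "OK"
    except Exception as exc:
        code = classify_engine_failure(exc)
        print(f"[{code}] dry run failed: {exc!r}")  # log the exception, not just the outcome
        return code
```

With this split, the incident above would have logged `ENGINE_TIMEOUT` with the underlying exception, pointing straight at the engine rather than at the LLM.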
Environment
- `wren-ai-service` version: 0.29.0
- `wren-engine` version: 0.22.0
- `wren-ui` version: 0.32.2
- LLM: OpenRouter / `gpt-4o-mini`
- Embedder: Gemini / `gemini-embedding-001`