Skip to content

NO_RELEVANT_SQL Masks Wren Engine Failures #2153

@humorless

Description

@humorless

Summary

When wren-engine is unresponsive (e.g., due to CPU deadlock), wren-ai-service silently times out and returns NO_RELEVANT_SQL — the same error code used for legitimate LLM SQL generation failures. This makes it nearly impossible to distinguish an infrastructure problem from an LLM quality problem without significant manual investigation.

Impact

Users and operators are misled into debugging LLM configuration (model selection, prompts, parameters) when the actual root cause is a failed or frozen wren-engine instance. In our case, this led to multiple unnecessary model switches before the real issue was identified.

Steps to Reproduce

  1. Start the full WrenAI stack via Docker Compose.
  2. Allow wren-engine to enter a CPU-saturated deadlock state (observed at ~200% CPU, confirmed unresponsive via curl).
  3. Submit any natural language query from the UI.
  4. Observe the result: UI shows "Clarification needed"; wren-ai-service logs NO_RELEVANT_SQL.

Debug Timeline

Symptom: Every query returns "Clarification needed" in the UI.

Step 1 — Checked wren-ai-service logs

E wren-ai-service] ask pipeline - NO_RELEVANT_SQL: How many orders are there in total?

Log only shows the final outcome. No indication of why SQL generation failed.

Step 2 — Mistakenly suspected LLM quality
Switched models in order: gpt-4o-minigpt-4oclaude-3-haiku → back to gpt-4o-mini. All produced the same failure. Time spent: significant.

Step 3 — Checked wren-ui logs, found the actual SQL

[DEBUG] WrenEngineAdaptor - Dry run wren engine with body:
"SELECT COUNT(DISTINCT "olist_orders_dataset"."order_id") AS "total_orders" FROM "olist_orders_dataset""

The generated SQL was syntactically valid. The dry run was sent to wren-engine at 07:50:38 and the task failed at 07:51:09 — exactly 31 seconds later (matching the engine_timeout: 30 setting).

Step 4 — Tested wren-engine directly

docker exec wrenai-wren-ui-1 curl -s --max-time 5 http://wren-engine:8080/v1/mdl
# exit code 28 (timeout) — engine is not responding

Step 5 — Checked container resource usage

docker stats wrenai-wren-engine-1 --no-stream
# CPU: 203%, MEM: 674MB — engine is deadlocked

Resolution: Restarted wren-engine. CPU dropped to normal, engine became responsive, queries succeeded.

Root Cause

wren-engine entered a CPU deadlock and stopped responding to HTTP requests. wren-ai-service waited until engine_timeout (30s) and then classified the result as NO_RELEVANT_SQL with no logging of the actual cause.

Proposed Improvements

  1. Distinguish timeout from SQL failure: When a dry run times out or returns a connection error, log and surface a distinct error code (e.g., ENGINE_TIMEOUT or ENGINE_UNAVAILABLE) instead of NO_RELEVANT_SQL.

  2. Add explicit timeout logging in wren-ai-service: When the engine call fails due to timeout, log the actual exception/reason, not just the pipeline outcome.

  3. Surface the error to the UI: Show a different message when the engine is unreachable (e.g., "Unable to reach SQL engine") vs. when no SQL could be generated. This gives operators an actionable signal immediately.

Environment

  • wren-ai-service version: 0.29.0
  • wren-engine version: 0.22.0
  • wren-ui version: 0.32.2
  • LLM: OpenRouter / gpt-4o-mini
  • Embedder: Gemini / gemini-embedding-001

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions