[Bug] LoadSkillResourceTool retries RESOURCE_NOT_FOUND indefinitely; default max_llm_calls=500 is the only backstop

## 🔴 Required Information

**Describe the Bug:**

`LoadSkillResourceTool.run_async` returns `RESOURCE_NOT_FOUND` as a structured soft error when a path passed by the LLM does not exist inside the skill's bundled resources. Because the response is a normal tool result (not an exception or terminal signal), the LLM treats it as a transient, recoverable failure and retries. Critically, **it hallucinates a different plausible path on every retry** rather than repeating the same one. Nothing in `SkillToolset` tracks total failures across paths, so the loop continues until `RunConfig.max_llm_calls` is exhausted.

`max_llm_calls` defaults to **500** (`src/google/adk/agents/run_config.py:314`). A single invocation can therefore silently spend its entire call budget on repeated failing tool calls before the framework intervenes. `max_llm_calls` is a global cap on legitimate reasoning, not a defense against a repeated-failure loop on one specific tool.

**Steps to Reproduce:**

1. Install `google-adk` (any version that ships `SkillToolset` — verified on `1.32.0`).
2. Create an agent with a `SkillToolset` containing a skill whose `SKILL.md` references files by natural-language names (e.g. "Document 1", "the reference guide") without exact filenames.
3. Issue a query that prompts the model to read one of those resources.
4. Observe in the trace that the model calls `load_skill_resource` with a hallucinated path, receives `RESOURCE_NOT_FOUND`, then calls it again with a **different** hallucinated path, receives `RESOURCE_NOT_FOUND` again, and loops.

**Expected Behavior:**

After the first `RESOURCE_NOT_FOUND` within an invocation, any subsequent `load_skill_resource` failure should return a terminal error code that unambiguously instructs the LLM to stop retrying and report the error. The agent's overall reasoning budget (`max_llm_calls`) should not be the only thing standing between an imperfect prompt and a runaway invocation.

**Observed Behavior:**

The same `RESOURCE_NOT_FOUND` soft error is returned on every attempt regardless of path or how many times it has already failed. The loop terminates only when `max_llm_calls` is exceeded.

**Live trace evidence** (captured via `GET /debug/trace/session/{session_id}` against `adk web`):

```
SPAN: execute_tool load_skill_resource
  args:       {'file_path': 'references/reference_doc.md', 'skill_name': 'document-classifier'}
  error_code: RESOURCE_NOT_FOUND
  error:      Resource 'references/reference_doc.md' not found in skill 'document-classifier'.

SPAN: execute_tool load_skill_resource
  args:       {'skill_name': 'document-classifier', 'file_path': 'references/Document1.md'}
  error_code: RESOURCE_NOT_FOUND
  error:      Resource 'references/Document1.md' not found in skill 'document-classifier'.
```

The model tried `references/reference_doc.md` first, then hallucinated a completely different path (`references/Document1.md`) on the retry. Both calls returned the same soft error, so the LLM had no signal to stop. The pattern repeats until `max_llm_calls` is exceeded.
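To make the failure mode concrete, here is a minimal, framework-free simulation. It does not use ADK: `fake_load_skill_resource` and `simulate_invocation` are hypothetical stand-ins, and it assumes the model guesses a new path on every turn (as in the trace above) and makes exactly one tool call per LLM call.

```python
import itertools

MAX_LLM_CALLS = 500  # default cap cited in this report


def fake_load_skill_resource(file_path: str) -> dict:
    """Stand-in for the tool: every path is missing, same soft error each time."""
    return {
        "error_code": "RESOURCE_NOT_FOUND",
        "error": f"Resource '{file_path}' not found in skill 'demo'.",
    }


def simulate_invocation(max_llm_calls: int = MAX_LLM_CALLS) -> int:
    """Count the tool calls made before the call cap ends the invocation."""
    tool_calls = 0
    for llm_call in itertools.count(1):
        if llm_call > max_llm_calls:
            break  # the cap is the only thing that stops the loop
        # Assumed model behavior: guess a fresh path on every turn.
        result = fake_load_skill_resource(f"references/guess_{llm_call}.md")
        tool_calls += 1
        if result["error_code"] != "RESOURCE_NOT_FOUND":
            break  # never reached: the response never changes
    return tool_calls


print(simulate_invocation())  # 500
```

Under these assumptions every call in the budget is spent on a failing lookup, because nothing in the tool response ever changes.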

**Environment Details:**

- ADK Library Version: `1.32.0` (issue exists on `main` as of commit `2d61cb69`)
- Desktop OS: Linux (defect is in framework logic, not OS-specific)
- Python Version: `3.12.3`

**Model Information:**

- Are you using LiteLLM: N/A (defect is provider-agnostic)
- Which model: `gemini-3-flash-preview` (observed; reproducible across any function-calling model — the retry behavior is a consequence of the soft error signal, not model-specific)

---

## 🟡 Optional Information

**Regression:**

Not a regression. The defect has existed since `SkillToolset` was introduced — `LoadSkillResourceTool.run_async` has never had any retry-guard logic.

**Additional Context:**

Four factors combine to make this loop reachable through ordinary use:

1. **No resource manifest at L2** — the `load_skill` response intentionally omits available file paths (progressive-disclosure spec). The LLM must infer paths from prose, and inferred paths are routinely wrong.
2. **Soft error string** — `RESOURCE_NOT_FOUND` looks transient and recoverable to the model; retry is its default response.
3. **No terminal signal** — nothing escalates after the first miss.
4. **No scope boundary in default prompt** — the system instruction doesn't distinguish skill-bundled files from runtime user inputs (e.g. a PDF the user is processing), so the model sometimes routes runtime documents through `load_skill_resource` and loops on them.

Considered and rejected alternatives:

| Alternative | Why not |
|---|---|
| Per-path retry guard | LLM hallucinates a different path on each retry (confirmed in live trace), so a per-path guard never triggers |
| Tighten or default-lower `max_llm_calls` | Caps overall reasoning budget; punishes legitimate long-running agents |
| User-side `after_tool_callback` workaround | Symptomatic; pushes the fix onto every `SkillToolset` user |
| Add `available_resources` manifest to L2 `load_skill` | Defeats the lazy-loading / token-saving design |
| New `list_skill_resources` tool | Violates the L1→L2→L3 progressive disclosure contract |
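
The first row of the table can be illustrated with a small comparison. This is only a sketch: `per_path_guard_trips` and `total_counter_trips` are hypothetical helpers written for this report, not ADK code, and the input is the pair of failed paths from the live trace.

```python
def per_path_guard_trips(paths: list[str]) -> bool:
    """Hypothetical per-path guard: trips only when the same path fails twice."""
    seen: set[str] = set()
    for path in paths:
        if path in seen:
            return True
        seen.add(path)
    return False


def total_counter_trips(paths: list[str], limit: int = 1) -> bool:
    """Hypothetical total counter: trips once failures exceed `limit`, on any path."""
    return len(paths) > limit


# The two paths from the live trace in this report; both lookups failed.
failed_paths = ["references/reference_doc.md", "references/Document1.md"]

print(per_path_guard_trips(failed_paths))  # False: no path was repeated
print(total_counter_trips(failed_paths))   # True: second failure overall
```

Because the model never repeats a path, the per-path guard stays silent while the total counter trips on the second failure.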

**Minimal Reproduction Code:**

```python
import asyncio
from unittest import mock
from google.adk.skills import models
from google.adk.tools import skill_toolset, tool_context

skill = mock.create_autospec(models.Skill, instance=True)
skill.name = "demo"
skill.resources = mock.MagicMock()
skill.resources.get_reference.return_value = None  # every path "missing"

ctx = mock.MagicMock(spec=tool_context.ToolContext)
ctx.state = {}
ctx.invocation_id = "inv1"
ctx._invocation_context = mock.MagicMock()
ctx.agent_name = "agent"

toolset_obj = skill_toolset.SkillToolset([skill])
tool = skill_toolset.LoadSkillResourceTool(toolset_obj)

async def main():
    paths = [
        "references/missing.md",
        "references/other_guess.md",   # different path — LLM hallucination pattern
        "references/yet_another.md",
    ]
    for i, path in enumerate(paths):
        r = await tool.run_async(
            args={"skill_name": "demo", "file_path": path},
            tool_context=ctx,
        )
        print(i, r["error_code"])
    # On main (unpatched): all 3 print RESOURCE_NOT_FOUND — LLM has no reason to stop
    # With fix applied:    call 0 → RESOURCE_NOT_FOUND, calls 1-2 → RESOURCE_NOT_FOUND_FATAL

asyncio.run(main())
```

**How often has this issue occurred?:** Always (100%) — deterministic given any skill whose `SKILL.md` lets the model infer plausible-looking paths that don't literally exist.

---

## Proposed Fix

A two-layer fix is in linked PR #5651:

**Code**: an invocation-scoped **total failure counter** inside `LoadSkillResourceTool.run_async`. The counter tracks the number of `RESOURCE_NOT_FOUND` responses across **all paths** within an invocation (not per-path — live testing confirmed the LLM uses a different path on each retry). State key:

```
temp:_adk_skill_resource_not_found_count_<invocation_id>
```

- First failure → `RESOURCE_NOT_FOUND` (unchanged behavior).
- Any subsequent failure → `RESOURCE_NOT_FOUND_FATAL` with an explicit stop instruction and failure count.

The `temp:` prefix follows ADK's existing convention for state that must not be persisted to durable storage. The `<invocation_id>` suffix keeps counts from different invocations separate on in-memory backends, where `temp:` keys are not auto-cleared between invocations.
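
A minimal sketch of the counter logic described above, using a plain `dict` in place of the tool context's state. This is an illustration, not the code in the PR: `record_not_found` and `_KEY_PREFIX` are hypothetical names, and the fatal message wording is modeled on the patched trace shown below.

```python
_KEY_PREFIX = "temp:_adk_skill_resource_not_found_count_"


def record_not_found(
    state: dict, invocation_id: str, skill_name: str, file_path: str
) -> dict:
    """Bump the invocation-scoped failure count and build the tool response."""
    key = f"{_KEY_PREFIX}{invocation_id}"
    count = state.get(key, 0) + 1  # counts failures across all paths
    state[key] = count

    message = f"Resource '{file_path}' not found in skill '{skill_name}'."
    if count == 1:
        # First miss in this invocation: unchanged soft error.
        return {"error_code": "RESOURCE_NOT_FOUND", "error": message}
    # Any later miss: terminal error with an explicit stop instruction.
    return {
        "error_code": "RESOURCE_NOT_FOUND_FATAL",
        "error": (
            f"{message} This is resource lookup failure #{count} this "
            "invocation. Do not retry any path; report the error to the "
            "user and stop."
        ),
    }


state: dict = {}
first = record_not_found(state, "inv1", "demo", "references/missing.md")
second = record_not_found(state, "inv1", "demo", "references/other_guess.md")
fresh = record_not_found(state, "inv2", "demo", "references/missing.md")
print(first["error_code"], second["error_code"], fresh["error_code"])
# RESOURCE_NOT_FOUND RESOURCE_NOT_FOUND_FATAL RESOURCE_NOT_FOUND
```

The third call uses a different invocation id, so it starts from a fresh count even though the shared `state` dict was never cleared. That is the isolation the key suffix is meant to provide.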

**Prompt**: a no-retry rule and a scope boundary added to `_DEFAULT_SKILL_SYSTEM_INSTRUCTION`.

**Live trace with fix applied** (same session, patched build):

```
SPAN: execute_tool load_skill_resource
  args:       {'file_path': 'references/reference_doc.md', 'skill_name': 'document-classifier'}
  error_code: RESOURCE_NOT_FOUND
  error:      Resource 'references/reference_doc.md' not found in skill 'document-classifier'.

SPAN: execute_tool load_skill_resource
  args:       {'skill_name': 'document-classifier', 'file_path': 'references/Document1.md'}
  error_code: RESOURCE_NOT_FOUND_FATAL
  error:      Resource 'references/Document1.md' not found in skill 'document-classifier'.
              This is resource lookup failure #2 this invocation. Do not retry any path
              — report the error to the user and stop.
```

Loop terminated on the second call. The model attempted a different path (`Document1.md` vs `reference_doc.md`) — exactly the hallucination pattern that a per-path guard would have missed.

Linked PR: google/adk-python#5651
