
fix: terminate infinite retry loop in RunSkillScriptTool on SCRIPT_NOT_FOUND#6

Open
Raman369AI wants to merge 1 commit into main from fix/skill-script-not-found-infinite-loop

Conversation

@Raman369AI
Owner

Fixes google#5684

Companion to google#5651 — closes the same defect class for RunSkillScriptTool.

Summary

RunSkillScriptTool.run_async returns SCRIPT_NOT_FOUND as a structured soft-error string when a script path passed by the LLM does not exist in the skill's scripts/ directory. Because the response is a normal tool result (not an exception or terminal signal), the LLM treats it as transient and retries — and in practice it hallucinates a different plausible script path on each retry (e.g. scripts/setup.py, scripts/build.py, scripts/run.sh). Nothing in RunSkillScriptTool tracks failures across paths, so the loop continues until RunConfig.max_llm_calls (default 500) is exhausted.
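The runaway pattern can be illustrated with a toy loop (a sketch, not ADK code: the fake tool and path generator are stand-ins for the real model/tool interaction):

```python
import itertools

MAX_LLM_CALLS = 500  # RunConfig default, per the description above

def fake_tool(path: str) -> str:
    # The soft error: a normal string result that looks transient to the model.
    return "SCRIPT_NOT_FOUND"

# A toy "model" that hallucinates a fresh plausible path every turn.
paths = (f"scripts/step_{i}.py" for i in itertools.count())

calls = 0
for path in paths:
    calls += 1
    if fake_tool(path) != "SCRIPT_NOT_FOUND":
        break  # never happens: every hallucinated path misses
    if calls >= MAX_LLM_CALLS:
        break  # the only backstop before this fix

# calls reaches 500: nothing short of the global cap stops the loop
```

Because each retry uses a *different* path, any per-path bookkeeping would reset on every turn, which is why the fix below counts misses across all paths.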

This is the same defect mode as LoadSkillResourceTool (google#5652 / google#5651), in the same file, with the same fix shape. Filing as a separate PR for a clean history.

Why this matters

The four conditions that made the resource-loop reachable apply identically to scripts:

  • No script manifest at L2 — the load_skill response intentionally omits a list of available scripts (progressive-disclosure spec). The LLM infers script filenames from prose in SKILL.md, and inferred filenames like scripts/setup.py or scripts/build.sh are routinely wrong.
  • Soft error string — SCRIPT_NOT_FOUND looks transient to the model.
  • No terminal signal — nothing escalates after the first miss.
  • max_llm_calls=500 is the only backstop — a global reasoning cap, not a defense against a known failure mode on one tool.

What changed

Two layers in src/google/adk/tools/skill_toolset.py, plus tests:

1. Code: invocation-scoped failure counter in RunSkillScriptTool.run_async

A total SCRIPT_NOT_FOUND count (across all paths) is tracked on tool_context.state under:

temp:_adk_skill_script_not_found_count_<invocation_id>
  • The temp: prefix uses ADK's existing convention so the value is trimmed from the persisted event delta and never reaches durable session storage.
  • The <invocation_id> suffix ensures correctness on in-memory session backends, where temp: keys are not auto-cleared between invocations.
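As a sketch of the key construction (the helper name and constant are assumptions for illustration; only the `temp:` prefix and the invocation-id suffix come from this description):

```python
# Build the invocation-scoped state key for SCRIPT_NOT_FOUND misses.
# "temp:" marks the key as transient so ADK trims it from the persisted
# event delta; the invocation-id suffix isolates invocations that share
# an in-memory state dict.
_SCRIPT_NOT_FOUND_KEY_PREFIX = "temp:_adk_skill_script_not_found_count_"

def script_not_found_key(invocation_id: str) -> str:
    """Return the per-invocation counter key for script-not-found misses."""
    return _SCRIPT_NOT_FOUND_KEY_PREFIX + invocation_id

key = script_not_found_key("inv-123")
```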

Behavior:

  • First SCRIPT_NOT_FOUND within an invocation (any path) → returns SCRIPT_NOT_FOUND (unchanged).
  • Any subsequent miss in the same invocation → returns the new SCRIPT_NOT_FOUND_FATAL error code with an explicit "do not retry — report the error and stop" message, including the failure number.

The guard is invocation-scoped, path-agnostic (catches hallucinated variants), and adds no overhead on the success path. The counter key is independent of the resource counter introduced in google#5651 — a single miss on each tool does not immediately escalate.
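The escalation logic can be sketched roughly as follows, using a plain dict in place of `tool_context.state`; everything except the two error codes is an assumed name, not the actual ADK implementation:

```python
SCRIPT_NOT_FOUND = "SCRIPT_NOT_FOUND"
SCRIPT_NOT_FOUND_FATAL = "SCRIPT_NOT_FOUND_FATAL"

def record_miss(state: dict, invocation_id: str) -> str:
    """Count a script-not-found miss and choose the error code to return."""
    key = f"temp:_adk_skill_script_not_found_count_{invocation_id}"
    count = state.get(key, 0) + 1
    state[key] = count
    if count == 1:
        return SCRIPT_NOT_FOUND  # first miss: unchanged soft error
    # Any later miss in the same invocation escalates, regardless of
    # which path the model tried this time.
    return (
        f"{SCRIPT_NOT_FOUND_FATAL}: failure #{count} in this invocation. "
        "Do not retry or guess another script path; report the error and stop."
    )

state: dict = {}
first = record_miss(state, "inv-1")   # soft error, unchanged
second = record_miss(state, "inv-1")  # escalates to the fatal code
fresh = record_miss(state, "inv-2")   # a new invocation starts at zero
```

Note the success path never touches the counter, so the guard adds no work when the script path is valid.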

2. Prompt: no-retry rule for run_skill_script

A new rule 5 in _DEFAULT_SKILL_SYSTEM_INSTRUCTION:

"If run_skill_script returns an error (for example SCRIPT_NOT_FOUND), do not retry the same script or guess a different script path. Report the error to the user and stop."

Why both layers

Defense-in-depth, identical reasoning to google#5651: code-only termination produces confusing downstream behavior when the LLM has no semantic reason to stop; prompt-only termination relies on the LLM following the rule, which imperfect upstream prompts can override.

Considered and rejected

| Alternative | Why not |
| --- | --- |
| Per-path retry guard | LLM hallucinates a different script path on every retry — a per-path list never escalates |
| Tighten or default-lower max_llm_calls | Caps overall reasoning budget; punishes legitimate long-running tasks |
| Symptomatic after_tool_callback workaround | Pushes the fix onto every SkillToolset user |
| Add an available_scripts manifest to L2 load_skill | Defeats the lazy-loading / token-saving design |
| Introduce a list_skill_scripts tool | Violates the L1→L2→L3 progressive disclosure contract |
| Share one counter with load_skill_resource | Two distinct tools, two distinct hallucination surfaces; separate counters keep the failure semantics aligned per tool |

Test plan

New tests in tests/unittests/tools/test_skill_toolset.py:

  • test_execute_script_repeated_failure_escalates_to_fatal — second call on the same missing path within an invocation returns SCRIPT_NOT_FOUND_FATAL.
  • test_execute_script_different_path_also_escalates_to_fatal — second call on a different missing path also returns SCRIPT_NOT_FOUND_FATAL; proves the counter is path-agnostic and catches hallucinated variants.
  • test_execute_script_failures_isolated_per_invocation — failures from invocation A do not escalate the first attempt in invocation B even when sharing the same session state dict.
  • test_execute_script_counter_uses_temp_prefix — every key written by the guard starts with temp: so it is trimmed from the persisted event delta.
  • test_execute_script_script_not_found — tightened: first miss returns the unchanged SCRIPT_NOT_FOUND code and message.
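The shape of the path-agnostic and per-invocation tests can be sketched like this; `FakeToolContext` and `run_script` are placeholder stand-ins for the real fixtures and tool, not the actual test code:

```python
class FakeToolContext:
    """Minimal stand-in carrying an invocation id and a shared state dict."""
    def __init__(self, invocation_id: str, state: dict):
        self.invocation_id = invocation_id
        self.state = state

def run_script(ctx: FakeToolContext, path: str) -> str:
    """Simulate the tool's behavior for a script path that does not exist."""
    key = f"temp:_adk_skill_script_not_found_count_{ctx.invocation_id}"
    ctx.state[key] = ctx.state.get(key, 0) + 1
    return "SCRIPT_NOT_FOUND" if ctx.state[key] == 1 else "SCRIPT_NOT_FOUND_FATAL"

shared_state: dict = {}
ctx_a = FakeToolContext("inv-a", shared_state)
assert run_script(ctx_a, "scripts/setup.py") == "SCRIPT_NOT_FOUND"
# A *different* hallucinated path still escalates: the counter is
# per-invocation, not per-path.
assert run_script(ctx_a, "scripts/build.sh") == "SCRIPT_NOT_FOUND_FATAL"
# A new invocation sharing the same state dict starts fresh.
ctx_b = FakeToolContext("inv-b", shared_state)
assert run_script(ctx_b, "scripts/setup.py") == "SCRIPT_NOT_FOUND"
```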

Verification:

  • All 95 tests in test_skill_toolset.py pass (90 pre-existing + 5 added/tightened), 0 regressions.
  • pyink --check clean on both modified files.
  • isort --check-only clean on both modified files.
  • mypy src/google/adk/tools/skill_toolset.py reports the same 19 pre-existing errors as main; zero new errors introduced.

Backwards compatibility

  • The first-failure behavior is unchanged: same SCRIPT_NOT_FOUND error code, same error string. Existing callers see no difference.
  • The new SCRIPT_NOT_FOUND_FATAL code is purely additive.
  • The new state key uses the temp: prefix and is guaranteed not to leak into persisted session storage.
  • The new prompt rule is additive; the prior four rules are unchanged.

Relationship to google#5651

This PR is logically independent and merges cleanly against current main. If google#5651 lands first, the prompts will need a trivial merge (both PRs append a new rule to the same instruction string); the code paths are in different methods and do not conflict.

…T_FOUND

Mirrors the LoadSkillResourceTool fix (google#5651): when the LLM hallucinates a
script path that does not exist in the skill's scripts/ directory,
RunSkillScriptTool returns SCRIPT_NOT_FOUND as a soft error and the model
retries with a different plausible path each turn. Nothing escalates between
attempts, so the loop terminates only when RunConfig.max_llm_calls (default
500) is exhausted.

Adds an invocation-scoped failure counter under
temp:_adk_skill_script_not_found_count_<invocation_id>. The first miss within
an invocation still returns SCRIPT_NOT_FOUND (unchanged); any subsequent miss
returns the new SCRIPT_NOT_FOUND_FATAL with an explicit "do not retry, report
and stop" message and the failure count. The counter is path-agnostic, so it
fires even when the model hallucinates a different script path on each retry.

The temp: prefix keeps the key out of durable session storage; the
invocation_id suffix isolates in-memory backends where temp: keys are not
auto-cleared between invocations.

The default system prompt also gains a no-retry rule for run_skill_script
errors so the model has a semantic reason to stop, complementing the
code-level guard.

Tests cover: first-miss soft error preserved, 2nd-miss-same-path escalates,
2nd-miss-different-path escalates (path-agnostic), per-invocation isolation,
counter key uses temp: prefix.

@gemini-code-assist (bot) left a comment


Code Review

This pull request updates the project version to 1.33.0, introduces a SkillRegistry interface for dynamic skill discovery, and enhances the SkillToolset with a turn-scoped cache. It adds support for PKCE in OAuth2 authentication, makes truncation limits configurable in environment tools, and refactors CLI onboarding logic. Telemetry is improved by updating semantic conventions and streamlining metric attributes. Numerous samples and tests are updated to use newer Gemini models, and various bug fixes are included across A2A, session management, and code execution components. Feedback suggests making the hardcoded cache size in SkillToolset configurable to better handle high-concurrency server environments.

```python
        str,
        dict[str, models.Skill | asyncio.Future[models.Skill | None] | None],
    ] = {}
    self._max_cache_turns = 16
```

Severity: medium

The _max_cache_turns limit is hardcoded to 16. In a server environment where a single SkillToolset instance is shared across many concurrent users, this small cache size will lead to frequent evictions and redundant registry calls. Consider increasing this default or making it configurable to better support high-concurrency scenarios.

@Raman369AI Raman369AI closed this May 13, 2026
@Raman369AI Raman369AI reopened this May 13, 2026


Development

Successfully merging this pull request may close these issues.

[Bug] RunSkillScriptTool retries SCRIPT_NOT_FOUND indefinitely; default max_llm_calls=500 is the only backstop
