## Context
Two evaluation runs of `nvidia/nemotron-3-super-120b-a12b` via the LiteLLM proxy show a 63–67% conversation error rate, almost entirely caused by the model calling tool names that do not exist in OpenHands:
| Run | Benchmark | Instances | Error rate | Top errors |
|---|---|---|---|---|
| 23368683367 | SWE-bench | 47/500 | 63.8% | `str_replace`, `command`, `grep` not found |
| 23270655200 | SWT-bench | 156 conv | 66.7% | `str_replace`, `command`, `execute`, `bash`, `grep -n` not found |
Full tool-name mismatch breakdown from the SWT-bench run (156 conversations):

| Count | Tool name model called | Should have called |
|---|---|---|
| 9 | `str_replace` | `file_editor` (with `command="str_replace"`) |
| 4 | `command` | `terminal` |
| 4 | `execute` | `terminal` |
| 3 | `bash` | `terminal` |
| 3 | `grep -n` | `terminal` |
| 1 | `ls` | `terminal` |
Because every wrong-name call is an instant hard failure, `execution_status: stuck` appears 14 times (the model loops on the same call until `CONVERSATION_TIMEOUT` = 2 h). That tight retry loop also generates rapid-fire LLM requests that trigger `LLMRateLimitError` (17 occurrences), which in turn burns `MAX_RETRIES=3` against a problem that will never self-correct.
## Root cause
Nemotron-3 Super was fine-tuned on tool-use trajectories using Anthropic's `str_replace_based_edit_tool` / `bash` schema. In that schema:
- Shell execution is a standalone tool named `bash` (not `terminal`).
- File editing is a standalone tool named `str_replace` (not a sub-command of `file_editor`).
The model's behavior is correct for that schema. The problem is a pure name mismatch: the parameters it passes (`path`, `old_str`, `new_str`, `command`, `file_text`) are semantically correct and map 1-to-1 to existing OpenHands tool arguments; only the tool names differ.
## Proposed fix: a `nemotron` tool preset
Following the existing pattern (`gemini.py` exposes `read_file`/`write_file`/`edit`/`list_directory` instead of `file_editor`; `gpt5.py` exposes `apply_patch` instead of `file_editor`), add a `nemotron` preset that exposes tools under the names the model expects.
## New tools needed
### 1. `BashTool` (name: `"bash"`)

Thin wrapper around `TerminalExecutor`. Accepts a single `command: str` parameter, matching Anthropic's `bash` tool schema exactly:

```yaml
name: bash
description: Run a shell command and return stdout/stderr.
params:
  command (str, required): The shell command to execute.
```

The model will not pass `security_risk` or `summary` (those are OpenHands-specific), so the schema must not require them.
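Serialized for function calling, the schema might look like the sketch below. This is an illustration of the "only `command` is required" constraint, not the exact wire format OpenHands emits:

```python
import json

# Illustrative JSON schema for the "bash" tool as the model might see it.
# The exact OpenHands serialization may differ; the key point is that
# security_risk / summary are absent, and only "command" is required.
BASH_TOOL_SCHEMA = {
    "name": "bash",
    "description": "Run a shell command and return stdout/stderr.",
    "parameters": {
        "type": "object",
        "properties": {
            "command": {
                "type": "string",
                "description": "The shell command to execute.",
            },
        },
        "required": ["command"],
    },
}

print(json.dumps(BASH_TOOL_SCHEMA, indent=2))
```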
### 2. `StrReplaceTool` (name: `"str_replace"`)

Exposes the same operations as `FileEditorTool` under the Anthropic-compatible name. The parameter schema matches `str_replace_based_edit_tool` exactly:

```yaml
name: str_replace
description: View, create and edit plain-text files.
params:
  command (str, required): "view" | "create" | "str_replace" | "insert" | "undo_edit"
  path (str, required): Absolute path to file or directory.
  old_str (str | None): For str_replace, the exact text to find and replace.
  new_str (str | None): For str_replace/insert, the replacement or inserted text.
  file_text (str | None): For create, the full file content.
  insert_line (int | None): For insert, the line number to insert after.
  view_range (list[int] | None): For view, [start_line, end_line].
```

Internally delegates to `FileEditorExecutor` (the same backing as `FileEditorTool`).
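For clarity on the parameter contract, here is a minimal pure-Python sketch of the dispatch semantics for three of the commands (`create`, `str_replace`, `view`). It deliberately does not use `FileEditorExecutor` (which the real tool delegates to); it only models how the parameters above select behavior:

```python
from pathlib import Path


def str_replace_tool(command: str, path: str, old_str=None, new_str=None,
                     file_text=None, insert_line=None, view_range=None) -> str:
    """Sketch of the str_replace tool's dispatch; not the real implementation."""
    p = Path(path)
    if command == "create":
        p.write_text(file_text or "")
        return f"Created {path}"
    if command == "str_replace":
        text = p.read_text()
        if text.count(old_str) != 1:
            # Anthropic-style editors require old_str to match uniquely.
            return f"Error: old_str must occur exactly once in {path}"
        p.write_text(text.replace(old_str, new_str or ""))
        return f"Edited {path}"
    if command == "view":
        lines = p.read_text().splitlines()
        if view_range:
            start, end = view_range  # 1-indexed, inclusive
            lines = lines[start - 1:end]
        return "\n".join(lines)
    return f"Error: unsupported command {command!r}"
```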
## New preset file: `openhands-tools/openhands/tools/preset/nemotron.py`
"""Nemotron-3 Super preset.
Nemotron-3 Super (nvidia/nemotron-3-super-120b-a12b) was fine-tuned on
trajectories that use the Anthropic str_replace_based_edit_tool / bash
tool schema. This preset exposes those exact tool names so the model's
calls succeed without any prompt engineering or model-side changes.
bash → TerminalExecutor (model calls "bash", not "terminal")
str_replace → FileEditorExecutor (model calls "str_replace", not "file_editor")
task_tracker, finish, think — unchanged; model already calls these correctly.
"""
from openhands.sdk import Agent
from openhands.sdk.context.condenser import LLMSummarizingCondenser
from openhands.sdk.llm.llm import LLM
from openhands.sdk.tool import Tool
def get_nemotron_tools() -> list[Tool]:
from openhands.tools.nemotron import BashTool, StrReplaceTool
from openhands.tools.task_tracker import TaskTrackerTool
return [
Tool(name=BashTool.name), # "bash"
Tool(name=StrReplaceTool.name), # "str_replace"
Tool(name=TaskTrackerTool.name), # "task_tracker"
]
def get_nemotron_agent(llm: LLM, cli_mode: bool = False) -> Agent:
tools = get_nemotron_tools()
return Agent(
llm=llm,
tools=tools,
system_prompt_kwargs={"cli_mode": cli_mode},
condenser=LLMSummarizingCondenser(
llm=llm.model_copy(update={"usage_id": "condenser"}),
max_size=80, keep_first=4,
),
)Export get_nemotron_agent / get_nemotron_tools from preset/__init__.py.
## Wire into the eval pipeline
Add "nemotron" to the tool_preset choice list in the eval workflow (eval-job.yml) and in run_swebench.sh / run_swtbench.sh. Nemotron evaluations should then be dispatched with TOOL_PRESET=nemotron.
## Expected impact
| Metric | Current (default preset) | With nemotron preset |
|---|---|---|
| Tool-name errors | 53% of conversations | ~0% |
| `stuck` conversations | 9% | ~0% |
| `LLMRateLimitError` | 11% | drops sharply |
| Useful work per API call | ~0.3× | ~1× |
The eval runs resolved ~25/47 instances (~53%) on the fraction that did run. With tool names fixed, the solve rate on a full 500-instance run is unknown, but the model demonstrated sound reasoning on the instances it could execute.
## Notes
- `task_tracker`, `finish`, and `think` do not need aliasing; the model uses those names correctly in both runs.
- No system-prompt changes are needed. The fix is entirely in the tool names exposed in the schema.
- `BashTool` and `StrReplaceTool` should live in a new `openhands-tools/openhands/tools/nemotron/` directory, following the `gemini/` directory pattern.
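Assuming the new package mirrors the `gemini/` layout, the directory might look roughly like this (the individual file names here are illustrative, not confirmed against the repo):

```
openhands-tools/openhands/tools/nemotron/
├── __init__.py    # exports BashTool, StrReplaceTool
├── definition.py  # tool schemas under the Anthropic-compatible names
└── impl.py        # thin delegation to TerminalExecutor / FileEditorExecutor
```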
## References
- Eval issue tracking the root cause: OpenHands/evaluation#343
- Anthropic `str_replace_based_edit_tool` docs: https://docs.anthropic.com/en/docs/build-with-claude/tool-use/text-editor-tool
- Existing model-specific presets: `preset/gemini.py`, `preset/gpt5.py`