
feat: add nemotron tool preset (bash + str_replace aliases for Anthropic schema compatibility) #2553

@juanmichelini

Context

Two evaluation runs of nvidia/nemotron-3-super-120b-a12b via the LiteLLM proxy show a conversation error rate of 63.8–66.7%, almost entirely caused by the model calling tool names that do not exist in OpenHands:

| Run | Benchmark | Instances | Error rate | Top errors |
|---|---|---|---|---|
| 23368683367 | SWE-bench | 47/500 | 63.8% | str_replace, command, grep not found |
| 23270655200 | SWT-bench | 156 conv | 66.7% | str_replace, command, execute, bash, grep -n not found |

Full tool-name mismatch breakdown from the SWT-bench run (156 conversations):

| Count | Tool name the model called | Should have called |
|---|---|---|
| 9 | str_replace | file_editor (with command="str_replace") |
| 4 | command | terminal |
| 4 | execute | terminal |
| 3 | bash | terminal |
| 3 | grep -n | terminal |
| 1 | ls | terminal |
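For a quick sense of where the mismatches land, the counts above can be tallied by intended target. This is only a sketch over the numbers in the table; the variable names are made up for illustration.

```python
from collections import Counter

# (count, name the model called, tool it should have called), copied from the table above.
MISMATCHES = [
    (9, "str_replace", "file_editor"),
    (4, "command", "terminal"),
    (4, "execute", "terminal"),
    (3, "bash", "terminal"),
    (3, "grep -n", "terminal"),
    (1, "ls", "terminal"),
]

# Sum the counts per intended OpenHands tool.
by_target = Counter()
for count, _called, target in MISMATCHES:
    by_target[target] += count

total = sum(by_target.values())
print(total, dict(by_target))
```

Every listed mismatch resolves to one of two existing tools, which is why two aliases are enough for the names that map cleanly.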

Because every wrong-name call is an instant hard failure, execution_status: stuck appears 14 times (the model loops on the same call until CONVERSATION_TIMEOUT = 2 h). That tight retry loop also generates rapid-fire LLM requests that trigger LLMRateLimitError (17 occurrences), which in turn burns MAX_RETRIES=3 against a problem that will never self-correct.

Root cause

Nemotron-3 Super was fine-tuned on tool-use trajectories using Anthropic's str_replace_based_edit_tool / bash schema. In that schema:

  • Shell execution is a standalone tool named bash (not terminal)
  • File editing is a standalone tool named str_replace (not a sub-command of file_editor)

The model's behavior is correct for that schema. The problem is a pure name mismatch: the parameters it passes (path, old_str, new_str, command, file_text) are semantically correct and map 1-to-1 to existing OpenHands tool arguments — only the tool names differ.
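To illustrate how direct the mapping is, here is a rough sketch of a name translation for a single call. It is not the proposed fix (the preset below is), and `translate_call` and `TERMINAL_ALIASES` are hypothetical names that do not exist in OpenHands.

```python
from typing import Any

# Names seen in the eval runs that were meant for the terminal tool.
TERMINAL_ALIASES = {"bash", "command", "execute"}


def translate_call(name: str, args: dict[str, Any]) -> tuple[str, dict[str, Any]]:
    """Map an Anthropic-style call onto OpenHands tool names (illustration only)."""
    if name in TERMINAL_ALIASES:
        # Only the tool name changes; the command string passes through untouched.
        return "terminal", {"command": args["command"]}
    if name == "str_replace":
        # Default to the str_replace sub-command unless the model supplied one.
        return "file_editor", {"command": "str_replace", **args}
    # Names such as task_tracker, finish, and think already match.
    return name, dict(args)


print(translate_call("bash", {"command": "ls -la"}))
print(translate_call("str_replace", {"path": "/tmp/a.py", "old_str": "x", "new_str": "y"}))
```

Calls like `grep -n` or `ls`, where the model used a shell command as the tool name, are left out of this sketch because they would need argument reconstruction rather than a rename.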

Proposed fix: a nemotron tool preset

Following the existing pattern (gemini.py exposes read_file/write_file/edit/list_directory instead of file_editor; gpt5.py exposes apply_patch instead of file_editor), add a nemotron preset that exposes tools under the names the model expects.

New tools needed

1. BashTool — name: "bash"

Thin wrapper around TerminalExecutor. Accepts a single command: str parameter, matching Anthropic's bash tool schema exactly.

name:        bash
description: Run a shell command and return stdout/stderr.
params:
  command (str, required): The shell command to execute.

The model will not pass security_risk or summary (those are OpenHands-specific), so the schema must not require them.
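As a sketch of what the model would see, the spec above could be rendered as an OpenAI-style function schema, which is the format LiteLLM accepts for tools. The dict layout and the `missing_required` helper are assumptions for illustration, not the OpenHands tool definition.

```python
# Hypothetical rendering of the bash tool spec as a function-calling schema.
BASH_TOOL_SCHEMA = {
    "type": "function",
    "function": {
        "name": "bash",
        "description": "Run a shell command and return stdout/stderr.",
        "parameters": {
            "type": "object",
            "properties": {
                "command": {
                    "type": "string",
                    "description": "The shell command to execute.",
                },
            },
            # Only `command` is required; security_risk and summary are deliberately absent.
            "required": ["command"],
        },
    },
}


def missing_required(schema: dict, args: dict) -> list[str]:
    """Return the required parameter names that a call did not supply."""
    required = schema["function"]["parameters"]["required"]
    return [key for key in required if key not in args]


print(missing_required(BASH_TOOL_SCHEMA, {"command": "ls"}))
```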

2. StrReplaceTool — name: "str_replace"

Exposes the same operations as FileEditorTool under the Anthropic-compatible name. Parameter schema matches str_replace_based_edit_tool exactly:

name:        str_replace
description: View, create and edit plain-text files.
params:
  command     (str, required)       : "view" | "create" | "str_replace" | "insert" | "undo_edit"
  path        (str, required)       : Absolute path to file or directory.
  old_str     (str | None)          : For str_replace — exact text to find and replace.
  new_str     (str | None)          : For str_replace/insert — replacement or inserted text.
  file_text   (str | None)          : For create — full file content.
  insert_line (int | None)          : For insert — line number to insert after.
  view_range  (list[int] | None)    : For view — [start_line, end_line].

Internally delegates to FileEditorExecutor (identical backing to FileEditorTool).
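A minimal sketch of the argument shape, using a plain dataclass rather than whatever action class the real tool would use. `StrReplaceArgs`, `validate`, and the per-command requirements in `REQUIRED_BY_COMMAND` are assumptions inferred from the parameter descriptions above, not the FileEditorExecutor's actual validation.

```python
from dataclasses import dataclass
from typing import Optional

COMMANDS = ("view", "create", "str_replace", "insert", "undo_edit")

# Fields each command needs beyond `command` and `path` (assumed, see lead-in).
REQUIRED_BY_COMMAND = {
    "view": (),
    "create": ("file_text",),
    "str_replace": ("old_str",),
    "insert": ("insert_line", "new_str"),
    "undo_edit": (),
}


@dataclass
class StrReplaceArgs:
    command: str
    path: str
    old_str: Optional[str] = None
    new_str: Optional[str] = None
    file_text: Optional[str] = None
    insert_line: Optional[int] = None
    view_range: Optional[list[int]] = None

    def validate(self) -> list[str]:
        """Return human-readable problems; an empty list means the call looks well-formed."""
        if self.command not in COMMANDS:
            return [f"unknown command: {self.command!r}"]
        return [
            f"{self.command} requires {field}"
            for field in REQUIRED_BY_COMMAND[self.command]
            if getattr(self, field) is None
        ]


print(StrReplaceArgs("str_replace", "/tmp/a.py", old_str="x", new_str="y").validate())
print(StrReplaceArgs("create", "/tmp/b.py").validate())
```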

New preset file: openhands-tools/openhands/tools/preset/nemotron.py

"""Nemotron-3 Super preset.

Nemotron-3 Super (nvidia/nemotron-3-super-120b-a12b) was fine-tuned on
trajectories that use the Anthropic str_replace_based_edit_tool / bash
tool schema. This preset exposes those exact tool names so the model's
calls succeed without any prompt engineering or model-side changes.

  bash        → TerminalExecutor    (model calls "bash", not "terminal")
  str_replace → FileEditorExecutor  (model calls "str_replace", not "file_editor")
  task_tracker, finish, think — unchanged; model already calls these correctly.
"""

from openhands.sdk import Agent
from openhands.sdk.context.condenser import LLMSummarizingCondenser
from openhands.sdk.llm.llm import LLM
from openhands.sdk.tool import Tool


def get_nemotron_tools() -> list[Tool]:
    # Imported lazily so the tool classes are only loaded when the preset is used.
    from openhands.tools.nemotron import BashTool, StrReplaceTool
    from openhands.tools.task_tracker import TaskTrackerTool

    return [
        Tool(name=BashTool.name),         # "bash"
        Tool(name=StrReplaceTool.name),   # "str_replace"
        Tool(name=TaskTrackerTool.name),  # "task_tracker"
    ]


def get_nemotron_agent(llm: LLM, cli_mode: bool = False) -> Agent:
    tools = get_nemotron_tools()
    return Agent(
        llm=llm,
        tools=tools,
        system_prompt_kwargs={"cli_mode": cli_mode},
        condenser=LLMSummarizingCondenser(
            llm=llm.model_copy(update={"usage_id": "condenser"}),
            max_size=80,
            keep_first=4,
        ),
    )

Export get_nemotron_agent / get_nemotron_tools from preset/__init__.py.

Wire into the eval pipeline

Add "nemotron" to the tool_preset choice list in the eval workflow (eval-job.yml) and in run_swebench.sh / run_swtbench.sh. Nemotron evaluations should then be dispatched with TOOL_PRESET=nemotron.
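On the script side, the selection could look roughly like the sketch below. The preset names other than "nemotron", the "default" fallback, and the `resolve_tool_preset` helper are assumptions for illustration; the actual choice list lives in eval-job.yml and the run scripts.

```python
import os
from collections.abc import Mapping

# Assumed choice list; only "nemotron" is the new entry proposed here.
TOOL_PRESETS = ("default", "gemini", "gpt5", "nemotron")


def resolve_tool_preset(env: Mapping[str, str] = os.environ) -> str:
    """Read TOOL_PRESET from the environment and reject values outside the choice list."""
    preset = env.get("TOOL_PRESET", "default")
    if preset not in TOOL_PRESETS:
        raise ValueError(f"unknown TOOL_PRESET {preset!r}; expected one of {TOOL_PRESETS}")
    return preset


print(resolve_tool_preset({"TOOL_PRESET": "nemotron"}))
```

Failing fast on an unknown value avoids silently falling back to the default preset, which would reproduce the tool-name errors described above.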

Expected impact

| Metric | Current (default preset) | With nemotron preset |
|---|---|---|
| Tool-name errors | 53% of conversations | ~0% |
| stuck conversations | 9% | ~0% |
| LLMRateLimitError | 11% | drops sharply |
| Useful work per API call | ~0.3× | ~1× |

The two eval runs resolved ~25 of the 47 instances that did run (~53%). With tool names fixed, the solve rate on a full 500-instance run is unknown, but the model demonstrated sound reasoning on the instances it could execute.
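The stuck, rate-limit, and resolve percentages can be reproduced from the raw counts quoted earlier. This sketch assumes the 14 stuck and 17 LLMRateLimitError occurrences both come from the 156-conversation SWT-bench run, which the issue does not state explicitly.

```python
def pct(numerator: int, denominator: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


SWT_CONVERSATIONS = 156  # assumed denominator for the stuck and rate-limit counts

stuck_pct = pct(14, SWT_CONVERSATIONS)         # execution_status: stuck
rate_limit_pct = pct(17, SWT_CONVERSATIONS)    # LLMRateLimitError
resolved_pct = pct(25, 47)                     # resolved among instances that ran

print(stuck_pct, rate_limit_pct, resolved_pct)  # 9.0 10.9 53.2
```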

Notes

  • task_tracker, finish, and think do not need aliasing — the model uses those names correctly in both runs.
  • No system-prompt changes are needed. The fix is entirely in the tool name exposed in the schema.
  • BashTool and StrReplaceTool should live in a new openhands-tools/openhands/tools/nemotron/ directory following the gemini/ directory pattern.
