## Context
Two evaluation runs of `nvidia/nemotron-3-super-120b-a12b` via the LiteLLM proxy show a 63–67% conversation error rate, almost entirely caused by the model calling tool names that do not exist in OpenHands:
| Run | Benchmark | Instances | Error rate | Top errors |
|---|---|---|---|---|
| 23368683367 | SWE-bench | 47/500 | 63.8% | `str_replace`, `command`, `grep` not found |
| 23270655200 | SWT-bench | 156 conv | 66.7% | `str_replace`, `command`, `execute`, `bash`, `grep -n` not found |
Full tool-name mismatch breakdown from the SWT-bench run (156 conversations):

| Count | Tool name model called | Should have called |
|---|---|---|
| 9 | `str_replace` | `file_editor` (with `command="str_replace"`) |
| 4 | `command` | `terminal` |
| 4 | `execute` | `terminal` |
| 3 | `bash` | `terminal` |
| 3 | `grep -n` | `terminal` |
| 1 | `ls` | `terminal` |
Because every wrong-name call is an instant hard failure, `execution_status: stuck` appears 14 times (the model loops on the same call until `CONVERSATION_TIMEOUT` = 2 h). That tight retry loop also generates rapid-fire LLM requests that trigger `LLMRateLimitError` (17 occurrences), which in turn burns `MAX_RETRIES=3` against a problem that will never self-correct.
## Root cause
Nemotron-3 Super was fine-tuned on tool-use trajectories using Anthropic's `str_replace_based_edit_tool` / `bash` schema. In that schema:
- Shell execution is a standalone tool named `bash` (not `terminal`).
- File editing is a standalone tool named `str_replace` (not a sub-command of `file_editor`).
The model's behavior is correct for that schema. The problem is a pure name mismatch: the parameters it passes (`path`, `old_str`, `new_str`, `command`, `file_text`) are semantically correct and map 1-to-1 to existing OpenHands tool arguments; only the tool names differ.
## Proposed fix: a `nemotron` tool preset
Following the existing pattern (`gemini.py` exposes `read_file`/`write_file`/`edit`/`list_directory` instead of `file_editor`; `gpt5.py` exposes `apply_patch` instead of `file_editor`), add a `nemotron` preset that exposes tools under the names the model expects.
## New tools needed
### 1. `BashTool` (name: `"bash"`)

Thin wrapper around `TerminalExecutor`. Accepts a single `command: str` parameter, matching Anthropic's `bash` tool schema exactly:

```yaml
name: bash
description: Run a shell command and return stdout/stderr.
params:
  command (str, required): The shell command to execute.
```

The model will not pass `security_risk` or `summary` (those are OpenHands-specific), so the schema must not require them.
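Serialized for function calling, the schema might look like the sketch below. This is an illustration of the "only `command` is required" constraint, not the exact wire format OpenHands emits:

```python
import json

# Illustrative JSON schema for the "bash" tool as the model might see it.
# The exact OpenHands serialization may differ; the key point is that
# security_risk / summary are absent, and only "command" is required.
BASH_TOOL_SCHEMA = {
    "name": "bash",
    "description": "Run a shell command and return stdout/stderr.",
    "parameters": {
        "type": "object",
        "properties": {
            "command": {
                "type": "string",
                "description": "The shell command to execute.",
            },
        },
        "required": ["command"],
    },
}

print(json.dumps(BASH_TOOL_SCHEMA, indent=2))
```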
### 2. `StrReplaceTool` (name: `"str_replace"`)

Exposes the same operations as `FileEditorTool` under the Anthropic-compatible name. The parameter schema matches `str_replace_based_edit_tool` exactly:

```yaml
name: str_replace
description: View, create and edit plain-text files.
params:
  command (str, required): "view" | "create" | "str_replace" | "insert" | "undo_edit"
  path (str, required): Absolute path to file or directory.
  old_str (str | None): For str_replace, the exact text to find and replace.
  new_str (str | None): For str_replace/insert, the replacement or inserted text.
  file_text (str | None): For create, the full file content.
  insert_line (int | None): For insert, the line number to insert after.
  view_range (list[int] | None): For view, [start_line, end_line].
```

Internally delegates to `FileEditorExecutor` (the same backing as `FileEditorTool`).
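For clarity on the parameter contract, here is a minimal pure-Python sketch of the dispatch semantics for three of the commands (`create`, `str_replace`, `view`). It deliberately does not use `FileEditorExecutor` (which the real tool delegates to); it only models how the parameters above select behavior:

```python
from pathlib import Path


def str_replace_tool(command: str, path: str, old_str=None, new_str=None,
                     file_text=None, insert_line=None, view_range=None) -> str:
    """Sketch of the str_replace tool's dispatch; not the real implementation."""
    p = Path(path)
    if command == "create":
        p.write_text(file_text or "")
        return f"Created {path}"
    if command == "str_replace":
        text = p.read_text()
        if text.count(old_str) != 1:
            # Anthropic-style editors require old_str to match uniquely.
            return f"Error: old_str must occur exactly once in {path}"
        p.write_text(text.replace(old_str, new_str or ""))
        return f"Edited {path}"
    if command == "view":
        lines = p.read_text().splitlines()
        if view_range:
            start, end = view_range  # 1-indexed, inclusive
            lines = lines[start - 1:end]
        return "\n".join(lines)
    return f"Error: unsupported command {command!r}"
```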
## New preset file: `openhands-tools/openhands/tools/preset/nemotron.py`
"""Nemotron-3 Super preset.
Nemotron-3 Super (nvidia/nemotron-3-super-120b-a12b) was fine-tuned on
trajectories that use the Anthropic str_replace_based_edit_tool / bash
tool schema. This preset exposes those exact tool names so the model's
calls succeed without any prompt engineering or model-side changes.
bash → TerminalExecutor (model calls "bash", not "terminal")
str_replace → FileEditorExecutor (model calls "str_replace", not "file_editor")
task_tracker, finish, think — unchanged; model already calls these correctly.
"""
from openhands.sdk import Agent
from openhands.sdk.context.condenser import LLMSummarizingCondenser
from openhands.sdk.llm.llm import LLM
from openhands.sdk.tool import Tool
def get_nemotron_tools() -> list[Tool]:
from openhands.tools.nemotron import BashTool, StrReplaceTool
from openhands.tools.task_tracker import TaskTrackerTool
return [
Tool(name=BashTool.name), # "bash"
Tool(name=StrReplaceTool.name), # "str_replace"
Tool(name=TaskTrackerTool.name), # "task_tracker"
]
def get_nemotron_agent(llm: LLM, cli_mode: bool = False) -> Agent:
tools = get_nemotron_tools()
return Agent(
llm=llm,
tools=tools,
system_prompt_kwargs={"cli_mode": cli_mode},
condenser=LLMSummarizingCondenser(
llm=llm.model_copy(update={"usage_id": "condenser"}),
max_size=80, keep_first=4,
),
)Export get_nemotron_agent / get_nemotron_tools from preset/__init__.py.
## Wire into the eval pipeline
Add "nemotron" to the tool_preset choice list in the eval workflow (eval-job.yml) and in run_swebench.sh / run_swtbench.sh. Nemotron evaluations should then be dispatched with TOOL_PRESET=nemotron.
## Expected impact
| Metric | Current (default preset) | With nemotron preset |
|---|---|---|
| Tool-name errors | 53% of conversations | ~0% |
| `stuck` conversations | 9% | ~0% |
| `LLMRateLimitError` | 11% | drops sharply |
| Useful work per API call | ~0.3× | ~1× |
The eval runs resolved ~25/47 instances (~53%) on the fraction that did run. With tool names fixed, the solve rate on a full 500-instance run is unknown, but the model demonstrated sound reasoning on the instances it could execute.
## Notes
- `task_tracker`, `finish`, and `think` do not need aliasing; the model uses those names correctly in both runs.
- No system-prompt changes are needed. The fix is entirely in the tool names exposed in the schema.
- `BashTool` and `StrReplaceTool` should live in a new `openhands-tools/openhands/tools/nemotron/` directory, following the `gemini/` directory pattern.
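Assuming the new package mirrors the `gemini/` layout, the directory might look roughly like this (the individual file names here are illustrative, not confirmed against the repo):

```
openhands-tools/openhands/tools/nemotron/
├── __init__.py    # exports BashTool, StrReplaceTool
├── definition.py  # tool schemas under the Anthropic-compatible names
└── impl.py        # thin delegation to TerminalExecutor / FileEditorExecutor
```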
## References
- Eval issue tracking the root cause: OpenHands/evaluation#343
- Anthropic `str_replace_based_edit_tool` docs: https://docs.anthropic.com/en/docs/build-with-claude/tool-use/text-editor-tool
- Existing model-specific presets: `preset/gemini.py`, `preset/gpt5.py`