feat: add nemotron tool preset (bash + str_replace aliases for Anthropic schema compatibility)#2554
Closed
juanmichelini wants to merge 2 commits into main from
Python API breakage checks: ✅ PASSED
REST API breakage checks (OpenAPI): ✅ PASSED
Need to retest (missed LiteLLM param)
Force-pushed: be1e9f0 to 0b1429e
juanmichelini pushed a commit to OpenHands/benchmarks that referenced this pull request on Mar 26, 2026

Add gpt5 and nemotron to:
- ToolPresetType literal in benchmarks/utils/models.py
- get_tools_for_preset() in benchmarks/swebench/run_infer.py
- get_tools_for_preset() in benchmarks/swebenchmultilingual/run_infer.py

This enables evaluations with:
- gpt5: uses apply_patch tool for file editing
- nemotron: uses bash/str_replace tools (Anthropic-compatible)

These presets are already supported in the software-agent-sdk but were missing from the benchmarks implementation.

Related: OpenHands/software-agent-sdk#2554

Co-authored-by: openhands <openhands@all-hands.dev>
This was referenced Mar 26, 2026
Force-pushed: 0b1429e to c8f7f90
juanmichelini pushed a commit to OpenHands/benchmarks that referenced this pull request on Mar 27, 2026

Add gpt5 and nemotron to:
- ToolPresetType literal in benchmarks/utils/models.py
- get_tools_for_preset() in benchmarks/swebench/run_infer.py
- get_tools_for_preset() in benchmarks/swebenchmultilingual/run_infer.py

This enables evaluations with:
- gpt5: uses apply_patch tool for file editing
- nemotron: uses bash/str_replace tools (Anthropic-compatible)

These presets are already supported in the software-agent-sdk but were missing from the benchmarks implementation.

Related: OpenHands/software-agent-sdk#2554

Co-authored-by: openhands <openhands@all-hands.dev>
enyst (Collaborator) reviewed on Mar 27, 2026 and left a comment:
@juanmichelini Please see the discussion here: #2584 (comment)
I suggest we could think how we do this, first? Otherwise we’d have to re-eval if we change implementation deeply, I guess.
Force-pushed: c8f7f90 to 6611469
juanmichelini pushed a commit to OpenHands/benchmarks that referenced this pull request on Mar 27, 2026

Add nemotron tool preset back to enable testing before merge. This PR and OpenHands/software-agent-sdk#2554 will be merged simultaneously or not at all.

Changes:
- Added 'nemotron' to ToolPresetType in benchmarks/utils/models.py
- Added nemotron case to get_tools_for_preset() in benchmarks/utils/tools.py
- Added test_get_tools_for_preset_nemotron() to tests/test_tools.py

Co-authored-by: openhands <openhands@all-hands.dev>
…pic schema compatibility)

Add a new 'nemotron' tool preset for Nemotron-3 Super (nvidia/nemotron-3-super-120b-a12b) which was fine-tuned on trajectories using Anthropic's tool schema.

The preset exposes:
- BashTool: A tool named 'bash' (instead of 'terminal') that wraps TerminalExecutor
- StrReplaceTool: A tool named 'str_replace' (instead of 'file_editor') that wraps FileEditorExecutor

This fixes the 63-67% conversation error rate observed in Nemotron evaluations, caused entirely by tool name mismatches where the model called tools like 'bash', 'str_replace', 'command', 'execute' that don't exist in the default OpenHands schema.

New files:
- openhands-tools/openhands/tools/nemotron/bash/ - BashTool implementation
- openhands-tools/openhands/tools/nemotron/str_replace/ - StrReplaceTool implementation
- openhands-tools/openhands/tools/preset/nemotron.py - Preset configuration
- tests/tools/nemotron/ - Test coverage for new tools and preset

Exports added:
- get_nemotron_agent, get_nemotron_tools from openhands.tools.preset

Fixes #2553

Co-authored-by: openhands <openhands@all-hands.dev>
Add nemotron to:
- ToolPresetType literal in tests/integration/base.py
- get_tools_for_preset() function to return nemotron tools
- run-eval.yml workflow tool_preset dropdown
- integration-runner.yml workflow tool_preset dropdown
- run_infer.py argparse choices
- test_tool_presets.py for nemotron validation

This enables running evaluations with TOOL_PRESET=nemotron to test the Nemotron-3 Super model with its native tool names (bash, str_replace).

Co-authored-by: openhands <openhands@all-hands.dev>
Force-pushed: 15e1460 to 7d83d44
Summary
This PR adds a new nemotron tool preset for Nemotron-3 Super (nvidia/nemotron-3-super-120b-a12b), which was fine-tuned on trajectories using Anthropic's tool schema. The preset exposes:
- bash (instead of terminal) that wraps TerminalExecutor
- str_replace (instead of file_editor) that wraps FileEditorExecutor

Problem
Two evaluation runs of nvidia/nemotron-3-super-120b-a12b showed a 63-67% conversation error rate, almost entirely caused by the model calling tool names that don't exist in OpenHands:
- str_replace (OpenHands equivalent: file_editor, with command="str_replace")
- bash (OpenHands equivalent: terminal)
- command (OpenHands equivalent: terminal)
- execute (OpenHands equivalent: terminal)

The model's behavior is correct for Anthropic's schema: it was trained on the str_replace_based_edit_tool / bash tool interface. The problem is a pure name mismatch.

Solution
Following the existing pattern (gemini.py, gpt5.py), this PR adds a nemotron preset that exposes tools under the names the model expects:

New files:
- openhands-tools/openhands/tools/nemotron/bash/ - BashTool implementation
- openhands-tools/openhands/tools/nemotron/str_replace/ - StrReplaceTool implementation
- openhands-tools/openhands/tools/preset/nemotron.py - Preset configuration
- tests/tools/nemotron/ - Test coverage for new tools and preset (21 tests)

CI integration:
- nemotron to tool_preset dropdown in run-eval.yml workflow
- nemotron to tool_preset dropdown in integration-runner.yml workflow
- ToolPresetType and get_tools_for_preset() in tests/integration/base.py
- tests/integration/run_infer.py

Exports added:
- get_nemotron_agent, get_nemotron_tools from openhands.tools.preset

Usage
Or using the tools directly:
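As a rough, self-contained illustration of what the aliases do, the sketch below registers the same executor under the name the model was trained on. It does not use the SDK's API: Tool, make_registry, and dispatch are hypothetical stand-ins, and the executors are dummy functions. Only the tool names (terminal, file_editor, bash, str_replace) come from this PR.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Tool:
    """Hypothetical stand-in for an SDK tool: a name plus an executor."""

    name: str
    executor: Callable[[dict], str]


def terminal_executor(args: dict) -> str:
    # Dummy executor; a real one would run the command in a shell.
    return f"ran: {args['command']}"


def file_editor_executor(args: dict) -> str:
    # Dummy executor; a real one would edit the file on disk.
    return f"edited: {args['path']}"


def make_registry(*tools: Tool) -> Dict[str, Tool]:
    """Index tools by the name that is advertised to the model."""
    return {tool.name: tool for tool in tools}


def dispatch(registry: Dict[str, Tool], name: str, args: dict) -> str:
    """Route a model tool call to its executor, failing on unknown names."""
    tool = registry.get(name)
    if tool is None:
        raise KeyError(f"unknown tool: {name}")
    return tool.executor(args)


# Default names versus the aliased names; the executors are shared.
default_tools = make_registry(
    Tool("terminal", terminal_executor),
    Tool("file_editor", file_editor_executor),
)
nemotron_tools = make_registry(
    Tool("bash", terminal_executor),
    Tool("str_replace", file_editor_executor),
)
```

With these registries, a call to bash fails against default_tools but resolves against nemotron_tools, which is the name mismatch the preset is meant to remove.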
Testing via CI
To test the nemotron preset through the evaluation workflow:
- openhands/nemotron-tool-preset as sdk_ref (check 'Allow unreleased branches')
- tool_preset to nemotron

No additional PRs needed in evaluation or benchmarks repos - the tool_preset is passed through to the SDK.
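The pass-through can be pictured as a lookup from the tool_preset string to the tool names exposed to the model. This is a sketch under assumptions: get_tool_names_for_preset is a hypothetical helper, not the real get_tools_for_preset(), the preset name "default" is assumed, and only presets whose tool names are spelled out in this PR are included.

```python
from typing import Dict, List, Literal, get_args

# Assumption: reduced to two presets for illustration; the real literal
# covers more (this PR mentions gemini and gpt5 presets as well).
ToolPresetType = Literal["default", "nemotron"]

# Tool names as described in this PR; the mapping itself is illustrative.
_PRESET_TOOL_NAMES: Dict[str, List[str]] = {
    "default": ["terminal", "file_editor"],
    "nemotron": ["bash", "str_replace"],
}


def get_tool_names_for_preset(preset: str) -> List[str]:
    """Hypothetical helper: return the tool names exposed for a preset."""
    if preset not in get_args(ToolPresetType):
        raise ValueError(f"unknown tool preset: {preset!r}")
    # Return a copy so callers cannot mutate the shared mapping.
    return list(_PRESET_TOOL_NAMES[preset])
```

Validating against the literal keeps the accepted preset names in one place, which mirrors how the CI dropdowns and argparse choices each need the new nemotron value.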
Fixes #2553
Checklist
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
- eclipse-temurin:17-jdk
- nikolaik/python-nodejs:python3.13-nodejs22-slim
- golang:1.21-bookworm
Pull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:ebba9d4-python
Run
All tags pushed for this build
About Multi-Architecture Support
- Each variant tag (e.g. ebba9d4-python) is a multi-arch manifest supporting both amd64 and arm64
- Architecture-specific tags (e.g. ebba9d4-python-amd64) are also available if needed