Add NewtonBench Resource Server #650

Open
Kelvin0110 wants to merge 2 commits into NVIDIA-NeMo:main from Kelvin0110:cmunley1/newton

Conversation

@Kelvin0110

Contributing To NeMo-Gym (NewtonBench Resource Server)

1) Basic information

i. Description of the environment

A resource server wrapping the NewtonBench benchmark:

  • Tasks: 324 scientific law discovery tasks across 12 physics domains.
  • Observation Space: Experimental results (numeric or structured dictionaries) returned after tool use.
  • Tools:
    • run_experiment: Query the environment with specific parameters to receive physical observations (see the illustrative call after this list).
    • execute_python: (Optional) Python code-assisted discovery for complex data analysis.
  • Server: FastAPI resource server following NeMo Gym conventions.
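
For illustration, a hedged sketch of what a run_experiment tool call and its observation might look like; the argument names, value types, and result fields are hypothetical and depend on the physics module that was seeded.

```python
# Hedged illustration of a run_experiment tool call and the kind of observation
# it returns; argument names and result fields are hypothetical and vary per
# NewtonBench physics module.
tool_call = {
    "name": "run_experiment",
    "arguments": {"m1": 5.0, "m2": 3.0, "r": 2.0},  # experiment inputs chosen by the agent
}

# The resource server answers with an experimental observation, e.g.:
observation = {"force": 2.5e-10}  # numeric result or structured dictionary
```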

ii. Description of the verification logic

The verifier uses the NewtonBench evaluation suite to score the agent's proposed scientific law:

  • Law Extraction: Attempts to find a law within <final_law> tags in the assistant's final response.
  • Success Criteria: Evaluates both symbolic equivalence (via an LLM judge) and numeric accuracy (Root Mean Square Logarithmic Error - RMSLE).
  • Reward Calculation (see the sketch after this list):
    • $\text{reward} = 0.3 \cdot R_{\text{symbolic}} + 0.7 \cdot R_{\text{numeric}}$.
      • $R_{\text{symbolic}}$ is $1.0$ if the judge deems the laws equivalent, $-1.0$ otherwise.
      • $R_{\text{numeric}} = 1.0 - \dfrac{2.0 \cdot \text{RMSLE}}{\text{RMSLE} + 3.0}$, yielding a score in $(-1, 1]$.
  • /verify endpoint processes the agent's submission and returns these detailed performance metrics.
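
For reference, a minimal sketch of the reward combination described above; the actual scoring lives in the NewtonBench evaluation suite, and the helper signature here is illustrative only.

```python
# Minimal sketch of the reward combination; the real implementation is in the
# NewtonBench evaluation suite, and this helper's signature is illustrative.
def compute_reward(symbolically_equivalent: bool, rmsle: float) -> float:
    """reward = 0.3 * R_symbolic + 0.7 * R_numeric."""
    r_symbolic = 1.0 if symbolically_equivalent else -1.0
    # R_numeric = 1 - 2*RMSLE / (RMSLE + 3): equals 1.0 at RMSLE = 0 and
    # approaches -1.0 as RMSLE grows, so it lies in (-1, 1].
    r_numeric = 1.0 - (2.0 * rmsle / (rmsle + 3.0))
    return 0.3 * r_symbolic + 0.7 * r_numeric

# An exact discovery (equivalent law, zero numeric error) scores 1.0:
assert compute_reward(True, 0.0) == 1.0
```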

iii. Description of the prompts/tasks (source + domain)

Domain: Maths (Scientific Law Discovery).
Source: Tasks and prompts adapted from the NewtonBench benchmark, which instruct the agent to discover a specific shifted scientific law (e.g., Newton's Law of Gravitation, Snell's Law) by performing interactive experiments.
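
To give a hedged sense of what "shifted" means here (the exact perturbations are defined by NewtonBench and differ per task): instead of the textbook form $F = G\,\frac{m_1 m_2}{r^2}$, a task might hide a modified variant such as $F = G\,\frac{m_1 m_2}{r^3}$, so the agent cannot simply recall the known law and must infer the altered form from its own experiments.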

iv. License information

  • Code: Apache 2.0.
  • Data: Apache 2.0.
  • NewtonBench Benchmark: MIT (Copyright (c) 2025 HKUST-KnowComp).

2) Environment validity check

i. Commands used to collect rollouts

```bash
# Start NeMo Gym servers (agent + NewtonBench)
config_paths="resources_servers/newton_bench/configs/newton_bench.yaml,\
responses_api_models/vllm_model/configs/vllm_model.yaml"
ng_run "+config_paths=[$config_paths]"

# Collect sample rollouts
ng_collect_rollouts \
    +agent_name=newton_bench_simple_agent \
    +input_jsonl_fpath=resources_servers/newton_bench/data/example.jsonl \
    +output_jsonl_fpath=resources_servers/newton_bench/data/example_rollouts.jsonl \
    +limit=5

# View rollouts
ng_viewer +jsonl_fpath=resources_servers/newton_bench/data/example_rollouts.jsonl
```
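
For orientation, one prompt line in example.jsonl might look roughly like the sketch below; the field names are hypothetical, and the checked-in example.jsonl is the authoritative schema.

```python
# Hypothetical shape of a single prompt record in example.jsonl; field names
# are illustrative only -- consult the actual example.jsonl for the real schema.
import json

example_record = {
    "module": "gravitation",  # which NewtonBench physics module to seed
    "prompt": "Run experiments to discover the shifted law relating F, m1, m2, and r.",
}
print(json.dumps(example_record))
```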

ii. Resulting rollouts (5 examples)

See resources_servers/newton_bench/data/example_rollouts.jsonl
Expected behavior:

  • Agent performs several experiments, analyzes data, and submits a scientific law.
  • Successful discovery $\rightarrow$ positive reward ($\approx$ 1.0).
  • Failed discovery $\rightarrow$ reward $\approx$ 0.0 or negative.

3) Tests

i. Commands used to run the tests

source resources_servers/newton_bench/.venv/bin/activate 
pytest resources_servers/newton_bench/tests/test_app.py

Coverage notes:
Resource server tests provide comprehensive coverage of the following areas:

  • Session Lifecycle: Successful seeding, error handling for invalid modules, session ending, and background cleanup.
  • Experiment Execution: Dynamic handler registration for each module, basic run_experiment execution, and error handling for uninitialized sessions, mismatched module calls, etc.
  • Python Sandbox: Basic execution, session-based code persistence, timeout enforcement, and security validation (restricting dangerous imports/operations).
  • Verification Logic: Law extraction from diverse response structures, and reward calculation via symbolic equivalence (LLM judge) and numeric RMSLE (a minimal test sketch follows this list).
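
As a hedged illustration of that last area, a test of this style might look like the sketch below; extract_final_law is a placeholder, not the resource server's actual helper.

```python
# Sketch of a verification-logic test; extract_final_law is a placeholder for
# the resource server's actual law-extraction helper.
import re

def extract_final_law(text: str) -> str | None:
    """Pull the proposed law out of <final_law>...</final_law> tags."""
    match = re.search(r"<final_law>(.*?)</final_law>", text, re.DOTALL)
    return match.group(1).strip() if match else None

def test_law_extraction_from_final_response():
    response = (
        "Fitting the data suggests an inverse-cube dependence. "
        "<final_law>F = G * m1 * m2 / r**3</final_law>"
    )
    assert extract_final_law(response) == "F = G * m1 * m2 / r**3"

def test_missing_final_law_returns_none():
    assert extract_final_law("I could not determine the law.") is None
```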

4) Reward profiling

Model: Qwen/Qwen3-VL-8B-Thinking

Method:

  • 108 prompts based on version v0 of the scientific laws.
  • 4 rollouts per prompt (432 total).
  • Tool calling of run_experiment enabled; the agent loops until it submits a law (see the aggregation sketch below).
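
A hedged sketch of how the summary statistics below can be aggregated from the collected rollouts; the file path and the reward / num_tool_calls field names are assumptions about the rollout JSONL schema rather than the actual keys.

```python
# Hedged aggregation sketch; the path and the "reward" / "num_tool_calls" field
# names are assumptions about the rollout JSONL schema, not the actual keys.
import json
import statistics

with open("resources_servers/newton_bench/data/profiling_rollouts.jsonl") as f:
    rollouts = [json.loads(line) for line in f]

rewards = [r["reward"] for r in rollouts]
tool_calls = [r["num_tool_calls"] for r in rollouts]

print("total rollouts:", len(rewards))
print("mean / median reward:", statistics.mean(rewards), statistics.median(rewards))
print("min / max reward:", min(rewards), max(rewards))
# Pearson correlation between tool-call count and reward (Python >= 3.10)
print("tool-call vs reward correlation:", statistics.correlation(tool_calls, rewards))
```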

Results:
Overall Metrics

  • Total Rollouts: 432
  • Mean Reward: $\approx$ 0.0675
  • Median Reward: 0.0
  • Min Reward: $\approx$ -0.8786
  • Max Reward: 1.0

Tool Call Statistics

  • Average Tool Calls: 22.95 per rollout
  • Min Tool Calls: 0
  • Max Tool Calls: 1770
  • Correlation (tool calls $\leftrightarrow$ reward): $\approx$ -0.0211 (Weak negative correlation)

Reward Distribution (Buckets)

| Reward Range | Count |
| --- | --- |
| [-1.0, -0.8) | 16 |
| [-0.8, -0.6) | 16 |
| [-0.6, -0.4) | 60 |
| [-0.4, -0.2) | 39 |
| [-0.2, 0.0) | 24 |
| [0.0, 0.2) | 150 |
| [0.2, 0.4) | 46 |
| [0.4, 0.6) | 2 |
| [0.6, 0.8) | 1 |
| [0.8, 1.0] | 78 |

Performance by Tool Call Count Bins

| Tool Call Range | Rollouts (n) | Mean Reward |
| --- | --- | --- |
| 0 | 23 | $\approx$ -0.1112 |
| 1–10 | 329 | $\approx$ 0.0824 |
| 11–50 | 60 | $\approx$ 0.1308 |
| 51–200 | 15 | $\approx$ -0.1959 |
| 201–2000 | 5 | $\approx$ -0.0600 |

Key observations:

  • Symbolic Accuracy: Symbolic accuracy of approximately 19.7% and a wide spread in the RMSLE distribution indicate frequent failures to recover the exact symbolic form or precise numeric behavior.
  • Reward Distribution: Rewards cluster near zero (median 0.0, mean ~0.0675) with a long tail and many negative outcomes, reflecting frequent partial or failed discoveries.
  • Tool Usage Sweet Spot: Positive performance is observed with moderate tool use (1–50 calls), with a peak in the 11–50 range, suggesting that tool-driven data collection is critical for inducing scientific laws.
  • Diminishing Returns: Performance declines sharply beyond 50 calls, indicating that additional tool calls become detrimental and that successful discovery depends on reasoning and hypothesis selection rather than raw data volume.

@Kelvin0110 Kelvin0110 requested a review from a team as a code owner February 5, 2026 07:53
@copy-pr-bot

copy-pr-bot bot commented Feb 5, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@cmunley1 cmunley1 self-requested a review February 5, 2026 22:41
@cmunley1
Contributor

cmunley1 commented Feb 5, 2026

can you please merge main?

@Kelvin0110
Author

Sure, I’ve merged the latest main branch. Please let me know if you’d like me to take any further steps.

@cmunley1
Contributor

cmunley1 commented Feb 6, 2026

have you tried training with NeMo RL (ideally we can test training before merging)? Also, I see you used a vision-language model, does anything require vision here (not an issue, just curious)?


@cmunley1 cmunley1 left a comment


need to pass dco and precommit

@Kelvin0110
Author

Thanks for checking.
Our resource server doesn’t require any vision. We selected Qwen/Qwen3-VL-8B-Thinking because this vision-language model provides stronger pure-text performance than the corresponding non-VL models (e.g., qwen3-8b-thinking). Since our tasks involve relatively complex reasoning, using the stronger model helps ensure a more stable and reliable reward distribution.

@newtdes

newtdes commented Feb 10, 2026

Also, for DCO and pre-commit, we will handle those to ensure our pull request passes both checks.

@Kelvin0110 Kelvin0110 force-pushed the cmunley1/newton branch 2 times, most recently from 62303b5 to ae2236a on February 16, 2026 12:05
Signed-off-by: Kelvin0110 <kwtamai@connect.ust.hk>
@Kelvin0110
Author

The DCO and pre-commit checks are now resolved.
For the unit tests, some tests fail because they require NewtonBench to be present in the repository root. Currently we assume users will clone NewtonBench manually; once it is cloned, all unit tests pass.
To ensure the automated unit-test checks pass consistently, should we add NewtonBench as a submodule?

