Add NewtonBench Resource Server #650

Open
Kelvin0110 wants to merge 2 commits into NVIDIA-NeMo:main from Kelvin0110:cmunley1/newton

Conversation

@Kelvin0110

Contributing To NeMo-Gym (NewtonBench Resource Server)

1) Basic information

i. Description of the environment

A resource server wrapping the NewtonBench benchmark:

  • Tasks: 324 scientific law discovery tasks across 12 physics domains.
  • Observation Space: Experimental results (numeric or structured dictionaries) returned after tool use.
  • Tools:
    • run_experiment: Query the environment with specific parameters to receive physical observations (see the illustrative call after this list).
    • execute_python: (Optional) Python code-assisted discovery for complex data analysis.
  • Server: FastAPI resource server following NeMo Gym conventions.
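
For illustration, a hedged sketch of what a run_experiment tool call and its observation might look like; the argument names, value types, and result fields are hypothetical and depend on the physics module that was seeded.

```python
# Hedged illustration of a run_experiment tool call and the kind of observation
# it returns; argument names and result fields are hypothetical and vary per
# NewtonBench physics module.
tool_call = {
    "name": "run_experiment",
    "arguments": {"m1": 5.0, "m2": 3.0, "r": 2.0},  # experiment inputs chosen by the agent
}

# The resource server answers with an experimental observation, e.g.:
observation = {"force": 2.5e-10}  # numeric result or structured dictionary
```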

ii. Description of the verification logic

The verifier uses the NewtonBench evaluation suite to score the agent's proposed scientific law:

  • Law Extraction: Attempts to find a law within <final_law> tags in the assistant's final response.
  • Success Criteria: Evaluates both symbolic equivalence (via an LLM judge) and numeric accuracy (Root Mean Square Logarithmic Error - RMSLE).
  • Reward Calculation (see the sketch after this list):
    • $\text{reward} = 0.3 \cdot R_{\text{symbolic}} + 0.7 \cdot R_{\text{numeric}}$.
      • $R_{\text{symbolic}}$ is $1.0$ if the judge deems the laws equivalent, $-1.0$ otherwise.
      • $R_{\text{numeric}} = 1.0 - \dfrac{2.0 \cdot \text{RMSLE}}{\text{RMSLE} + 3.0}$, yielding a score in $(-1, 1]$.
  • /verify endpoint processes the agent's submission and returns these detailed performance metrics.
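
For reference, a minimal sketch of the reward combination described above; the actual scoring lives in the NewtonBench evaluation suite, and the helper signature here is illustrative only.

```python
# Minimal sketch of the reward combination; the real implementation is in the
# NewtonBench evaluation suite, and this helper's signature is illustrative.
def compute_reward(symbolically_equivalent: bool, rmsle: float) -> float:
    """reward = 0.3 * R_symbolic + 0.7 * R_numeric."""
    r_symbolic = 1.0 if symbolically_equivalent else -1.0
    # R_numeric = 1 - 2*RMSLE / (RMSLE + 3): equals 1.0 at RMSLE = 0 and
    # approaches -1.0 as RMSLE grows, so it lies in (-1, 1].
    r_numeric = 1.0 - (2.0 * rmsle / (rmsle + 3.0))
    return 0.3 * r_symbolic + 0.7 * r_numeric

# An exact discovery (equivalent law, zero numeric error) scores 1.0:
assert compute_reward(True, 0.0) == 1.0
```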

iii. Description of the prompts/tasks (source + domain)

Domain: Maths (Scientific Law Discovery).
Source: Tasks and prompts adapted from the NewtonBench benchmark, which instruct the agent to discover a specific shifted scientific law (e.g., Newton's Law of Gravitation, Snell's Law) by performing interactive experiments.
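
To give a hedged sense of what "shifted" means here (the exact perturbations are defined by NewtonBench and differ per task): instead of the textbook form $F = G\,\frac{m_1 m_2}{r^2}$, a task might hide a modified variant such as $F = G\,\frac{m_1 m_2}{r^3}$, so the agent cannot simply recall the known law and must infer the altered form from its own experiments.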

iv. License information

  • Code: Apache 2.0.
  • Data: Apache 2.0.
  • NewtonBench Benchmark: MIT (Copyright (c) 2025 HKUST-KnowComp).

2) Environment validity check

i. Commands used to collect rollouts

```bash
# Start NeMo Gym servers (agent + NewtonBench)
config_paths="resources_servers/newton_bench/configs/newton_bench.yaml,\
responses_api_models/vllm_model/configs/vllm_model.yaml"
ng_run "+config_paths=[$config_paths]"

# Collect sample rollouts
ng_collect_rollouts \
    +agent_name=newton_bench_simple_agent \
    +input_jsonl_fpath=resources_servers/newton_bench/data/example.jsonl \
    +output_jsonl_fpath=resources_servers/newton_bench/data/example_rollouts.jsonl \
    +limit=5

# View rollouts
ng_viewer +jsonl_fpath=resources_servers/newton_bench/data/example_rollouts.jsonl
```
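
For orientation, one prompt line in example.jsonl might look roughly like the sketch below; the field names are hypothetical, and the checked-in example.jsonl is the authoritative schema.

```python
# Hypothetical shape of a single prompt record in example.jsonl; field names
# are illustrative only -- consult the actual example.jsonl for the real schema.
import json

example_record = {
    "module": "gravitation",  # which NewtonBench physics module to seed
    "prompt": "Run experiments to discover the shifted law relating F, m1, m2, and r.",
}
print(json.dumps(example_record))
```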

ii. Resulting rollouts (5 examples)

See resources_servers/newton_bench/data/example_rollouts.jsonl
Expected behavior:

  • Agent performs several experiments, analyzes data, and submits a scientific law.
  • Successful discovery $\rightarrow$ positive reward ($\approx$ 1.0).
  • Failed discovery $\rightarrow$ reward $\approx$ 0.0 or negative.

3) Tests

i. Commands used to run the tests

source resources_servers/newton_bench/.venv/bin/activate 
pytest resources_servers/newton_bench/tests/test_app.py

Coverage notes:
Resource server tests provide comprehensive coverage of the following areas:

  • Session Lifecycle: Successful seeding, error handling for invalid modules, session ending, and background cleanup.
  • Experiment Execution: Dynamic handler registration for each module, basic run_experiment execution, and error handling for uninitialized sessions, mismatched module calls, etc.
  • Python Sandbox: Basic execution, session-based code persistence, timeout enforcement, and security validation (restricting dangerous imports/operations).
  • Verification Logic: Law extraction from diverse response structures, and reward calculation via symbolic equivalence (LLM judge) and numeric RMSLE (a minimal test sketch follows this list).
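
As a hedged illustration of that last area, a test of this style might look like the sketch below; extract_final_law is a placeholder, not the resource server's actual helper.

```python
# Sketch of a verification-logic test; extract_final_law is a placeholder for
# the resource server's actual law-extraction helper.
import re

def extract_final_law(text: str) -> str | None:
    """Pull the proposed law out of <final_law>...</final_law> tags."""
    match = re.search(r"<final_law>(.*?)</final_law>", text, re.DOTALL)
    return match.group(1).strip() if match else None

def test_law_extraction_from_final_response():
    response = (
        "Fitting the data suggests an inverse-cube dependence. "
        "<final_law>F = G * m1 * m2 / r**3</final_law>"
    )
    assert extract_final_law(response) == "F = G * m1 * m2 / r**3"

def test_missing_final_law_returns_none():
    assert extract_final_law("I could not determine the law.") is None
```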

4) Reward profiling

Model: Qwen/Qwen3-VL-8B-Thinking

Method:

  • 108 prompts based on version v0 of the scientific laws.
  • 4 rollouts per prompt (432 total).
  • Tool calling of run_experiment enabled; the agent loops until it submits a law (see the aggregation sketch below).
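
A hedged sketch of how the summary statistics below can be aggregated from the collected rollouts; the file path and the reward / num_tool_calls field names are assumptions about the rollout JSONL schema rather than the actual keys.

```python
# Hedged aggregation sketch; the path and the "reward" / "num_tool_calls" field
# names are assumptions about the rollout JSONL schema, not the actual keys.
import json
import statistics

with open("resources_servers/newton_bench/data/profiling_rollouts.jsonl") as f:
    rollouts = [json.loads(line) for line in f]

rewards = [r["reward"] for r in rollouts]
tool_calls = [r["num_tool_calls"] for r in rollouts]

print("total rollouts:", len(rewards))
print("mean / median reward:", statistics.mean(rewards), statistics.median(rewards))
print("min / max reward:", min(rewards), max(rewards))
# Pearson correlation between tool-call count and reward (Python >= 3.10)
print("tool-call vs reward correlation:", statistics.correlation(tool_calls, rewards))
```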

Results:
Overall Metrics

  • Total Rollouts: 432
  • Mean Reward: $\approx$ 0.0675
  • Median Reward: 0.0
  • Min Reward: $\approx$ -0.8786
  • Max Reward: 1.0

Tool Call Statistics

  • Average Tool Calls: 22.95 per rollout
  • Min Tool Calls: 0
  • Max Tool Calls: 1770
  • Correlation (tool calls $\leftrightarrow$ reward): $\approx$ -0.0211 (Weak negative correlation)

Reward Distribution (Buckets)

| Reward Range | Count |
| --- | --- |
| [-1.0, -0.8) | 16 |
| [-0.8, -0.6) | 16 |
| [-0.6, -0.4) | 60 |
| [-0.4, -0.2) | 39 |
| [-0.2, 0.0) | 24 |
| [0.0, 0.2) | 150 |
| [0.2, 0.4) | 46 |
| [0.4, 0.6) | 2 |
| [0.6, 0.8) | 1 |
| [0.8, 1.0] | 78 |

Performance by Tool Call Count Bins

| Tool Call Range | Rollouts (n) | Mean Reward |
| --- | --- | --- |
| 0 | 23 | $\approx$ -0.1112 |
| 1–10 | 329 | $\approx$ 0.0824 |
| 11–50 | 60 | $\approx$ 0.1308 |
| 51–200 | 15 | $\approx$ -0.1959 |
| 201–2000 | 5 | $\approx$ -0.0600 |

Key observations:

  • Symbolic Accuracy: Symbolic accuracy of approximately 19.7% and a wide spread in the RMSLE distribution indicate frequent failures to recover the exact symbolic form or precise numeric behavior.
  • Reward Distribution: Rewards cluster near zero (median 0.0, mean ~0.0675) with a long tail and many negative outcomes, reflecting frequent partial or failed discoveries.
  • Tool Usage Sweet Spot: Positive performance is observed with moderate tool use (1–50 calls), with a peak in the 11–50 range, suggesting that tool-driven data collection is critical for inducing scientific laws.
  • Diminishing Returns: Performance declines sharply beyond 50 calls, indicating that additional tool calls become detrimental and that successful discovery depends on reasoning and hypothesis selection rather than raw data volume.

@Kelvin0110 Kelvin0110 requested a review from a team as a code owner February 5, 2026 07:53
@copy-pr-bot

copy-pr-bot bot commented Feb 5, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@cmunley1 cmunley1 self-requested a review February 5, 2026 22:41
@cmunley1
Contributor

cmunley1 commented Feb 5, 2026

can you please merge main?

@Kelvin0110
Author

Sure, I’ve merged the latest main branch. Please let me know if you’d like me to take any further steps.

@cmunley1
Contributor

cmunley1 commented Feb 6, 2026

have you tried training with NeMo RL (ideally we can test training before merging)? Also, I see you used a vision-language model, does anything require vision here (not an issue, just curious)?


@cmunley1 cmunley1 left a comment


need to pass dco and precommit

@Kelvin0110
Author

Thanks for checking.
Our resource server doesn’t require any vision. We selected Qwen/Qwen3-VL-8B-Thinking because this vision-language model provides stronger pure-text performance than the corresponding non-VL models (e.g., qwen3-8b-thinking). Since our tasks involve relatively complex reasoning, using the stronger model helps ensure a more stable and reliable reward distribution.

@newtdes

newtdes commented Feb 10, 2026

Also, for DCO and pre-commit, we will handle those to ensure our pull request passes both checks.

@Kelvin0110 Kelvin0110 force-pushed the cmunley1/newton branch 2 times, most recently from 62303b5 to ae2236a on February 16, 2026 12:05
Signed-off-by: Kelvin0110 <kwtamai@connect.ust.hk>
@Kelvin0110
Author

The DCO and pre-commit checks are now resolved.
For the unit tests, some tests fail because they require NewtonBench to be present in the repository root. Currently we assume users will clone NewtonBench manually; once it is cloned, all unit tests pass.
To ensure the automated unit-test checks pass consistently, should we add NewtonBench as a submodule?

