Add NewtonBench Resource Server #650
can you please merge main?
Sure, I’ve merged the latest main branch. Please let me know if you’d like me to take any further steps.
have you tried training with NeMo RL (ideally we can test training before merging)? Also, I see you used a vision language model. Does anything require vision here (not an issue, just curious)?
DCO is failing, can you try to resolve that? https://docs.nvidia.com/nemo/gym/latest/contribute/development-setup.html#dco-and-commit-signing Also see here: https://docs.nvidia.com/nemo/gym/latest/contribute/environments/new-environment.html#contribution-workflow
please also run the pre-commit checks (e.g. ruff): https://docs.nvidia.com/nemo/gym/latest/contribute/development-setup.html#pre-commit-hook-failures
cmunley1
left a comment
need to pass dco and precommit
Thanks for checking.
Also, for DCO and pre-commit, we will handle that to ensure our pull request passes both checks.
62303b5 to ae2236a
Signed-off-by: Kelvin0110 <kwtamai@connect.ust.hk>
ae2236a to 561beb2
The DCO and pre-commit checks are now resolved.
Contributing To NeMo-Gym (NewtonBench Resource Server)
1) Basic information
i. Description of the environment
A resource server wrapping the NewtonBench benchmark
run_experiment: Query the environment with specific parameters to receive physical observations.
execute_python: (Optional) Python code-assisted discovery for complex data analysis.
ii. Description of the verification logic
The verifier uses the NewtonBench evaluation suite to score the agent's proposed scientific law:
The proposed law is taken from <final_law> tags in the assistant's final response.
reward = 0.3 * R_symbolic + 0.7 * R_numeric.
The /verify endpoint processes the agent's submission and returns these detailed performance metrics.
iii. Description of the prompts/tasks (source + domain)
Domain: Maths (Scientific Law Discovery).
Source: Tasks and prompts adapted from the NewtonBench benchmark, which instruct the agent to discover a specific shifted scientific law (e.g., Newton's Law of Gravitation, Snell's Law) by performing interactive experiments.
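As a rough illustration of the verification logic in ii above, the sketch below pulls a law out of <final_law> tags and combines two component scores with the stated 0.3 / 0.7 weights. It is not the resource server's actual code: the function names, the closing </final_law> tag, and the choice to take the last tagged block are all assumptions, and how R_symbolic and R_numeric are computed by the NewtonBench evaluation suite is not shown here.

```python
import re
from typing import Optional

# Weights taken from the formula in the PR description.
W_SYMBOLIC = 0.3
W_NUMERIC = 0.7

# Assumption: the law is wrapped as <final_law>...</final_law>.
_FINAL_LAW_RE = re.compile(r"<final_law>(.*?)</final_law>", re.DOTALL)


def extract_final_law(response_text: str) -> Optional[str]:
    """Return the contents of the last <final_law> block, or None if absent.

    Taking the last block is an assumption made for this sketch.
    """
    matches = _FINAL_LAW_RE.findall(response_text)
    if not matches:
        return None
    return matches[-1].strip()


def combine_reward(r_symbolic: float, r_numeric: float) -> float:
    """Weighted sum: reward = 0.3 * R_symbolic + 0.7 * R_numeric."""
    return W_SYMBOLIC * r_symbolic + W_NUMERIC * r_numeric
```

For example, a submission with R_symbolic = 1.0 and R_numeric = 0.5 would receive 0.3 * 1.0 + 0.7 * 0.5 = 0.65 under this formula.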
iv. License information
2) Environment validity check
i. Commands used to collect rollouts
ii. Resulting rollouts (5 examples)
See resources_servers/newton_bench/data/example_rollouts.jsonl
Expected behavior:
3) Tests
i. Commands used to run the tests
source resources_servers/newton_bench/.venv/bin/activate
pytest resources_servers/newton_bench/tests/test_app.py
Coverage notes:
Resource server tests provide comprehensive coverage of the following areas:
4) Reward profiling
Models: Qwen/Qwen3-VL-8B-Thinking
Method:
run_experiment enabled and agent loops until law submission.
Results:
Overall Metrics
Tool Call Statistics
Reward Distribution (Buckets)
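A bucketed reward distribution of this kind could be produced along the following lines. This is a minimal sketch, not the script used for the profiling in this PR: the helper name `bucket_rewards`, the five equal-width buckets, and the assumption that rewards lie in [0, 1] are all hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def bucket_rewards(rewards: Iterable[float], n_buckets: int = 5) -> Dict[str, int]:
    """Count rewards in equal-width buckets over [0, 1].

    Assumption: rewards are expected in [0, 1]; values outside are clamped.
    The last bucket is closed on the right so that 1.0 is included.
    """
    counts: Counter = Counter()
    for r in rewards:
        clamped = min(max(r, 0.0), 1.0)
        idx = min(int(clamped * n_buckets), n_buckets - 1)
        counts[idx] += 1

    result: Dict[str, int] = {}
    for i in range(n_buckets):
        lo, hi = i / n_buckets, (i + 1) / n_buckets
        closing = "]" if i == n_buckets - 1 else ")"
        result[f"[{lo:.1f}, {hi:.1f}{closing}"] = counts[i]
    return result
```

Calling `bucket_rewards` on a list of per-rollout rewards returns a mapping from bucket label to count, which can then be printed as a table.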
Performance by Tool Call Count Bins
Key observations: