## Problem
The current `AgentNativeStepEnvManager` (used in `run_agentic_pipeline_rock_swe_qwen35_2b.sh`) creates one training sample per interaction chunk (agent turn). For a trajectory with K turns, this means:
- K separate forward/backward passes through the model per trajectory
- K optimizer steps per pipeline step (LR scheduler advances K× too fast)
- K× more compute than necessary, since each chunk is padded to `sequence_length` independently
- Increased async staleness: K optimizer steps between weight syncs means more policy drift
With typical multi-turn agentic tasks (10-25 turns per trajectory), this overhead is substantial. For example, with a cosine LR schedule over 200 pipeline steps and 15 turns/trajectory, the scheduler is advanced 3,000 times, exhausting the 200-step schedule after only ~13 pipeline steps.
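The scheduler arithmetic above can be checked with a few lines of Python. `cosine_lr` here is a generic stand-in for whatever cosine schedule the pipeline actually uses, not ROLL's implementation:

```python
import math

def cosine_lr(step, total_steps, base_lr=1e-6, min_lr=0.0):
    """Cosine decay from base_lr to min_lr over total_steps; clamps past the end."""
    t = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))

TOTAL, TURNS = 200, 15  # schedule length (pipeline steps) and turns per trajectory

# Step mode advances the scheduler TURNS times per pipeline step, so the
# 200-step schedule is spent after about 200 / 15 ≈ 13 pipeline steps;
# from then on the LR sits at min_lr for the rest of training.
for pipeline_step in (1, 13, 14):
    effective_advances = pipeline_step * TURNS
    print(pipeline_step, cosine_lr(effective_advances, TOTAL))
```

In traj mode the scheduler advances once per pipeline step, so `cosine_lr(pipeline_step, TOTAL)` tracks the intended schedule exactly.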
## Proposal
Add a `formulate_mode: "traj"` option to `AgentNativeStepEnvManager` that packs all chunks from a trajectory into a single training sample:

```
[prompt₁ | response₁ | prompt₂ | response₂ | ... | promptₖ | responseₖ | padding]
```
With this layout:
- `response_mask` has K contiguous segments of 1s (one per chunk)
- Each chunk's IS ratio is computed independently via `compute_segment_masked_mean` (a geometric mean within each segment)
- Per-chunk discounted returns `G_k = γ^{K-k} × R_final` are assigned to each response segment
- One forward/backward pass per trajectory, one optimizer step per pipeline step
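To make the layout concrete, here is a minimal sketch of traj-mode packing. `pack_trajectory` and its chunk dict fields are illustrative assumptions for this proposal, not ROLL's existing API:

```python
def pack_trajectory(chunks, gamma, final_reward, pad_id=0, seq_len=32):
    """Pack one trajectory's chunks into a single padded training sample.

    Each chunk is assumed to carry `prompt_ids` and `response_ids`.
    Returns (input_ids, response_mask, step_rewards), where response_mask
    has K contiguous 1-segments and each response segment carries its
    discounted return G_k = gamma^(K-k) * final_reward.
    """
    input_ids, response_mask, step_rewards = [], [], []
    K = len(chunks)
    for k, chunk in enumerate(chunks, start=1):
        g_k = gamma ** (K - k) * final_reward  # per-chunk discounted return
        input_ids += chunk["prompt_ids"] + chunk["response_ids"]
        response_mask += [0] * len(chunk["prompt_ids"]) + [1] * len(chunk["response_ids"])
        step_rewards += [0.0] * len(chunk["prompt_ids"]) + [g_k] * len(chunk["response_ids"])
    pad = seq_len - len(input_ids)  # one padding region at the end, not per chunk
    input_ids += [pad_id] * pad
    response_mask += [0] * pad
    step_rewards += [0.0] * pad
    return input_ids, response_mask, step_rewards
```

Note that padding appears once at the tail of the packed sequence, which is where the compute saving over per-chunk padding comes from.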
### Config example

```yaml
custom_envs:
  MyEnv:
    formulate_mode: "traj"   # new option (default: "step" for backward compat)
    env_type: "openreward_env"
    max_steps: 25
    # ...
```

### What needs to change
- `formulate_rollouts` dispatcher in `agent_native_env_manager.py`: route to step or traj formulation based on config
- Traj-mode token assembly: concatenate all chunks' `prompt_ids + response_ids` into one sequence and build a multi-segment `response_mask`
- `compute_discounted_returns` in `utils.py`: detect traj mode (`step_scores` is a list vs. a scalar) and compute token-level `step_rewards` using `response_mask` segment boundaries
- Edge-case guards: handle trajectories with missing `response_ids` (observation-only trailing entries), empty trajectories, and `adjust_batch` sample duplication
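The segment-boundary logic that both the per-chunk IS ratio and the token-level rewards rely on can be sketched as follows. `segment_bounds` and `segment_geomean_ratio` are hypothetical helpers illustrating the intent; ROLL's actual `compute_segment_masked_mean` may be implemented differently:

```python
import math

def segment_bounds(mask):
    """Return (start, end) for each contiguous run of 1s in mask, end exclusive."""
    bounds, start = [], None
    for i, m in enumerate(list(mask) + [0]):  # sentinel 0 closes a trailing run
        if m and start is None:
            start = i
        elif not m and start is not None:
            bounds.append((start, i))
            start = None
    return bounds

def segment_geomean_ratio(log_ratios, mask):
    """Per-segment geometric mean of IS ratios, computed as exp(mean log-ratio).

    Averaging in log space before exponentiating is what makes the mean
    geometric rather than arithmetic within each response segment.
    """
    return [
        math.exp(sum(log_ratios[s:e]) / (e - s))
        for s, e in segment_bounds(mask)
    ]
```

The same `segment_bounds` output would let `compute_discounted_returns` scatter each `G_k` onto its response segment's token positions.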
## Benefits

| | Step mode | Traj mode |
|---|---|---|
| Forward/backward passes per trajectory | K | 1 |
| Optimizer steps per pipeline step | K | 1 |
| LR scheduler accuracy | K× too fast | Correct |
| Compute (relative) | 7× | 1× |
| Async staleness | High | Low |
## Our experience
We implemented this for IPA chunk-level loss training on kanishk/EndlessTerminals (2,490 terminal-based coding tasks) with Qwen3.5-2B on 8× A100-40GB. Results over 700 training steps:
- Success rate: 28% → 42% (+50% relative improvement)
- Action efficiency: 18.2 → 11.2 avg actions/task (-39%)
- Validation score: 50-75% on held-out tasks
- Zero training instability: `grad_norm` stable at ~1.0, IS ratio ~0.99
The trajectory-level formulation was essential for making IPA work correctly: without it, the LR scheduler was exhausted by step ~13 and the model failed to learn.
## Environment
- ROLL version: latest main branch
- Python 3.12, PyTorch 2.10, vLLM, Megatron-Core
- Tested with OpenReward environments and `adv_estimator: "step_reinforce"` + `pg_variant: "ipa_chunk"`
I'd be happy to discuss the design or contribute a PR if this feature direction is accepted.
Author: @shamanez (gshasiri@gmail.com)