Feature Request: Trajectory-level formulation mode for AgentNativeStepEnvManager #409

@shamanez

Description

Problem

The current AgentNativeStepEnvManager (used in run_agentic_pipeline_rock_swe_qwen35_2b.sh) creates one training sample per interaction chunk (agent turn). For a trajectory with K turns, this means:

  • K separate forward/backward passes through the model per trajectory
  • K optimizer steps per pipeline step (LR scheduler advances K× too fast)
  • K× more compute than necessary — each chunk is padded to sequence_length independently
  • Increased async staleness — K optimizer steps between weight syncs means more policy drift

With typical multi-turn agentic tasks (10-25 turns per trajectory), this overhead is substantial. For example, with a cosine LR schedule over 200 pipeline steps and 15 turns/trajectory, the LR scheduler effectively runs for 3,000 steps — exhausting the schedule in ~13 pipeline steps.
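The scheduler arithmetic above is easy to check directly (a small sketch; the variable names are illustrative, not ROLL's actual config keys):

```python
# Cosine schedule sized for the intended number of pipeline steps.
schedule_steps = 200
turns_per_trajectory = 15  # typical multi-turn agentic task

# In step mode every turn triggers an optimizer step, so the scheduler
# advances turns_per_trajectory times per pipeline step.
scheduler_steps_taken = schedule_steps * turns_per_trajectory  # 3000

# The 200-step schedule is therefore exhausted after only:
pipeline_steps_until_exhausted = schedule_steps // turns_per_trajectory  # 13
```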

Proposal

Add a formulate_mode: "traj" option to AgentNativeStepEnvManager that packs all chunks from a trajectory into a single training sample:

[prompt₁ | response₁ | prompt₂ | response₂ | ... | promptₖ | responseₖ | padding]

With this layout:

  • response_mask has K contiguous segments of 1s (one per chunk)
  • Each chunk's IS ratio is computed independently via compute_segment_masked_mean (geometric mean within each segment)
  • Per-chunk discounted returns G_k = γ^{K-k} × R_final are assigned to each response segment
  • One forward/backward pass per trajectory, one optimizer step per pipeline step
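A minimal sketch of the proposed packing, assuming each chunk carries `prompt_ids` and `response_ids` token lists. `segment_geometric_means` below is a stand-in for `compute_segment_masked_mean`, not ROLL's actual implementation — it just shows the per-segment geometric mean of IS ratios (exp of the mean log-ratio within each contiguous run of 1s):

```python
import math

def pack_trajectory(chunks, seq_len, pad_id=0):
    """Concatenate all chunks into one sequence and build a response_mask
    with one contiguous segment of 1s per chunk's response."""
    input_ids, response_mask = [], []
    for chunk in chunks:
        input_ids += chunk["prompt_ids"] + chunk["response_ids"]
        response_mask += [0] * len(chunk["prompt_ids"])
        response_mask += [1] * len(chunk["response_ids"])
    pad = seq_len - len(input_ids)
    return input_ids + [pad_id] * pad, response_mask + [0] * pad

def segment_geometric_means(log_ratios, response_mask):
    """Geometric mean of per-token IS ratios within each contiguous
    response segment (one value per chunk)."""
    means, start = [], None
    for i, m in enumerate(list(response_mask) + [0]):  # sentinel closes last segment
        if m and start is None:
            start = i
        elif not m and start is not None:
            seg = log_ratios[start:i]
            means.append(math.exp(sum(seg) / len(seg)))
            start = None
    return means
```

For a two-chunk trajectory, `pack_trajectory` yields one sequence with a two-segment mask, and `segment_geometric_means` returns one IS ratio per chunk.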

Config example

```yaml
custom_envs:
  MyEnv:
    formulate_mode: "traj"  # new option (default: "step" for backward compat)
    env_type: "openreward_env"
    max_steps: 25
    # ...
```
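Given this config option, the env manager could route between formulations with a simple dispatch (a hypothetical sketch — the function signature and callback style are guesses, not ROLL's actual API):

```python
def formulate_rollouts(trajectory, config, step_fn, traj_fn):
    """Dispatch to step- or trajectory-level formulation based on config."""
    mode = config.get("formulate_mode", "step")  # default keeps current behavior
    if mode == "traj":
        return traj_fn(trajectory)  # one training sample per trajectory
    if mode == "step":
        return step_fn(trajectory)  # one training sample per chunk (today)
    raise ValueError(f"unknown formulate_mode: {mode!r}")
```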

What needs to change

  1. formulate_rollouts dispatcher in agent_native_env_manager.py — route to step or traj formulation based on config
  2. Traj-mode token assembly — concatenate all chunks' prompt_ids + response_ids into one sequence, build multi-segment response_mask
  3. compute_discounted_returns in utils.py — detect traj mode (step_scores is a list vs scalar), compute token-level step_rewards using response_mask segment boundaries
  4. Edge case guards — handle trajectories with missing response_ids (observation-only trailing entries), empty trajectories, and adjust_batch sample duplication
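For item 3, the traj-mode branch could assign G_k = γ^(K−k) × R_final to every token of the k-th response segment, with segments recovered from the mask (a sketch under the assumptions above; the name and signature are illustrative):

```python
def traj_discounted_returns(response_mask, final_reward, gamma=0.99):
    """Assign G_k = gamma^(K-k) * R_final to each token of the k-th
    response segment (segments are contiguous runs of 1s in the mask)."""
    segments, start = [], None
    for i, m in enumerate(list(response_mask) + [0]):  # sentinel closes last segment
        if m and start is None:
            start = i
        elif not m and start is not None:
            segments.append((start, i))
            start = None
    K = len(segments)
    returns = [0.0] * len(response_mask)  # prompt/padding tokens stay 0
    for k, (s, e) in enumerate(segments, start=1):
        g = (gamma ** (K - k)) * final_reward
        for i in range(s, e):
            returns[i] = g
    return returns
```

Earlier segments receive more heavily discounted returns, and the last segment receives R_final undiscounted, matching the G_k formula in the proposal.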

Benefits

| Metric | Step mode | Traj mode |
|---|---|---|
| Forward/backward passes per trajectory | K | 1 |
| Optimizer steps per pipeline step | K | 1 |
| LR scheduler pacing | K× too fast | Correct |
| Compute (relative) | ~K× | ~1× |
| Async staleness | High | Low |

Our experience

We implemented this for IPA chunk-level loss training on kanishk/EndlessTerminals (2,490 terminal-based coding tasks) with Qwen3.5-2B on 8× A100-40GB. Results over 700 training steps:

  • Success rate: 28% → 42% (+50% relative improvement)
  • Action efficiency: 18.2 → 11.2 avg actions/task (-39%)
  • Validation score: 50-75% on held-out tasks
  • Zero training instability — grad_norm stable at ~1.0, IS ratio ~0.99

The trajectory-level formulation was essential for making IPA work correctly — without it, the LR scheduler was exhausted by step ~13, and the model failed to learn.

Environment

  • ROLL version: latest main branch
  • Python 3.12, PyTorch 2.10, vLLM, Megatron-Core
  • Tested with OpenReward environments and adv_estimator: "step_reinforce" + pg_variant: "ipa_chunk"

I'd be happy to discuss the design or contribute a PR if this feature direction is accepted.

Author: @shamanez (gshasiri@gmail.com)
