Feature Request: Trajectory-level formulation mode for AgentNativeStepEnvManager #409

@shamanez

Description

Problem

The current AgentNativeStepEnvManager (used in run_agentic_pipeline_rock_swe_qwen35_2b.sh) creates one training sample per interaction chunk (agent turn). For a trajectory with K turns, this means:

  • K separate forward/backward passes through the model per trajectory
  • K optimizer steps per pipeline step (LR scheduler advances K× too fast)
  • K× more compute than necessary — each chunk is padded to sequence_length independently
  • Increased async staleness — K optimizer steps between weight syncs means more policy drift

With typical multi-turn agentic tasks (10-25 turns per trajectory), this overhead is substantial. For example, with a cosine LR schedule over 200 pipeline steps and 15 turns/trajectory, the LR scheduler effectively runs for 3,000 steps — exhausting the schedule in ~13 pipeline steps.
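The scheduler arithmetic above is easy to check directly (a small sketch; the variable names are illustrative, not ROLL's actual config keys):

```python
# Cosine schedule sized for the intended number of pipeline steps.
schedule_steps = 200
turns_per_trajectory = 15  # typical multi-turn agentic task

# In step mode every turn triggers an optimizer step, so the scheduler
# advances turns_per_trajectory times per pipeline step.
scheduler_steps_taken = schedule_steps * turns_per_trajectory  # 3000

# The 200-step schedule is therefore exhausted after only:
pipeline_steps_until_exhausted = schedule_steps // turns_per_trajectory  # 13
```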

Proposal

Add a formulate_mode: "traj" option to AgentNativeStepEnvManager that packs all chunks from a trajectory into a single training sample:

[prompt₁ | response₁ | prompt₂ | response₂ | ... | promptₖ | responseₖ | padding]

With this layout:

  • response_mask has K contiguous segments of 1s (one per chunk)
  • Each chunk's IS ratio is computed independently via compute_segment_masked_mean (geometric mean within each segment)
  • Per-chunk discounted returns G_k = γ^{K-k} × R_final are assigned to each response segment
  • One forward/backward pass per trajectory, one optimizer step per pipeline step
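A minimal sketch of the proposed packing, assuming each chunk carries `prompt_ids` and `response_ids` token lists. `segment_geometric_means` below is a stand-in for `compute_segment_masked_mean`, not ROLL's actual implementation — it just shows the per-segment geometric mean of IS ratios (exp of the mean log-ratio within each contiguous run of 1s):

```python
import math

def pack_trajectory(chunks, seq_len, pad_id=0):
    """Concatenate all chunks into one sequence and build a response_mask
    with one contiguous segment of 1s per chunk's response."""
    input_ids, response_mask = [], []
    for chunk in chunks:
        input_ids += chunk["prompt_ids"] + chunk["response_ids"]
        response_mask += [0] * len(chunk["prompt_ids"])
        response_mask += [1] * len(chunk["response_ids"])
    pad = seq_len - len(input_ids)
    return input_ids + [pad_id] * pad, response_mask + [0] * pad

def segment_geometric_means(log_ratios, response_mask):
    """Geometric mean of per-token IS ratios within each contiguous
    response segment (one value per chunk)."""
    means, start = [], None
    for i, m in enumerate(list(response_mask) + [0]):  # sentinel closes last segment
        if m and start is None:
            start = i
        elif not m and start is not None:
            seg = log_ratios[start:i]
            means.append(math.exp(sum(seg) / len(seg)))
            start = None
    return means
```

For a two-chunk trajectory, `pack_trajectory` yields one sequence with a two-segment mask, and `segment_geometric_means` returns one IS ratio per chunk.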

Config example

```yaml
custom_envs:
  MyEnv:
    formulate_mode: "traj"  # new option (default: "step" for backward compat)
    env_type: "openreward_env"
    max_steps: 25
    # ...
```
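Given this config option, the env manager could route between formulations with a simple dispatch (a hypothetical sketch — the function signature and callback style are guesses, not ROLL's actual API):

```python
def formulate_rollouts(trajectory, config, step_fn, traj_fn):
    """Dispatch to step- or trajectory-level formulation based on config."""
    mode = config.get("formulate_mode", "step")  # default keeps current behavior
    if mode == "traj":
        return traj_fn(trajectory)  # one training sample per trajectory
    if mode == "step":
        return step_fn(trajectory)  # one training sample per chunk (today)
    raise ValueError(f"unknown formulate_mode: {mode!r}")
```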

What needs to change

  1. formulate_rollouts dispatcher in agent_native_env_manager.py — route to step or traj formulation based on config
  2. Traj-mode token assembly — concatenate all chunks' prompt_ids + response_ids into one sequence, build multi-segment response_mask
  3. compute_discounted_returns in utils.py — detect traj mode (step_scores is a list vs scalar), compute token-level step_rewards using response_mask segment boundaries
  4. Edge case guards — handle trajectories with missing response_ids (observation-only trailing entries), empty trajectories, and adjust_batch sample duplication
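For item 3, the traj-mode branch could assign G_k = γ^(K−k) × R_final to every token of the k-th response segment, with segments recovered from the mask (a sketch under the assumptions above; the name and signature are illustrative):

```python
def traj_discounted_returns(response_mask, final_reward, gamma=0.99):
    """Assign G_k = gamma^(K-k) * R_final to each token of the k-th
    response segment (segments are contiguous runs of 1s in the mask)."""
    segments, start = [], None
    for i, m in enumerate(list(response_mask) + [0]):  # sentinel closes last segment
        if m and start is None:
            start = i
        elif not m and start is not None:
            segments.append((start, i))
            start = None
    K = len(segments)
    returns = [0.0] * len(response_mask)  # prompt/padding tokens stay 0
    for k, (s, e) in enumerate(segments, start=1):
        g = (gamma ** (K - k)) * final_reward
        for i in range(s, e):
            returns[i] = g
    return returns
```

Earlier segments receive more heavily discounted returns, and the last segment receives R_final undiscounted, matching the G_k formula in the proposal.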

Benefits

| Metric | Step mode | Traj mode |
|---|---|---|
| Forward/backward passes per trajectory | K | 1 |
| Optimizer steps per pipeline step | K | 1 |
| LR scheduler pacing | K× too fast | Correct |
| Compute (relative) | ~K× | ~1× |
| Async staleness | High | Low |

Our experience

We implemented this for IPA chunk-level loss training on kanishk/EndlessTerminals (2,490 terminal-based coding tasks) with Qwen3.5-2B on 8× A100-40GB. Results over 700 training steps:

  • Success rate: 28% → 42% (+50% relative improvement)
  • Action efficiency: 18.2 → 11.2 avg actions/task (-39%)
  • Validation score: 50-75% on held-out tasks
  • Zero training instability — grad_norm stable at ~1.0, IS ratio ~0.99

The trajectory-level formulation was essential for making IPA work correctly — without it, the LR scheduler was exhausted by step ~13, and the model failed to learn.

Environment

  • ROLL version: latest main branch
  • Python 3.12, PyTorch 2.10, vLLM, Megatron-Core
  • Tested with OpenReward environments and adv_estimator: "step_reinforce" + pg_variant: "ipa_chunk"

I'd be happy to discuss the design or contribute a PR if this feature direction is accepted.

Author: @shamanez (gshasiri@gmail.com)
