Question: How will intermediate rewards be used? #390

As far as I can tell, each rollout is split into n samples, where n is the number of rounds (turns) in the conversation.
In daemon.py, I noticed that **the final reward is broadcast to every sample's token-level score**.
The documentation does not state this clearly.
I also asked GPT-5.1-codex-max several times, and it reached the same conclusion each time.
**Please help me understand your intermediate reward logic.**

Below is the evidence from daemon.py.

```python
final_reward = self._fillna_reward(rollout)

if not rollout.triplets:
    finished_id_to_final_reward[rollout_id] = final_reward
    print(f"Warning: No triplets found for training rollout {rollout.rollout_id}, skipping.")
    continue

# The client should report triplets that contain prompt_ids and response_ids.
# Example triplet.prompt: {"token_ids": [...]}
# Example triplet.response: {"token_ids": [...]}
trace_list = [
    {"prompt_ids": t.prompt.get("token_ids", []), "response_ids": t.response.get("token_ids", [])}
    for t in rollout.triplets
]
info = {
    "reward": final_reward,
    "trace_list": trace_list,
    "data_id": original_sample["data_id"],
}
finished_id_to_sample_info[rollout_id] = info
finished_id_to_final_reward[rollout_id] = final_reward
```

...

```python
for rollout_id, sample_info in finished_id_to_sample_info.items():
    for turn_index, trace in enumerate(sample_info["trace_list"]):

        reward_list.append(sample_info["reward"])
```
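To make my reading concrete, here is a minimal standalone sketch of what I think this loop does. The function name `broadcast_final_reward` and the toy rollout data are mine and are not part of daemon.py; only the `reward` and `trace_list` keys come from the excerpt above.

```python
def broadcast_final_reward(finished_id_to_sample_info):
    """Return one reward per turn, copying each rollout's final reward to all its turns."""
    reward_list = []
    for rollout_id, sample_info in finished_id_to_sample_info.items():
        for _turn_index, _trace in enumerate(sample_info["trace_list"]):
            # Every turn receives the same rollout-level reward.
            reward_list.append(sample_info["reward"])
    return reward_list


# Hypothetical data: a 3-turn rollout with reward 1.0 and a 1-turn rollout with reward 0.0.
sample_info_by_id = {
    "rollout-a": {
        "reward": 1.0,
        "trace_list": [
            {"prompt_ids": [1, 2], "response_ids": [3]},
            {"prompt_ids": [1, 2, 3, 4], "response_ids": [5, 6]},
            {"prompt_ids": [1, 2, 3, 4, 5, 6, 7], "response_ids": [8]},
        ],
    },
    "rollout-b": {
        "reward": 0.0,
        "trace_list": [{"prompt_ids": [9], "response_ids": [10, 11]}],
    },
}

rewards = broadcast_final_reward(sample_info_by_id)
print(rewards)  # [1.0, 1.0, 1.0, 0.0]
```

If this sketch matches the real behavior, no per-turn (intermediate) reward ever reaches `reward_list`, which is the part I would like confirmed.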

...

```python
scores = torch.tensor(reward_list, dtype=torch.bfloat16).to(device)

token_level_scores = torch.zeros_like(attention_mask, dtype=scores.dtype)

eos_mask_idx = torch.argmax(position_ids * attention_mask, dim=-1)
token_level_scores[torch.arange(n_transition), eos_mask_idx] = scores

token_level_scores = token_level_scores[:, -max_response_length:]
```
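My understanding of this last excerpt is that each turn's score ends up on a single position, the last attended token, with zeros everywhere else. Below is a rough pure-Python re-enactment of that indexing so it can be checked without torch. The helper name `place_scores_at_last_token` and the toy `position_ids` / `attention_mask` values (left-padded prompt, right-padded response) are my assumptions, not taken from daemon.py.

```python
def place_scores_at_last_token(scores, position_ids, attention_mask, max_response_length):
    """Put each row's score on the index where position_ids * attention_mask is largest."""
    token_level_scores = []
    for score, pos_row, mask_row in zip(scores, position_ids, attention_mask):
        masked = [p * m for p, m in zip(pos_row, mask_row)]
        # First index of the maximum, standing in for torch.argmax along the last dim.
        last_idx = masked.index(max(masked))
        row = [0.0] * len(mask_row)
        row[last_idx] = score
        # Keep only the response part, as the slice [:, -max_response_length:] does.
        token_level_scores.append(row[-max_response_length:])
    return token_level_scores


# Two turns of the same rollout, so both carry the same final reward (assumed 1.0).
scores = [1.0, 1.0]
# Each row: 3 prompt slots followed by 3 response slots.
attention_mask = [
    [0, 1, 1, 1, 1, 0],  # one prompt pad on the left, one response pad on the right
    [1, 1, 1, 1, 1, 1],  # no padding
]
position_ids = [
    [0, 0, 1, 2, 3, 3],
    [0, 1, 2, 3, 4, 5],
]

result = place_scores_at_last_token(scores, position_ids, attention_mask, max_response_length=3)
print(result)  # [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
```

So under these assumptions, every turn gets the same final reward once, at its last response token, and nothing else in the row is non-zero.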

