Question: How will intermediate rewards be used? #390

@chenhr18thu

As far as I understand, each rollout is split into n samples, where n is the number of turns in the conversation.
In daemon.py, I noticed that the rollout's final reward is broadcast to every sample's token-level scores.
The documentation does not state this clearly.
I also asked GPT-5.1-codex-max several times, and it reached the same conclusion each time.
Could you help me understand how intermediate rewards are meant to be used?

Below is the evidence from daemon.py.

`final_reward = self._fillna_reward(rollout)

if not rollout.triplets:
    finished_id_to_final_reward[rollout_id] = final_reward
    print(f"Warning: No triplets found for training rollout {rollout.rollout_id}, skipping.")
    continue

# The client should report triplets that contain prompt_ids and response_ids.
# Example triplet.prompt: {"token_ids": [...]}
# Example triplet.response: {"token_ids": [...]}
trace_list = [
    {"prompt_ids": t.prompt.get("token_ids", []), "response_ids": t.response.get("token_ids", [])}
    for t in rollout.triplets
]
info = {
    "reward": final_reward,
    "trace_list": trace_list,
    "data_id": original_sample["data_id"],
}
finished_id_to_sample_info[rollout_id] = info
finished_id_to_final_reward[rollout_id] = final_reward`

...

`for rollout_id, sample_info in finished_id_to_sample_info.items():
    for turn_index, trace in enumerate(sample_info["trace_list"]):

        reward_list.append(sample_info["reward"])`

...

`scores = torch.tensor(reward_list, dtype=torch.bfloat16).to(device)

token_level_scores = torch.zeros_like(attention_mask, dtype=scores.dtype)

eos_mask_idx = torch.argmax(position_ids * attention_mask, dim=-1)
token_level_scores[torch.arange(n_transition), eos_mask_idx] = scores

token_level_scores = token_level_scores[:, -max_response_length:]`
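
To make the question concrete, here is a minimal pure-Python sketch of the behavior I believe the excerpts above produce. It is only my reading of the code, not the actual daemon.py logic. The helper names `broadcast_final_reward` and `place_on_last_token` and the toy inputs are hypothetical, and I approximate `torch.argmax(position_ids * attention_mask, dim=-1)` by taking the last index where the attention mask is 1, which assumes each row has at least one attended token and that position ids increase along the row.

```python
from typing import Dict, List


def broadcast_final_reward(
    trace_lists: Dict[str, List[dict]],
    final_rewards: Dict[str, float],
) -> List[float]:
    """Repeat each rollout's final reward once per turn (one turn = one sample)."""
    reward_list = []
    for rollout_id, traces in trace_lists.items():
        for _trace in traces:
            # Every turn of the same rollout receives the same scalar reward.
            reward_list.append(final_rewards[rollout_id])
    return reward_list


def place_on_last_token(
    reward_list: List[float],
    attention_mask: List[List[int]],
) -> List[List[float]]:
    """Write each sample's reward at its last attended token; zeros elsewhere."""
    token_level_scores = []
    for reward, mask in zip(reward_list, attention_mask):
        row = [0.0] * len(mask)
        # Assumption: at least one position in the row is attended.
        last = max(i for i, m in enumerate(mask) if m == 1)
        row[last] = reward
        token_level_scores.append(row)
    return token_level_scores


# Hypothetical toy data: rollout "r1" has two turns, rollout "r2" has one.
trace_lists = {
    "r1": [{"response_ids": [1, 2]}, {"response_ids": [3]}],
    "r2": [{"response_ids": [4, 5, 6]}],
}
final_rewards = {"r1": 1.0, "r2": 0.5}
attention_mask = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
]

reward_list = broadcast_final_reward(trace_lists, final_rewards)
token_level_scores = place_on_last_token(reward_list, attention_mask)
```

If this reading is right, both turns of `r1` carry the same final reward on their last token, and no per-turn (intermediate) reward enters the scores anywhere. That is the behavior I would like confirmed or corrected.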
