Open
Labels
credit assignment, question, verl
Description
As far as I understand, each rollout is split into n samples (where n is the number of turns in the conversation).
In daemon.py, I noticed that the final reward is broadcast to every sample's token-level score, so every turn of a rollout receives the same reward.
The documentation does not state this clearly.
I also asked GPT-5.1-codex-max several times, and it reached the same conclusion each time.
Please help me understand your intermediate reward logic.
Below is the evidence from daemon.py.
`final_reward = self._fillna_reward(rollout)
if not rollout.triplets:
    finished_id_to_final_reward[rollout_id] = final_reward
    print(f"Warning: No triplets found for training rollout {rollout.rollout_id}, skipping.")
    continue
# The client should report triplets that contain prompt_ids and response_ids.
# Example triplet.prompt: {"token_ids": [...]}
# Example triplet.response: {"token_ids": [...]}
trace_list = [
    {"prompt_ids": t.prompt.get("token_ids", []), "response_ids": t.response.get("token_ids", [])}
    for t in rollout.triplets
]
info = {
    "reward": final_reward,
    "trace_list": trace_list,
    "data_id": original_sample["data_id"],
}
finished_id_to_sample_info[rollout_id] = info
finished_id_to_final_reward[rollout_id] = final_reward`
...
`for rollout_id, sample_info in finished_id_to_sample_info.items():
    for turn_index, trace in enumerate(sample_info["trace_list"]):
        reward_list.append(sample_info["reward"])`
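To make my reading of this loop concrete, here is a minimal standalone sketch. It is not the actual daemon.py code: the rollout IDs, rewards, and trace entries are made-up values, and only the loop structure is copied from the snippet above. If my reading is right, each rollout's final reward is repeated once per turn.

```python
# Hypothetical stand-in for finished_id_to_sample_info; all values are invented.
finished_id_to_sample_info = {
    "rollout-a": {"reward": 1.0, "trace_list": [{"turn": 0}, {"turn": 1}, {"turn": 2}]},
    "rollout-b": {"reward": 0.0, "trace_list": [{"turn": 0}, {"turn": 1}]},
}

reward_list = []
for rollout_id, sample_info in finished_id_to_sample_info.items():
    for turn_index, trace in enumerate(sample_info["trace_list"]):
        # Neither turn_index nor trace affects the value: every turn
        # gets the rollout's single final reward.
        reward_list.append(sample_info["reward"])

print(reward_list)  # [1.0, 1.0, 1.0, 0.0, 0.0]
```

The three turns of the first rollout all carry 1.0 and the two turns of the second all carry 0.0, so no per-turn (intermediate) reward appears anywhere in the list.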
...
` scores = torch.tensor(reward_list, dtype=torch.bfloat16).to(device)
token_level_scores = torch.zeros_like(attention_mask, dtype=scores.dtype)
eos_mask_idx = torch.argmax(position_ids * attention_mask, dim=-1)
token_level_scores[torch.arange(n_transition), eos_mask_idx] = scores
token_level_scores = token_level_scores[:, -max_response_length:]`
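The following sketch shows how I read this last snippet. It uses plain Python lists instead of tensors so that it runs without torch, and everything in it is an assumption for illustration: the mask and position values, the layout (left-padded prompt followed by right-padded response), and `max_response_length = 4`. The point is that each turn's score lands on its last non-padding token, and that score is the same final reward for every turn.

```python
# Plain-Python stand-in for the tensor code; all values are hypothetical.
max_response_length = 4

# Two turns of one rollout, assumed layout: [left-padded prompt | right-padded response].
attention_mask = [
    [0, 1, 1, 1, 1, 1, 0],  # turn 0: 2 prompt tokens, 3 response tokens
    [1, 1, 1, 1, 1, 0, 0],  # turn 1: 3 prompt tokens, 2 response tokens
]
position_ids = [
    [0, 0, 1, 2, 3, 4, 4],
    [0, 1, 2, 3, 4, 4, 4],
]
scores = [1.0, 1.0]  # the same final reward, repeated once per turn

token_level_scores = [[0.0] * len(row) for row in attention_mask]
for i, (pos_row, mask_row) in enumerate(zip(position_ids, attention_mask)):
    masked = [p * m for p, m in zip(pos_row, mask_row)]
    # Plays the role of torch.argmax(position_ids * attention_mask, dim=-1):
    # the index of the last non-padding token in the row.
    last_token_idx = max(range(len(masked)), key=masked.__getitem__)
    token_level_scores[i][last_token_idx] = scores[i]

# Keep only the response part, as in token_level_scores[:, -max_response_length:].
token_level_scores = [row[-max_response_length:] for row in token_level_scores]

for row in token_level_scores:
    print(row)
# [0.0, 0.0, 1.0, 0.0]
# [0.0, 1.0, 0.0, 0.0]
```

Each row has exactly one non-zero entry, at the final response token of that turn, and both entries equal the rollout's final reward.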