Question: What is the linking rule between intermediate rewards and the triplet? #392

@kyk666123


Hello, I encountered some issues when adding intermediate rewards to the SQL agent in the official example:

(1) When no link is specified, are the intermediate rewards emitted by emit_reward and the final reward returned by rollout linked by default to the closest triplet whose reward is None? In the SQL agent example, I analyze the check result in the check_query function according to certain rules and then call emit_reward (I only added emit_reward in the check function). When I print the trajectory of a sample's triplets, the rewrite step closest to the final_reward (which has a value of 1) also has a reward of 1, even though I never emitted a reward for the rewrite step. Why is this happening?

(2) Still in the SQL agent example: I only added emit_reward in check_query (with a value of 0.2, -0.2, or -0.1). I also modified the get_train_data_batch function in VERL's daemon.py to change what goes into reward_list. Originally only the final reward was appended; now I append the sum of the final reward and the intermediate rewards found in the trace. However, the training/reward curve in wandb, which plots np.mean(finished_id_to_final_reward.values()), now shows negative values. I did not change anything related to the final reward, which should only ever be 0 or 1, so this metric should never be negative. Why is this happening?

I made two main changes to the code.

  1. Extracted the intermediate rewards from the trace and saved them.
  2. Summed the intermediate and final rewards to form the actual reward of the triplet, then appended it to the reward_list.

step_reward = float(trace.get("step_reward", 0.0))
total_reward = final_reward + step_reward
reward_list.append(total_reward)
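
To make the arithmetic of this change concrete, here is a minimal standalone sketch. It is not the actual daemon.py code: the `total_rewards` helper and the list-of-dicts trace layout are hypothetical, and only the `step_reward` key is taken from the snippet above. It shows that the summed per-triplet reward can legitimately be negative whenever the final reward is 0 and the step reward is negative, so negative entries in reward_list are expected. What I do not understand is why the final-reward metric itself goes negative.

```python
from statistics import mean


def total_rewards(traces, final_reward):
    """Add the rollout's final reward to each trace's step reward.

    `traces` is assumed to be a list of dicts with an optional
    "step_reward" key (hypothetical layout, for illustration only).
    """
    reward_list = []
    for trace in traces:
        # Triplets without an emitted reward contribute 0.0.
        step_reward = float(trace.get("step_reward", 0.0))
        reward_list.append(final_reward + step_reward)
    return reward_list


# Hypothetical rollout: two penalized check_query steps, wrong final answer.
traces = [{"step_reward": -0.2}, {"step_reward": -0.1}, {}]
rewards = total_rewards(traces, final_reward=0.0)

print(rewards)            # per-triplet summed rewards
print(mean(rewards) < 0)  # the batch mean of the summed rewards is negative
```

With a final reward of 1, the same traces would give only positive sums, so the sign of the summed reward depends on both terms.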

Thank you for helping with my confusion!
