Question: What is the linking rule between intermediate rewards and the triplet?

Hello, I encountered some issues when adding intermediate rewards to the SQL agent in the official example:

(1) When no link is specified, are the intermediate rewards emitted by emit_reward and the final reward returned by rollout linked by default to the closest triplet whose reward is None? In the SQL agent example, I analyze the check result in the check_query function according to certain rules and then call emit_reward (I added emit_reward only in the check function). When I printed the trajectory of a sample, I noticed that the rewrite triplet closest to the final_reward (which has a value of 1) also received a reward of 1, even though I never added a reward to the rewrite step. Why is this happening?
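To make the rule I am asking about concrete, here is a toy model of it. This is only my guess at the behavior, not the library's actual implementation; the Triplet class and link_to_closest_unrewarded are hypothetical names I made up for illustration. If the rule worked this way, it would reproduce what I see: the final reward of 1 lands on the rewrite triplet because it is the closest one that still has a None reward.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Triplet:
    """Hypothetical stand-in for one step of a trajectory."""
    name: str
    reward: Optional[float] = None


def link_to_closest_unrewarded(triplets: List[Triplet], reward: float) -> Optional[Triplet]:
    """Guessed rule: attach the reward to the most recent triplet whose reward is None."""
    for triplet in reversed(triplets):
        if triplet.reward is None:
            triplet.reward = reward
            return triplet
    return None  # every triplet already has a reward


# Trajectory shaped like the SQL agent loop: write, check, rewrite.
trajectory = [Triplet("write"), Triplet("check"), Triplet("rewrite")]
trajectory[1].reward = 0.2  # intermediate reward emitted only in the check step

# The final reward (1.0) arrives at the end of the rollout with no explicit link.
linked = link_to_closest_unrewarded(trajectory, 1.0)
print(linked.name, [t.reward for t in trajectory])
```

Under this guessed rule, the check triplet keeps its own intermediate reward and the final reward falls through to the rewrite step, which matches the trajectory I printed.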

(2) Still in the SQL agent example: I added emit_reward only in check_query (with a value of 0.2, -0.2, or -0.1). I also modified the get_train_data_batch function in VERL's daemon.py to change how reward_list is built. Originally, only the final reward was appended; I changed it to append the sum of the final reward and the intermediate rewards in the trace. However, the training/reward curve in wandb (whose value is np.mean(finished_id_to_final_reward.values())) now shows negative numbers, even though I haven't changed anything related to the final reward, which should only be 0 or 1, so negative numbers shouldn't appear. Why is this happening?

I made two main changes to the code.

1. Extracted the intermediate rewards from the trace and saved them.
2. Summed the intermediate and final rewards to form the actual reward of the triplet, then appended it to the reward_list.

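# Intermediate reward saved on the trace (0.0 when the trace has none)
step_reward = float(trace.get("step_reward", 0.0))
# Reward used for the triplet: final reward plus intermediate reward
total_reward = final_reward + step_reward
reward_list.append(total_reward)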
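For reference, here is the arithmetic of the summed reward on its own, outside VERL. It is a minimal sketch that assumes exactly one intermediate reward per triplet, using the values listed above; the batch list is made-up data. It only shows which sums can be negative. Whether the wandb metric actually reads these summed values is exactly what I am unsure about.

```python
from itertools import product
from statistics import mean

final_rewards = [0.0, 1.0]        # the final reward is only ever 0 or 1
step_rewards = [0.2, -0.2, -0.1]  # values passed to emit_reward in check_query

# Every possible total_reward = final_reward + step_reward
totals = {(f, s): f + s for f, s in product(final_rewards, step_rewards)}

# Combinations whose sum drops below zero
negative = {pair: total for pair, total in totals.items() if total < 0}
print(negative)

# A made-up batch in which every rollout fails and every check is penalized
batch = [0.0 + -0.2, 0.0 + -0.1, 0.0 + -0.2]
print(mean(batch))
```

So the summed value itself can be negative whenever the final reward is 0 and the check penalty is -0.2 or -0.1, and a batch made mostly of such samples has a negative mean.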

Thank you for helping with my confusion!

Question: What is the linking rule between intermediate rewards and the triplet? #392
