Description
Hello, I encountered some issues when adding intermediate rewards to the SQL agent in the official example:
(1) When no link is specified, are the intermediate rewards emitted by emit_reward and the final reward returned by rollout linked by default to the closest triplet whose reward is None? In the SQL agent example, I added emit_reward only in the check_query function, where I score the check result according to a few rules. When I print the trajectory of a sample, the rewrite triplet closest to the final reward (which has a value of 1) also has a reward of 1, even though I never emitted a reward for the rewrite step. Why is this happening?
(2) Still in the SQL agent example: I added emit_reward only in check_query (with a value of 0.2, -0.2, or -0.1). I also modified the get_train_data_batch function in daemon.py of VERL to change what goes into reward_list. Originally only the final reward was appended; now I append the sum of the final reward and the intermediate rewards from the trace. However, the training/reward curve in wandb (computed as np.mean(finished_id_to_final_reward.values())) shows negative values. I have not changed anything related to the final reward, which should only be 0 or 1, so I would not expect this curve to go negative. Why is this happening?
I made two main changes to the code.
- Extracted the intermediate rewards from the trace and saved them
- Summed the intermediate and final rewards to form the actual reward of the triplet, then appended it to the reward_list.
step_reward = float(trace.get("step_reward", 0.0))
total_reward = final_reward + step_reward
reward_list.append(total_reward)
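To make the two changes concrete, here is a minimal, self-contained sketch of the arithmetic. It is not the actual get_train_data_batch code: the combine_rewards helper and the shape of the trace dicts (a plain dict with an optional "step_reward" key) are assumptions made only for illustration.

```python
def combine_rewards(traces, final_reward):
    """Sum the final reward with each trace's intermediate reward.

    `traces` is assumed (hypothetically) to be a list of dicts that may
    carry a "step_reward" key; a missing key counts as 0.0.
    """
    reward_list = []
    for trace in traces:
        step_reward = float(trace.get("step_reward", 0.0))
        total_reward = final_reward + step_reward
        reward_list.append(total_reward)
    return reward_list


# Hypothetical traces: two check_query steps with emitted rewards,
# and one step (e.g. a rewrite) with no emitted reward.
traces = [{"step_reward": -0.2}, {"step_reward": 0.2}, {}]

failed = combine_rewards(traces, final_reward=0.0)
succeeded = combine_rewards(traces, final_reward=1.0)

print(failed)
print(succeeded)
```

With a final reward of 0 and a step reward of -0.2 or -0.1, the summed value in reward_list is negative. That much is expected from the change itself; what I cannot explain is why the wandb curve, which I understood to be based on the final reward alone, is also negative.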
Thank you for helping with my confusion!