New Algorithm: MAPO implementation #388
Conversation
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Ready for review.
```python
# Since the advantage is the same for every token within a trajectory,
# we can read the trajectory-level advantage from the first token,
# based on the assumption that the advantages along the last dim are identical.
advantages_ = advantages[:, 0]  # shape: [batch_size * group_size]
```
Reviewer: This line does not have any effect and should be removed.

Author: No, it is useful. Please see the code comment.

Author: The advantage of the first token is extracted and used in the logic below.
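To make the assumption in this thread concrete, here is a minimal runnable sketch; the shapes and tensor values are hypothetical, not taken from the PR:

```python
import torch

# Hypothetical shapes: 4 trajectories (batch_size * group_size), 8 tokens each.
# Within a trajectory, every token carries the same advantage value.
advantages = torch.randn(4, 1).expand(4, 8)

# Because the advantage is constant along the token dim, the first token
# alone recovers the trajectory-level advantage.
advantages_ = advantages[:, 0]  # shape: [batch_size * group_size]
assert torch.allclose(advantages_, advantages[:, -1])
```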
```python
return (
    1 - trajectory_reweight
) * deviation_base_norm + trajectory_reweight * mean_base_norm
```
Reviewer: Double-check the formula. Since trajectory_reweight is computed as 4p(1-p) rather than 1 - 4p(1-p), shouldn't the weighting of these two norms be reversed?

Author: My mistake. Thank you.
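Following the resolution above, the corrected blend would look like the sketch below. Only the two norm tensors and trajectory_reweight come from the diff; `p` (the within-group success rate) and the function wrapper are assumptions for illustration:

```python
import torch

def mix_norms(p: torch.Tensor,
              deviation_base_norm: torch.Tensor,
              mean_base_norm: torch.Tensor) -> torch.Tensor:
    # 4p(1-p) peaks at p = 0.5 (uncertain groups) and vanishes as p -> 0 or 1.
    trajectory_reweight = 4 * p * (1 - p)
    # Reversed weighting, as agreed in the thread: the deviation-based norm
    # takes the 4p(1-p) weight rather than 1 - 4p(1-p).
    return (
        trajectory_reweight * deviation_base_norm
        + (1 - trajectory_reweight) * mean_base_norm
    )
```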
This pull request has been automatically marked as stale because it has not had recent activity within the last 14 days. Please add a comment or push new commits to keep it active. Thank you for your contribution!
This is the implementation of the MAPO paper.
It doesn't require much code refactoring; still in progress.
More "ready-to-go" algorithms and examples are the key to making a repo popular, especially for RL.