weixin_39629352
2020-12-09 02:30

PPO training

Hi, thanks so much for sharing. I have a quick question: when training the policy with PPO (https://github.com/openai/lm-human-preferences/blob/f774e89b9e3762a03b7f3953189861979bee1775/lm_human_preferences/train_policy.py#L341), aren't you using the same policy that generated the rollouts (https://github.com/openai/lm-human-preferences/blob/f774e89b9e3762a03b7f3953189861979bee1775/lm_human_preferences/train_policy.py#L285), except that during training you divide the logits by the temperature? Maybe I am missing something, but to compute the PPO loss we need the ratio pi(theta)/pi(old), as in https://github.com/openai/lm-human-preferences/blob/f774e89b9e3762a03b7f3953189861979bee1775/lm_human_preferences/train_policy.py#L351, yet it seems the old and new policies are the same. I really appreciate your answer.
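For reference, here is a minimal NumPy sketch of the standard clipped PPO objective being asked about (this is not the repository's actual TensorFlow code; the function and argument names are my own). The point it illustrates: the old log-probabilities are recorded once at rollout time and held fixed, while the new log-probabilities are recomputed on every optimization step, so the ratio moves away from 1 after the first gradient update even though both come from the "same" policy network.

```python
import numpy as np

def ppo_policy_loss(new_logprobs, old_logprobs, advantages, cliprange=0.2):
    """Clipped PPO surrogate loss (illustrative sketch only).

    new_logprobs: log pi_theta(a|s) under the current parameters,
                  recomputed at every optimization step.
    old_logprobs: log pi_old(a|s) recorded once at rollout time and
                  held fixed for all PPO epochs over the same batch.
    advantages:   advantage estimates for the sampled actions.
    """
    ratio = np.exp(new_logprobs - old_logprobs)  # pi_theta / pi_old
    # On the very first gradient step the ratio is exactly 1; after the
    # parameters move, new_logprobs change while old_logprobs do not,
    # so the ratio (and the clipping) becomes meaningful.
    unclipped = -advantages * ratio
    clipped = -advantages * np.clip(ratio, 1.0 - cliprange, 1.0 + cliprange)
    return np.maximum(unclipped, clipped).mean()
```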

This question comes from the open-source project: openai/lm-human-preferences


5 answers
