weixin_39703982
2020-12-07 07:34

Possible flaws in PPO implementation

The current implementation of PPO may contain some flaws. We noticed that it is nearly impossible to train a policy that uses a deep net such as ResNet (18, 34) as a preprocessing layer, whereas a similar setup works with the PPO2 implementation from OpenAI Baselines.

According to this paper, https://openreview.net/pdf?id=r1etN1rtPB, there are several important optimizations that are not described in the original PPO paper but are present in the Baselines implementation. One of them is value clipping.

Here is what TF-Agents' PPO currently does for the value estimation loss:


# Get the predicted value for the collected observations
value_preds, _ = self._collect_policy.apply_value_network(
        time_steps.observation,
        time_steps.step_type,
        value_state=value_state,
        training=training)
# Unclipped squared error against the returns
value_estimation_error = tf.math.squared_difference(returns, value_preds)
value_estimation_error *= weights

By contrast, here is the Baselines version:


# Clip the value to reduce variability during Critic training
# Get the predicted value
vpred = train_model.vf
vpredclipped = OLDVPRED + tf.clip_by_value(train_model.vf - OLDVPRED, - CLIPRANGE, CLIPRANGE)
# Unclipped value
vf_losses1 = tf.square(vpred - R)
# Clipped value
vf_losses2 = tf.square(vpredclipped - R)

vf_loss = .5 * tf.reduce_mean(tf.maximum(vf_losses1, vf_losses2))

This difference may well explain why we failed to train agents that use deep nets.

Having this option in TF-Agents' PPO would be interesting.
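
For reference, here is a minimal sketch of what such a clipped value loss could look like, written with plain TensorFlow ops on the quantities already used in the TF-Agents loss above (value_preds, returns, weights). The old_value_preds tensor and the clip_range argument are assumptions for illustration: they stand for the value predictions recorded at data-collection time and the same clipping parameter used for the policy ratio, and are not part of the current TF-Agents API.


import tensorflow as tf

def clipped_value_estimation_loss(value_preds, old_value_preds, returns,
                                  weights, clip_range):
  # Keep the new value predictions within +/- clip_range of the
  # predictions made at collection time (old_value_preds is a
  # hypothetical input, not part of the current TF-Agents loss).
  value_preds_clipped = old_value_preds + tf.clip_by_value(
      value_preds - old_value_preds, -clip_range, clip_range)
  # Unclipped and clipped squared errors against the returns.
  unclipped_error = tf.math.squared_difference(returns, value_preds)
  clipped_error = tf.math.squared_difference(returns, value_preds_clipped)
  # Pessimistic (element-wise maximum) error, as in the Baselines PPO2 loss.
  value_estimation_error = tf.maximum(unclipped_error, clipped_error)
  value_estimation_error *= weights
  return 0.5 * tf.reduce_mean(value_estimation_error)

The final reduce_mean follows the Baselines snippet; an actual TF-Agents integration would presumably reuse whatever aggregation the existing value_estimation_loss already applies to value_estimation_error.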

This question comes from the open-source project: tensorflow/agents
