Possible flaws in PPO implementation
The current implementation of PPO may contain some flaws. We notice that it is nearly impossible to train a policy that uses a deep net such as ResNet (18, 34) as a preprocessing layer, while a similar setup works with the PPO2 implementation from OpenAI Baselines.
According to this paper https://openreview.net/pdf?id=r1etN1rtPB there are some important optimizations not mentioned in the original paper but present in the Baselines implementation. One of them is value clipping.
Currently, here is what we have in TF-Agents' PPO for the value estimation loss:
```python
value_preds, _ = self._collect_policy.apply_value_network(
    time_steps.observation, time_steps.step_type,
    value_state=value_state, training=training)
value_estimation_error = tf.math.squared_difference(returns, value_preds)
value_estimation_error *= weights
```
In contrast, here is the Baselines version:
```python
# Clip the value to reduce variability during Critic training
# Get the predicted value
vpred = train_model.vf
vpredclipped = OLDVPRED + tf.clip_by_value(train_model.vf - OLDVPRED, -CLIPRANGE, CLIPRANGE)
# Unclipped value
vf_losses1 = tf.square(vpred - R)
# Clipped value
vf_losses2 = tf.square(vpredclipped - R)
vf_loss = .5 * tf.reduce_mean(tf.maximum(vf_losses1, vf_losses2))
```
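Here the new value prediction is clipped to stay within CLIPRANGE of the prediction recorded at rollout time (OLDVPRED), and the loss takes the element-wise maximum of the clipped and unclipped squared errors, which limits how far the value function can move in a single update.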
This difference may well explain why we failed to train agents that use deep nets.
Having this option in TF-Agents' PPO would be interesting.
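For illustration, here is a minimal sketch of how value clipping could be folded into the TF-Agents snippet above. The names `old_value_preds` (value predictions saved at collection time) and `value_clip` are assumptions for this sketch, not part of the current TF-Agents API:

```python
# Sketch only: assumes `old_value_preds` (predictions recorded at collection
# time) and a `value_clip` hyperparameter are available alongside the
# existing variables. Not the actual TF-Agents implementation.
value_preds, _ = self._collect_policy.apply_value_network(
    time_steps.observation, time_steps.step_type,
    value_state=value_state, training=training)

# Unclipped squared error, as in the current implementation.
unclipped_error = tf.math.squared_difference(returns, value_preds)

# Clip the new predictions to stay within `value_clip` of the old ones,
# mirroring the Baselines ppo2 value clipping.
clipped_preds = old_value_preds + tf.clip_by_value(
    value_preds - old_value_preds, -value_clip, value_clip)
clipped_error = tf.math.squared_difference(returns, clipped_preds)

# Pessimistic element-wise maximum of the two errors, weighted as before.
value_estimation_error = tf.maximum(unclipped_error, clipped_error)
value_estimation_error *= weights
```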