I've always wondered what policy entropy really means in the context of PPO. From other posts (e.g. here by AurelianTactics - btw very cool guy saw him on the official reddit RL Discord) I know that it should continuously go down. It also corresponds to the exploration the PPO agent is taking --- high entropy means high exploration.
In this case I will be looking at continuous action spaces. Here, each action in the action space is represented by a gaussian distribution. The policy is then called a gaussian policy. In my case, I have two continuous actions (for more information, check out the post Emergent Behaviour in Multi Agent Reinforcement Learning - Independent PPO).
Now, entropy for a gaussian distribution is defined as follows: . What baselines PPO does now is to simply sum the entropies across the actions axis (e.g. given a batch of 700 actions s.t. the shape is (700,2), this becomes (700,)) and then take the mean over the batch. Their inner code is equal to the definition above since . For reference, this is using the natural log.
This is a single entropy calculation:
tf.reduce_sum(self.logstd + .5 * np.log(2.0 * np.pi * np.e), axis=-1)
And this is taking the mean over the whole batch:
entropy = tf.reduce_mean(pd.entropy())
The right side (
.5 * np.log(2.0 * np.pi * np.e)) ends up being roughly 1.41.
You can now deduce, given an entropy reading, how uncertain (and thus
exploratory) each action is. High entropy equals high exploration and vice
versa. To quantify the above, check out the figure below:
You can see that at roughly variance = 0.242 entropy is 0.
This is how policy entropy looks for me when training two predator PPOs to catch prey:
As you can see, exploration decreases continuously. It starts with variance = 1 for both actions and ends up at variance = 0.3 after training. (starts at 2.82 = 1.41 * 2, which is incidentally the entropy for 2 actions summed up, given variance of both is 1). Nice!
Update: Some notes regarding why we take the mean.
One could argue that the
logstd does not change over the 700 batch, since the
policy did not change. And that is true, we actually get 700 times the same
entropy number. The reason why every PPO implementation takes the mean here is
First of all, check out the loss PPO optimizes (see paper):
The blue part is the entropy, and observe how it is in the expectation , for which you usually take the mean.
Secondly, autograd. One optimization idea would be to just take the first entropy number and throw away the others (so an indexing operation instead of the mean), but that would not play well with autograd, which uses the gradients to update all parameters (including the logstd parameter and thus your policy itself) in the backwards pass. Calculating the gradient of a mean is easy, but the gradient of an indexing operation? The key word here is 'differentiable indexing' and it is possible, but you only update part of your parameters (the part you indexed). In this case my hunch is that you end up only updating your policy once instead of 700 times, which defeats the purpose of a 700 batch. At that point you should just do singular updates and no batching.