## PPO - a Note on Policy Entropy in Continuous Action Spaces

I've always wondered what policy entropy really means in the context of PPO. From other posts (e.g. this one by AurelianTactics, who, by the way, is a very cool guy; I saw him on the official Reddit RL Discord) I know that it should go down continuously over training. It also corresponds to the amount of exploration the PPO agent is doing: high entropy means high exploration.

In this post I will be looking at continuous action spaces. Here, each action in the action space is represented by a Gaussian distribution, and the policy is then called a Gaussian policy. In my case, I have two continuous actions (for more information, check out the post Emergent Behaviour in Multi Agent Reinforcement Learning - Independent PPO).

Now, the entropy of a Gaussian distribution is defined as follows: $\frac{1}{2} \log(2\pi{}e\sigma{}^2)$. What baselines' PPO does is simply sum the entropies across the action axis (e.g. given a batch of 700 two-dimensional actions, so that the shape is (700, 2), this becomes (700,)) and then take the mean over the batch. The expression in their code is equal to the definition above, since $\log(ab) = \log(a) + \log(b)$ gives $\log\sigma + \frac{1}{2}\log(2\pi{}e) = \frac{1}{2}\log(2\pi{}e\sigma^2)$. For reference, this uses the natural log.
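To make the identity and the shape bookkeeping concrete, here is a small NumPy sketch. The (700, 2) batch shape matches the example above; the zero logstd values (i.e. $\sigma = 1$) are an arbitrary choice for illustration:

```python
import numpy as np

# Entropy of a 1-D Gaussian: 0.5 * log(2*pi*e*sigma^2)
def gaussian_entropy(sigma):
    return 0.5 * np.log(2.0 * np.pi * np.e * sigma**2)

# Equivalent form used inside baselines, since
# log(2*pi*e*sigma^2) = 2*log(sigma) + log(2*pi*e):
sigma = 0.5
assert np.isclose(gaussian_entropy(sigma),
                  np.log(sigma) + 0.5 * np.log(2.0 * np.pi * np.e))

# Shape bookkeeping for a batch of 700 two-dimensional actions
# (zero logstd, i.e. sigma = 1, is an arbitrary illustrative choice):
logstd = np.zeros((700, 2))
per_action = logstd + 0.5 * np.log(2.0 * np.pi * np.e)  # shape (700, 2)
per_sample = per_action.sum(axis=-1)                    # shape (700,)
entropy = per_sample.mean()                             # scalar
```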

This is a single entropy calculation:

```python
tf.reduce_sum(self.logstd + .5 * np.log(2.0 * np.pi * np.e), axis=-1)
```


And this is taking the mean over the whole batch:

```python
entropy = tf.reduce_mean(pd.entropy())
```


The constant term (.5 * np.log(2.0 * np.pi * np.e)) ends up being roughly 1.41. Given an entropy reading, you can now deduce how uncertain (and thus exploratory) each action is: high entropy equals high exploration and vice versa. To quantify the above, check out the figure below: you can see that entropy hits 0 at roughly $\sigma = 0.242$ (this is the standard deviation; the corresponding variance is $1/(2\pi{}e) \approx 0.059$, since $\frac{1}{2}\log(2\pi{}e\sigma^2) = 0$ exactly when $\sigma^2 = \frac{1}{2\pi{}e}$).
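As a sanity check on these numbers, a small NumPy sketch that inverts the entropy formula (the helper name sigma_from_entropy is mine, not from baselines):

```python
import numpy as np

HALF_LOG_2PIE = 0.5 * np.log(2.0 * np.pi * np.e)  # the ~1.41 constant

# Inverting H = log(sigma) + 0.5*log(2*pi*e) recovers sigma
# from an entropy reading:
def sigma_from_entropy(h):
    return np.exp(h - HALF_LOG_2PIE)

# Entropy is exactly 0 when sigma = 1/sqrt(2*pi*e) ~ 0.242:
assert np.isclose(sigma_from_entropy(0.0), 1.0 / np.sqrt(2.0 * np.pi * np.e))
# And an entropy reading of ~1.41 corresponds to sigma = 1:
assert np.isclose(sigma_from_entropy(HALF_LOG_2PIE), 1.0)
```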

This is how policy entropy looks for me when training two predator PPOs to catch prey: as you can see, exploration decreases continuously. It starts with variance = 1 for both actions and ends up at variance = 0.3 after training. (It starts at 2.82 ≈ 1.41 * 2, which is exactly the entropy for 2 actions summed up, given that both have variance 1.) Nice!

Update: Some notes regarding why we take the mean.

One could argue that the logstd does not change across the batch of 700, since the policy did not change within it. And that is true: we actually get the same entropy number 700 times. The reason why every PPO implementation still takes the mean here is two-fold.
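A quick NumPy illustration of that point (the zero logstd is, again, an arbitrary illustrative choice):

```python
import numpy as np

# With a state-independent logstd, every sample in the batch gets
# the same summed entropy value:
logstd = np.zeros(2)
single = (logstd + 0.5 * np.log(2.0 * np.pi * np.e)).sum()
batch_entropies = np.full(700, single)  # 700 identical copies

# Taking the mean over the batch therefore just returns that value:
assert np.isclose(batch_entropies.mean(), single)
```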

First of all, check out the loss PPO optimizes (see paper):

$L_t^{CLIP+VF+H}(\theta{}) = \mathbb{E}_t[L_t^{CLIP}(\theta{}) - c_1L_t^{VF}(\theta{}) + \textcolor{blue}{c_2H[\pi_{\theta}](s_t)}]$

The blue part is the entropy bonus, and observe how it sits inside the expectation $\mathbb{E}_t$, which in practice is estimated by taking the mean over the batch.
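A minimal sketch of how that expectation shows up in code, loosely following common implementations (the function name and the c1/c2 defaults are my assumptions, not values from the post or an exact reproduction of baselines):

```python
import numpy as np

def ppo_objective(l_clip, l_vf, entropy, c1=0.5, c2=0.01):
    """Combined PPO objective for a batch of T timesteps.

    All inputs are per-timestep arrays of shape (T,). The E_t from
    the paper becomes an empirical mean over the batch, which is why
    the entropy term is averaged as well. The c1/c2 defaults here are
    common choices, not prescribed values.
    """
    return np.mean(l_clip - c1 * l_vf + c2 * entropy)
```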