I've always wondered what policy entropy really means in the context of PPO. From other posts (e.g. here by AurelianTactics; btw, very cool guy, saw him on the official reddit RL Discord) I know that it should go down steadily over the course of training. It also corresponds to how much the PPO agent explores: high entropy means high exploration.
In this case I will be looking at continuous action spaces. Here, each action dimension is modeled by its own Gaussian distribution. The policy is then called a Gaussian policy. In my case, I have two continuous actions (for more information, check out the post Emergent Behaviour in Multi Agent Reinforcement Learning - Independent PPO).
Now, the entropy of a Gaussian distribution is defined as follows: $H = \frac{1}{2} \ln(2 \pi e \sigma^2)$. What baselines PPO does is simply sum the entropies across the action axis (e.g. shape (700,2) becomes (700,)) and then take the mean over the batch. Their inner code is equal to the definition above since $\frac{1}{2} \ln(2 \pi e \sigma^2) = \ln(\sigma) + \frac{1}{2} \ln(2 \pi e)$. For reference, this is using the natural log.
tf.reduce_sum(self.logstd + .5 * np.log(2.0 * np.pi * np.e), axis=-1)
The right side (.5 * np.log(2.0 * np.pi * np.e)) ends up being roughly 1.42 (1.4189 to four decimals).
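To check the identity numerically, here is a minimal sketch in plain Python (standard library only, so it is not the baselines code itself). The function names and the three-sample batch are made up for illustration; a real batch would have a shape like (700, 2).

```python
import math

# 0.5 * ln(2*pi*e): the constant "right side" of the entropy expression
ENTROPY_CONST = 0.5 * math.log(2.0 * math.pi * math.e)


def entropy_from_logstd(logstd_batch):
    """Per-sample entropy: sum (logstd + const) over the action axis."""
    return [sum(ls + ENTROPY_CONST for ls in row) for row in logstd_batch]


def entropy_from_definition(std_batch):
    """Same quantity from the textbook form 0.5 * ln(2*pi*e*sigma^2)."""
    return [
        sum(0.5 * math.log(2.0 * math.pi * math.e * s * s) for s in row)
        for row in std_batch
    ]


# Hypothetical batch: 3 samples with 2 actions each
stds = [[1.0, 1.0], [0.5, 0.8], [0.3, 0.3]]
logstds = [[math.log(s) for s in row] for row in stds]

# One entropy value per sample, i.e. shape (3, 2) becomes (3,)
per_sample = entropy_from_logstd(logstds)
# Averaging over the batch gives the single scalar that ends up in the logs
mean_entropy = sum(per_sample) / len(per_sample)

print(round(ENTROPY_CONST, 4))
print([round(h, 4) for h in per_sample])
print(round(mean_entropy, 4))
```

Both functions should agree up to floating point error, which is the whole point: adding the log of the standard deviation to a constant is the same as evaluating the closed-form entropy.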
You can now deduce, given an entropy reading, how uncertain (and thus
exploratory) each action is. High entropy equals high exploration and vice
versa. To quantify the above, check out the figure below:
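You can see that entropy is 0 at a standard deviation of roughly 0.242 (a variance of about 0.059). That follows from solving $\ln(\sigma) + \frac{1}{2} \ln(2 \pi e) = 0$, which gives $\sigma = 1 / \sqrt{2 \pi e}$.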
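Going the other way, from an entropy reading back to a spread per action, can be sketched like this. It assumes every action shares the same standard deviation, which is a simplification (the summed entropy alone cannot tell you how the spread is split across actions). `std_from_entropy` is a hypothetical helper, not something from baselines.

```python
import math

ENTROPY_CONST = 0.5 * math.log(2.0 * math.pi * math.e)


def std_from_entropy(total_entropy, n_actions):
    """Invert H = n * (ln(sigma) + const), assuming one shared sigma."""
    per_action = total_entropy / n_actions
    return math.exp(per_action - ENTROPY_CONST)


# Standard deviation at which a single action has zero entropy
zero_entropy_std = std_from_entropy(0.0, 1)

print(round(zero_entropy_std, 3))       # standard deviation
print(round(zero_entropy_std ** 2, 4))  # corresponding variance
```

Note that the zero crossing is a statement about the standard deviation. The variance at that point is its square, so it is considerably smaller.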
This is how policy entropy looks for me when training two predator PPOs to catch prey:
As you can see, exploration decreases continuously. It starts with a standard deviation of 1 for both actions and ends up at roughly 0.3 after training. (The curve starts at about 2.84, i.e. twice 0.5 * ln(2πe) ≈ 1.42, which is exactly the summed entropy of 2 actions when both have a standard deviation of 1.) Nice!
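As a rough sanity check on the two ends of the curve, the summed entropy for 2 actions can be worked out by hand. Whether the final 0.3 is a standard deviation or a variance changes the result, so both readings are shown here as assumptions rather than as values read off the plot.

```latex
% Start of training: \sigma = 1 for both actions
H_{\text{start}} = 2\left(\ln 1 + \tfrac{1}{2}\ln(2\pi e)\right) \approx 2 \times 1.419 \approx 2.84

% End of training, reading 0.3 as a standard deviation (\sigma = 0.3)
H_{\text{end}} = 2\left(\ln 0.3 + \tfrac{1}{2}\ln(2\pi e)\right) \approx 2 \times 0.215 \approx 0.43

% End of training, reading 0.3 as a variance (\sigma^2 = 0.3)
H_{\text{end}} = 2 \cdot \tfrac{1}{2}\ln(2\pi e \cdot 0.3) \approx 2 \times 0.817 \approx 1.63
```

Comparing the last logged entropy value against these two numbers is a quick way to tell which reading matches a given run.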