Policy Gradient Baseline

# Policy Gradient Baseline - The Intuition

If you are in a bad state, the agent should still try and do the best action given the circumstances. If you don't use baseline, you will get a pretty bad reward just because you're in a bad state. However, we want to reward good actions in bad states, still.

Example: Lets take $V(s) = \mathbb{E}_{\pi}[G_t | s]$ for state $s$ as our baseline $b.$ Basically you take the mean returns of all possible actions at $s.$ You would expect the return of your action to be slightly better or worse than $b.$ So if $V(s)$ = 5 and reward of our action = 4: 4-5=-1. If $V(s)$ = -5 and reward of our action = -6: -6-(-5)=-1. So it's two actions that give wildly different returns as is (4 vs -6) but in the context of their situation they are only a bit bad (-1). Without baseline the second action would be extremely bad (-6) even though given the context it is only slightly bad (-1).

Essentially this is like centering our data.

By the way, $V(s)$ can come from a neural network that has learnt it.

Published on 2020-03-08