jrnl · home about list

# Policy Gradient Baseline - The Intuition

If you are in a bad state, the agent should still try and do the best action given the circumstances. If you don't use baseline, you will get a pretty bad reward just because you're in a bad state. However, we want to reward good actions in bad states, still.

Example: Lets take V(s)=Eπ[Gts]V(s) = \mathbb{E}_{\pi}[G_t | s] for state ss as our baseline b.b. Basically you take the mean returns of all possible actions at s.s. You would expect the return of your action to be slightly better or worse than b.b. So if V(s)V(s) = 5 and reward of our action = 4: 4-5=-1. If V(s)V(s) = -5 and reward of our action = -6: -6-(-5)=-1. So it's two actions that give wildly different returns as is (4 vs -6) but in the context of their situation they are only a bit bad (-1). Without baseline the second action would be extremely bad (-6) even though given the context it is only slightly bad (-1).

Essentially this is like centering our data.

By the way, V(s)V(s) can come from a neural network that has learnt it.

Published on