LOLA is the shorthand for
Learning with Opponent-Learning Awareness.
Let's get some intuition.
First of all, policy gradient, what is that. Well, you repeatedly update the policy in the direction the total expected reward is going when following the policy in a given state s (resulting in some action a). Wew what a sentence !!! So, let's assume our policy is represented by something that can be parameterized, such as a neural net with softmax output (so you have a probability for each action). Well, obviously by watching the total reward go up and down when taking different actions you can adjust the weights of that neural net.
In LOLA when you update your policy gradient you don't just take into account the world as is (statically). So in a static scenario your value function returns the expected value given two parameters (one for each agent). And you simply update your own parameter. But the other parameter is static.. The other agent is inactive. But in a dynamic scenario you take into account how you would change the update step of the other agent. You suddenly notice what the other agent is learning based on what you're doing. So the total reward now changes based on your own actions, but also on what effect that has on the other agent's learning.
Check out the official video at roughly 12:05. LOLA learning rule is defined as:
(our agents parameter update) + (How does our agent change the learning step of the other agent) * (How much our agents return depends on the other agents learning step)
This post is a WIP. As I get more intuition I will update this post.