
# Emergent Behaviour in Multi-Agent Reinforcement Learning - Independent PPO

I've been trying to get a bunch of sharks to cooperate in hunting using independent learning, and it has been quite a blast. Specifically, I've adapted the baselines repo to add multi-agent (MARL) support and a few other extensions to PPO: https://github.com/Instance-contrib/baselines/tree/tf2-instance.
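To make "independent" concrete: each shark has its own PPO network and rollout buffer, and is updated only on its own experience. Below is a minimal, runnable sketch of that wiring with the PPO internals stubbed out; the names here are illustrative, and the real training code is in the baselines fork above.

```python
import numpy as np

class StubLearner:
    """Stands in for one PPO learner (policy network + optimizer).
    The actual PPO update lives in the baselines fork linked above."""
    def act(self, obs):
        return np.random.randint(4)  # placeholder: uniform over 4 discrete actions
    def update(self, trajectory):
        pass  # a real learner would run the PPO clipped-surrogate update here

n_sharks = 2
learners = [StubLearner() for _ in range(n_sharks)]  # one network per shark
buffers = [[] for _ in range(n_sharks)]              # one rollout buffer per shark

for step in range(128):
    # each shark acts on its own partial observation (random stand-ins here)...
    observations = [np.random.randn(8) for _ in range(n_sharks)]
    actions = [l.act(o) for l, o in zip(learners, observations)]
    rewards = [0.0] * n_sharks  # the environment would supply these
    for i in range(n_sharks):
        buffers[i].append((observations[i], actions[i], rewards[i]))

# ...and is updated only on its own experience: no shared weights or gradients
for i in range(n_sharks):
    learners[i].update(buffers[i])
```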

Now, the environment consists of fishes (green circles) and sharks (orange circles):

Fishes use a pretty good static algorithm to evade sharks, while sharks are trained independently using PPO. The environment is a POMDP: the observation space only includes the nearest few fishes and sharks. The reward per eaten fish is 10. If there are no sharks in sight and the time is right, fishes will also procreate. Thus, with smart ecosystem management the sharks could eat indefinitely.
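As a sketch of what such a partial observation might look like: a shark sees only the relative positions of the k nearest fishes and other sharks, zero-padded when fewer are in sight. The function and the k values below are my illustration, not the repo's actual code.

```python
import numpy as np

def build_observation(shark_pos, fish_positions, other_shark_positions,
                      k_fish=3, k_sharks=2):
    """Partial observation for one shark: relative positions of the k nearest
    fishes and the k nearest other sharks, zero-padded when fewer are in sight.
    The k values are assumptions, not the repo's actual settings."""
    def nearest_relative(points, k):
        points = np.asarray(points, dtype=float).reshape(-1, 2)
        rel = points - np.asarray(shark_pos, dtype=float)
        order = np.argsort(np.linalg.norm(rel, axis=1))[:k]
        sel = rel[order]
        if len(sel) < k:
            sel = np.vstack([sel, np.zeros((k - len(sel), 2))])
        return sel
    return np.concatenate([nearest_relative(fish_positions, k_fish).ravel(),
                           nearest_relative(other_shark_positions, k_sharks).ravel()])

# e.g. a shark at the origin seeing two fishes and one other shark
obs = build_observation((0.0, 0.0), [(0.2, 0.1), (0.9, 0.9)], [(0.5, -0.5)])
```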

A few new additions were made: not only are there two or three sharks instead of just one, there is also starvation and thus pressure to eat. There is also a stun move that lets one shark stun another for a few seconds, and a killzone: if two sharks are in that zone when one of them kills a fish, the reward is split (so each shark gets 5). The question is: what do the sharks learn to do? Will they greedily swim around, stun each other, and steal fishes until none are left? Or will they learn ecosystem management together, i.e. leave the fish population intact so that they can keep eating for a long time? Will they learn to hunt together in some settings?
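The reward-splitting rule is simple enough to sketch. A minimal, hypothetical version in Python (the killzone radius and the assumption that the zone is centered on the kill are mine, not from the repo):

```python
import numpy as np

KILL_REWARD = 10.0      # reward per eaten fish (from the environment description)
KILLZONE_RADIUS = 0.2   # assumed value; the post does not state the radius

def kill_rewards(killer_pos, other_shark_pos):
    """Reward assignment when one shark eats a fish: if the other shark is
    inside the killzone around the kill, the reward is split 5/5; otherwise
    the killer keeps all 10."""
    dist = np.linalg.norm(np.asarray(killer_pos) - np.asarray(other_shark_pos))
    if dist <= KILLZONE_RADIUS:
        return KILL_REWARD / 2, KILL_REWARD / 2
    return KILL_REWARD, 0.0
```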

The chart below shows that indeed ecosystem management, or what I call herding, is possible.

This is the left-over fish population at the end of an episode, over all training steps. One can see that while at the beginning the sharks aren't able to hunt well yet (section a), in section b they learn to greedily eat all fishes (and become quite proficient at it). What is interesting is that in section c they learn to keep 2 fishes around. And that is with two independent neural networks (PPOs)! Quite awesome.
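For reference, the metric behind this chart is cheap to log: record how many fishes are still alive whenever an episode terminates, then smooth the series for plotting. A small sketch, with names of my own choosing:

```python
import numpy as np

def smooth(episode_fish_counts, window=50):
    """Moving average over the per-episode leftover-fish counts."""
    series = np.asarray(episode_fish_counts, dtype=float)
    if len(series) < window:
        return series
    return np.convolve(series, np.ones(window) / window, mode="valid")

# the training loop would append at every episode end, e.g.:
# episode_fish_counts.append(len(env.fishes))  # `env.fishes` is assumed
```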

The second awesome emergent behaviour is cooperation. One way of creating it is reward shaping, but I did not go down that path. Instead, I forced cooperation by making it much tougher to hunt: the sharks are now slower by half, which makes it pretty much impossible to catch a fish alone. The next chart shows how two sharks learn to hunt from two sides to catch fish. This is completely learned behaviour, in spite of the split killzone reward and the stun move!
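In config terms, the cooperative setting differs from the baseline only in shark speed. The key names below are assumptions about the environment config, and the baseline speed of 0.06 is inferred from the sweep below, where 0.03 "effectively halves the speed":

```python
# Illustrative config deltas; key names are assumptions, not the repo's actual keys.
BASELINE = {"n_sharks": 2, "shark_speed": 0.06, "fish_reward": 10.0}
FORCED_COOPERATION = {**BASELINE, "shark_speed": 0.03}  # solo hunting becomes ~impossible
```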

There is a lot more analysis to be done here. For instance, the exact parameters that induce cooperation can be quantified in much more detail. See the chart below: 4 dimensions are explored there, 3 of which have a significant influence on the average cooperation rate (over 20 models, with each model evaluated 20 times). First, a trivial influence is the radius within which cooperation is registered; increasing it increases the cooperation rate. More interesting are the number of initial fishes in the population and the speed of the sharks. Especially setting the shark speed to 0.03, which effectively halves it, increases the rate to the point where they only cooperate. A lower initial number of fishes makes the aquarium less crowded and thus catching a fish tougher; this increases the cooperation rate.
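For completeness, a sketch of how such a cooperation rate could be computed from evaluation rollouts (the repo's exact definition may differ):

```python
import numpy as np

def cooperation_rate(kill_events, coop_radius):
    """Fraction of kills that count as cooperative, i.e. the second shark was
    within coop_radius of the kill. kill_events is a list of
    (killer_pos, other_shark_pos) pairs collected during evaluation episodes."""
    if not kill_events:
        return 0.0
    coop = sum(float(np.linalg.norm(np.asarray(a) - np.asarray(b)) <= coop_radius)
               for a, b in kill_events)
    return coop / len(kill_events)

# the post's numbers average this over 20 trained models, each evaluated 20
# times, for every point in the 4-dimensional parameter grid
```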

There is a lot more to write here, but for now I will have to leave it at that. The repository, linked above, also contains a summary of the results and a few nice animations.
