
# Conflicting Samples and More Exploration

You should recognize a local minimum when you see one. Take the following example, which I've been battling with for a while. In my opinion it's a symptom of not enough exploration.

This is a pipe-separated list of 3-item tuples representing the likelihoods of three actions: rotating left, rotating right, and going forward. The highlighted tuple (0.008 0.498 0.495) is interesting, as it is not at all confident between right and forward. The cause is the sample we're learning from: 2, 2, 2, 2, 2, 1, 0, 2, 2, .. That sample rotates to the right, then to the left, and then goes forward. The two rotations are basically useless. Even worse, we take two different actions in the same state: rotating right (1) and going forward (2). See where I'm going with this? Those are exactly the two actions the highlighted tuple is not confident about.

0.064 0.199 0.739 | 0.029 0.011 0.962 | 0.03 0.06 0.912 | 0.018 0.023 0.961 | 0.013 0.022 0.967 | 0.008 0.498 0.495 | 0.996 0.003 0.002 | 0.008 0.498 0.495 | 0.996 0.003 0.002 | 0.008 0.498 0.495 | 0.996 0.003 0.002 | 0.008 0.498 0.495 | 0.996 0.003 0.002 | 0.008 0.498 0.495 | 0.996 0.003 0.002 | 0.008 0.498 0.495 | 0.996 0.003 0.002 | 0.008 0.498 0.495 | 0.996 0.003 0.002 | 0.008 0.498 0.495 | 0.996 0.003 0.002 | 0.008 0.498 0.495 | 0.996 0.003 0.002 | 0.008 0.498 0.495 | 0.996 0.003 0.002 | 0.008 0.498 0.495 | 0.996 0.003 0.002 | 0.008 0.498 0.495
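To make the conflicting-sample point concrete, here is a minimal sketch (not the GRAB0 code, just an illustration, with a made-up learning rate and iteration count) of why a softmax policy trained on a sample that takes both action 1 and action 2 in the same state ends up split roughly 50/50 between them:

```python
import numpy as np

# Hypothetical illustration: a single state in which the training sample
# takes action 1 (rotate right) once and action 2 (forward) once.
actions = [1, 2]  # conflicting targets for the same state

logits = np.zeros(3)  # one logit per action: left, right, forward
lr = 0.5

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for _ in range(1000):
    for a in actions:
        p = softmax(logits)
        grad = -p
        grad[a] += 1.0       # gradient of log p[a] w.r.t. logits: onehot(a) - p
        logits += lr * grad  # gradient ascent on the log-likelihood

print(softmax(logits))  # -> roughly [~0.0, ~0.5, ~0.5], like the 0.008 0.498 0.495 tuple
```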

The result of greedily evaluating the above policy is: 22222101010101010101010101010101010101010... So you end up rotating until the end and never reach the goal. Below is an ever so slightly different policy (check out the unconfident tuple: this time we go forward instead of rotating right when evaluating greedily; the greedy readout is just a per-step argmax, see the sketch after the evaluation).

0.064 0.188 0.749 | 0.029 0.01 0.962 | 0.03 0.057 0.914 | 0.018 0.022 0.962 | 0.013 0.021 0.968 | 0.008 0.486 0.507 | 0.002 0.001 0.998 | 0.002 0.001 0.999 | 0.001 0.999 0.002 | 0.001 0.001 1 | 0.001 0.001 0.999 | 1 0.001 0.001 | 0.001 0.001 1 | 0.001 0.001 1 | 0.001 1 0.001 | 0.001 0.001 1 | 0.001 0.001 1 | 1 0.001 0.001 | 0.001 0.001 1 | 0.001 1 0.001 | 0.001 0.001 1 | 0.001 0.001 1 | 0.498 0.001 0.503 | 0.001 0.001 1 | 0.001 0.001 1 | 0.001 0.001 1 | 0.001 0.001 1 | 1 0.001 0.001 |

Evaluation: 222222221220221220212222222022122
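As a sanity check, these evaluation strings are just the per-step argmax over the tuples in the listings. A few lines of Python reproduce the start of the first evaluation from the first listing (abbreviated here):

```python
import numpy as np

# First policy listing from above, abbreviated; each '|' separates one step's
# action probabilities (rotate left, rotate right, forward).
listing = ("0.064 0.199 0.739 | 0.029 0.011 0.962 | 0.03 0.06 0.912 | "
           "0.018 0.023 0.961 | 0.013 0.022 0.967 | 0.008 0.498 0.495 | "
           "0.996 0.003 0.002 | 0.008 0.498 0.495 | 0.996 0.003 0.002")

steps = [np.array(list(map(float, t.split()))) for t in listing.split("|")]
greedy = "".join(str(int(np.argmax(p))) for p in steps)
print(greedy)  # -> 222221010..., i.e. the rotation loop of the first evaluation
```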

So, this training sample sucks, and I need more exploration to get away from it. Let's see if my hypothesis is correct. I might update this post at some point.

Update: My theory was right, but the reason was wrong. See the issue here and the resolving commit here. What I saw above was an off-by-one mistake, as mentioned in the issue; simply a logic error on my part. What finally fixed it, however, was 'fuzzing' the gradient bandit in GRAB0 (bear in mind, the topic above came from that project) and noticing that, given a policy, the gradient bandit needs at least 2000 simulations to find a better one (and even then only most of the time, but that's good enough).
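For context, a gradient bandit in the Sutton & Barto style updates per-action preferences against a running reward baseline. The sketch below shows that update with a small amount of preference noise as a stand-in for the 'fuzzing'; the environment hook, the noise scale, and the exact wiring inside GRAB0 are assumptions on my part, only the ~2000-simulation figure comes from the observation above.

```python
import numpy as np

rng = np.random.default_rng(0)

def gradient_bandit(simulate, n_actions=3, n_sims=2000, alpha=0.1, fuzz=0.01):
    """Standard gradient-bandit update with a little preference noise ('fuzzing').
    `simulate(action) -> reward` is a hypothetical stand-in for the GRAB0 environment."""
    H = np.zeros(n_actions)  # action preferences
    baseline, t = 0.0, 0
    for _ in range(n_sims):  # the observation above suggests ~2000 simulations
        pi = np.exp(H - H.max()); pi /= pi.sum()
        a = rng.choice(n_actions, p=pi)
        r = simulate(a)
        t += 1
        baseline += (r - baseline) / t           # running average reward
        H -= alpha * (r - baseline) * pi         # push all actions down ...
        H[a] += alpha * (r - baseline)           # ... and the taken one back up
        H += rng.normal(0.0, fuzz, n_actions)    # 'fuzzing' the preferences
    pi = np.exp(H - H.max())
    return pi / pi.sum()

# Toy usage: a 3-armed bandit where action 2 (forward) has the highest mean reward.
policy = gradient_bandit(lambda a: rng.normal([0.0, 0.1, 1.0][a], 1.0))
print(policy)  # mass usually concentrates on action 2 (only most of the time, as noted above)
```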
