
# pb_c_init and pb_c_base

Just a short note on these two variables from the AlphaZero pseudocode. Let's dive right into it.

Decreasing pb_c_base gives visit counts a higher priority in the exploration term. Check out the C++ snippet below (it's from a research project I am working on; see the repository linked at the end of this post).

// Exploration multiplier: grows with the parent's visit count,
// shrinks for children that have already been visited often.
double pb_c = std::log((parent_node->visits + base + 1) / base) + pb_c_init;
pb_c *= std::sqrt(parent_node->visits) / (child_node->visits + 1);

// Prior score: exploration multiplier times the policy network's
// probability for this child's action.
double prior_score = pb_c * action_probs[0][child_node->action].item<double>();
return mean_q + prior_score;
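For context, during the selection phase this score is computed for every child of the current node, and the search descends into the child with the highest score. Below is a minimal, self-contained sketch of that loop; the Node struct and the function names are simplified stand-ins of my own, not the actual code from the repository.

#include <cmath>
#include <vector>

struct Node {
    int visits = 0;
    double mean_q = 0.0;  // running average of backed-up values
    double prior = 0.0;   // action probability from the policy network
    std::vector<Node*> children;
};

// UCB-style score as in the AlphaZero pseudocode.
double ucb_score(const Node* parent, const Node* child,
                 double pb_c_base, double pb_c_init) {
    double pb_c = std::log((parent->visits + pb_c_base + 1) / pb_c_base) + pb_c_init;
    pb_c *= std::sqrt(parent->visits) / (child->visits + 1);
    return child->mean_q + pb_c * child->prior;
}

// Selection: pick the child that maximizes the score.
Node* select_child(Node* parent, double pb_c_base, double pb_c_init) {
    Node* best = nullptr;
    double best_score = -1e30;
    for (Node* child : parent->children) {
        double score = ucb_score(parent, child, pb_c_base, pb_c_init);
        if (score > best_score) {
            best_score = score;
            best = child;
        }
    }
    return best;
}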

You can see that the visit count is normalized by base. If base is the default 19652 from the AlphaZero pseudocode, but your MCTS only runs 50 simulations (as in my case), the parent gets at most 50 visits, and the log term never becomes significant:

double pb_c = std::log((50 + 19652 + 1) / 19652.0) + pb_c_init;
// = 0.00259 + pb_c_init

With pb_c_base set to 500 we suddenly get:

double pb_c = std::log((50 + 500 + 1) / 500.0) + pb_c_init;
// = 0.09713 + pb_c_init

Definitely a difference.
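To see how this gap develops over a longer search, here is a small stand-alone program (my own illustration, not code from the repository) that prints the log term for a few parent visit counts under both bases:

#include <cmath>
#include <cstdio>

int main() {
    // Log term of pb_c for the pseudocode default base and a smaller one.
    for (int visits : {10, 50, 100, 1000, 19652}) {
        double term_default = std::log((visits + 19652 + 1) / 19652.0);
        double term_small = std::log((visits + 500 + 1) / 500.0);
        std::printf("visits=%5d  base=19652: %.5f  base=500: %.5f\n",
                    visits, term_default, term_small);
    }
}

With the default base the term only becomes comparable to pb_c_init once the visit count approaches the base itself, which a 50-simulation search never gets close to.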

Now, pb_c_init on the other hand controls the weight of the prior score relative to mean_q. The default here is 1.25, i.e. with the base-500 result above we get a factor of roughly 1.35 before multiplying by the visit ratio and the action probability. Basically, one can adjust the significance of the prior score in relation to the mean Q value: if pb_c_init is set lower, the MCTS bases its exploration more on the mean Q value; if it is set higher, the Q value becomes less significant.
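To make that concrete, here is a short sketch with made-up example values (mean_q = 0.5, a prior of 0.3, 50 parent visits, 4 child visits, pb_c_base = 500; none of these come from the post above) showing how the total score shifts as pb_c_init grows:

#include <cmath>
#include <cstdio>

int main() {
    // Hypothetical example values, only to illustrate the trade-off.
    double mean_q = 0.5, prior = 0.3;
    int parent_visits = 50, child_visits = 4;
    for (double pb_c_init : {0.5, 1.25, 3.0}) {
        double pb_c = std::log((parent_visits + 500 + 1) / 500.0) + pb_c_init;
        pb_c *= std::sqrt(parent_visits) / (child_visits + 1);
        double prior_score = pb_c * prior;
        std::printf("pb_c_init=%.2f  prior_score=%.3f  total=%.3f\n",
                    pb_c_init, prior_score, mean_q + prior_score);
    }
}

The mean Q value stays fixed while the prior score scales roughly linearly with pb_c_init, so higher values push the search toward the network's priors and lower values toward the observed returns.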

For the interested reader, I have both a Python and C++ version of AlphaZero implemented here: https://github.com/instance01/GRAB0/
