What a rookie mistake, but a funny one indeed. So it all started with me seeing what I'd mistakenly called avoidable bias.
The train error rate bottoms out at roughly 28, which is nowhere near 0 and definitely not satisfying. At first, this looks like underfitting and a big chunk of avoidable bias. As per theory, one should increase the number of parameters to deal with underfitting, so that's exactly what I did. One thing to keep in mind: statistical error (or variance), i.e. the difference between train and test errors, tends to grow with the number of parameters. So if we overshoot, we overfit badly. In that case, increasing the number of data points might help.
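To make the terminology concrete, here is a minimal sketch of how the error terms relate. The function name `decompose_error` and the test error of 31 are made up for illustration; only the train error of 28 comes from my actual run. The structural bias is an input here because in practice you can't read it off a training curve.

```python
def decompose_error(train_error, test_error, structural_bias=0.0):
    """Split errors using this post's terminology (illustrative helper)."""
    return {
        # Error no model can remove, e.g. because the target is ambiguous.
        "structural_bias": structural_bias,
        # The part of the train error a bigger or longer-trained model could fix.
        "avoidable_bias": train_error - structural_bias,
        # Statistical error: the gap between test and train error.
        "variance": test_error - train_error,
    }

# Train error from the post; the test error is a hypothetical value.
parts = decompose_error(train_error=28.0, test_error=31.0)
print(parts)
```

If the whole train error is actually structural (`structural_bias=28.0`), the avoidable bias comes out as 0, and adding parameters has nothing left to fix.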
With all this in mind, I just went ahead. Now, what I kind of forgot is how quickly the number of parameters grows in fully connected networks. The issue with the data is that it is very high dimensional: the input size is 4096 and the output size was 180 at this stage. With literally zero hidden layers and just that input and output, we already have 4096 * 180 = 737280 weights. So anyways, I started with a single hidden layer of 2000, then 5000, then 10000 nodes, and at some point ended up with two massive 50000-node layers. I also tested ten layers of 10000 nodes each at some point. Let's do some quick maths:
4096 * 50000 + 50000 * 50000 + 50000 * 180 = 2713800000 ≈ 2.7e+9
The layer products add up rather than multiply, since weights only connect adjacent layers. That's ~2.7 billion weights (not counting bias terms), roughly 3700 times the 737280 of the model without hidden layers. I was astonished that the so-called 'avoidable' bias was not going away. And of course, training the models became very slow too. Worse, the train error stayed at an even higher level with a higher number of parameters!
Two main takeaways here:
- This was structural bias (unavoidable), so no amount of extra capacity could remove it.
- Training ~2.7 billion parameters leads to the train error staying elevated because training is so slow. After the same amount of time (when compared to the simple 700k parameter model) we simply haven't learnt anything and thus the train error stays high.
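For sanity-checking parameter counts before kicking off a training run, here is a small sketch. It assumes plain fully connected layers, where each layer has one weight per input-output pair, so the total is the sum of the products of adjacent layer widths. The helper name `count_parameters` is mine, not from any library.

```python
def count_parameters(layer_sizes, include_biases=False):
    """Count parameters of a fully connected net given its layer widths."""
    # One weight per connection between each pair of adjacent layers.
    weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    # One bias per node in every layer except the input layer.
    biases = sum(layer_sizes[1:]) if include_biases else 0
    return weights + biases

no_hidden = count_parameters([4096, 180])                    # 737280
two_wide = count_parameters([4096, 50000, 50000, 180])       # 2713800000
ten_deep = count_parameters([4096] + [10000] * 10 + [180])   # 942760000

print(no_hidden, two_wide, ten_deep)
```

Most of the two-layer count comes from the single 50000 * 50000 block between the hidden layers, which is why widening hidden layers gets expensive so quickly.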
After changing my data I ended up with no structural bias and a bit of avoidable bias. I upped the number of training epochs and got the train error close to 0.
Observe below the relationships between different types of errors and model complexity (number of parameters) or number of training epochs.
The data was related to this post and this project. Specifically, I tried to predict the parameters (i.e. the exact graph) that lead to a 64×64 QUBO matrix of a Maximum Cut problem. At first, I encoded the output as the list of edge tuples (since we had 90 edges pre-defined, that's 180 output values, two node indices per edge). I am not 100% sure where the structural bias is coming from, but my hunch is that it is the ordering of edges. Now, it's not as simple as saying that the edges are stored in a dict and thus they're ordered in a random way. That's not the case since Python 3.7: the edges (=dict) keep insertion order. But what if the insertion itself was random? That is exactly the case here. I am using gnm_random_graph, which generates the edges randomly and therefore inserts them in random order. The same graph can show up with its 90 edges listed in any order, so for a given QUBO there is no single correct edge list to predict. That arbitrary ordering is structural bias that cannot be learnt by a neural network. So there you have it. I ended up predicting a full adjacency matrix, which does not depend on edge order and thus gets rid of the structural bias. Works now, nice!
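Here is a minimal sketch of why the two encodings behave so differently. It uses only the standard library, with a hypothetical `random_edges` helper as a stand-in for gnm_random_graph, so this is not my actual data pipeline. The sizes (64 nodes, 90 edges) match the setup above.

```python
import random

def random_edges(n_nodes, n_edges, rng):
    """Draw distinct undirected edges in random order (stand-in for gnm_random_graph)."""
    all_pairs = [(u, v) for u in range(n_nodes) for v in range(u + 1, n_nodes)]
    return rng.sample(all_pairs, n_edges)

def edge_list_target(edges):
    """Flatten [(u, v), ...] into [u0, v0, u1, v1, ...]; depends on edge order."""
    return [node for edge in edges for node in edge]

def adjacency_target(edges, n_nodes):
    """Symmetric 0/1 adjacency matrix; does not depend on edge order."""
    adj = [[0] * n_nodes for _ in range(n_nodes)]
    for u, v in edges:
        adj[u][v] = 1
        adj[v][u] = 1
    return adj

rng = random.Random(0)
edges = random_edges(64, 90, rng)
reordered = edges[::-1]  # same graph, edges listed in a different order

same_edge_list = edge_list_target(edges) == edge_list_target(reordered)
same_adjacency = adjacency_target(edges, 64) == adjacency_target(reordered, 64)
print(len(edge_list_target(edges)), same_edge_list, same_adjacency)
```

The flattened edge list has 180 entries and changes whenever the order changes, while the adjacency matrix is identical for both orderings. With the edge list, the network is asked to reproduce an order that the input says nothing about.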