## Multiplying large numbers with Neural Networks

While working on QUBO-NN, one of the problem types (specifically Quadratic Knapsack) caused trouble: I could not train a good neural network for it, meaning one with a decent $R^2$ coefficient. One of my ideas was that the large numbers that have to be multiplied to reach a solution might be the cause. Judging from this stackexchange post, I was kind of right. In the end it was not the main cause, but it did influence how many nodes the neural network needed in order to train decent models.

I decided to do a small experiment to test this out.

First, and this is also known in the literature (though the publication is rather old): the larger the numbers, the tougher it gets to train a neural network to multiply them. The first experiment is set up as follows. The dataset consists of 10000 randomly generated numbers in the range 1 to n, where n is varied across configurations (50, 200, 500, 2000). The neural network architecture is fixed and consists of just one hidden layer of size 20. The optimizer of choice is Adam with a learning rate of 0.005, and the activation function is ReLU. I always train for 500 epochs with a batch size of 10. The dataset is normalized to lie between 0 and 1.
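To make the setup concrete, here is a minimal sketch of how such a dataset could be generated and normalized. It is not the original experiment code: the helper names `make_dataset` and `min_max_scale`, the use of pairs of factors with their product as target, and the choice of min-max scaling are all assumptions on my part.

```python
import random

def make_dataset(n, size=10_000, seed=0):
    """Draw `size` pairs of integers from [1, n] and compute their products.

    Assumption: each sample is a pair of factors with the product as target.
    """
    rng = random.Random(seed)
    pairs = [(rng.randint(1, n), rng.randint(1, n)) for _ in range(size)]
    products = [a * b for a, b in pairs]
    return pairs, products

def min_max_scale(values):
    """Linearly rescale values so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

n = 2000
pairs, products = make_dataset(n)

# Both factors share one scale so that they are treated identically.
flat = min_max_scale([v for pair in pairs for v in pair])
inputs = list(zip(flat[0::2], flat[1::2]))

# Targets (the products) get their own scale, since they range up to n * n.
targets = min_max_scale(products)
```

Note that the products span a much wider range than the factors (up to n squared), so after scaling most targets end up close to 0 for large n.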

The next figure shows an interesting result.

Looking only at the first 300 epochs, one would conclude that the model for n=2000 is a failure, with an $R^2$ coefficient below 0.96. The jumps are also extremely interesting: each model has its own jump in the $R^2$ coefficient, and the lower n is, the earlier the jump happens.
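For reference, the $R^2$ coefficient compares the squared error of the predictions against the squared error of always predicting the mean. The sketch below uses the standard definition; the function name `r2_score` is just my choice here, and I am not claiming this is the exact routine used to produce the figures.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0]
perfect = r2_score(y_true, [1.0, 2.0, 3.0])   # predictions match exactly
baseline = r2_score(y_true, [2.0, 2.0, 2.0])  # always predicting the mean
```

A perfect model scores 1 and the mean predictor scores 0, so a value of 0.96 still leaves 4% of the variance unexplained.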

Future work would include training for thousands of epochs and observing whether the worst model (with n=2000) still improves further.

The next experiment shows that including more nodes helps immensely (see the next figure). Here the worst configuration (n=2000) is trained on a neural network with a hidden layer of size 100 (instead of 20).
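To get a feeling for how much capacity the wider layer adds, here is a small framework-free sketch of a one-hidden-layer ReLU network. It assumes two inputs (the two factors), a single output and bias terms everywhere, none of which is spelled out above; the names `init_mlp`, `forward` and `count_parameters` are made up for illustration.

```python
import random

def init_mlp(hidden, n_inputs=2, seed=0):
    """Randomly initialise a network: n_inputs -> hidden (ReLU) -> 1 output."""
    rng = random.Random(seed)
    return {
        "w1": [[rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(hidden)],
        "b1": [0.0] * hidden,
        "w2": [rng.uniform(-1, 1) for _ in range(hidden)],
        "b2": 0.0,
    }

def forward(params, x):
    """Forward pass: ReLU hidden layer followed by a linear output unit."""
    hidden = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(params["w1"], params["b1"])
    ]
    return sum(w * h for w, h in zip(params["w2"], hidden)) + params["b2"]

def count_parameters(params):
    """Total number of trainable weights and biases."""
    return (sum(len(row) for row in params["w1"]) + len(params["b1"])
            + len(params["w2"]) + 1)

small = init_mlp(hidden=20)
large = init_mlp(hidden=100)
print(count_parameters(small), count_parameters(large))  # 81 401
```

Under these assumptions the wider network has roughly five times as many parameters, which is still tiny by any modern standard.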

The solution proposed in the previously linked stackexchange post, namely to use the logarithm, works extremely well: for the worst model (with n=2000), after just 6 epochs the $R^2$ coefficient is at $0.9999999999987522$. Damn. The same previously cited publication supports this idea of using the logarithm, with similar results. The idea is simple: before training, transform the input and target using the natural logarithm. Since $\log(a \cdot b) = \log a + \log b$, the network then only has to learn an addition. Afterwards, if one requires the actual numbers, simply take the exponential of the result.
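The following sketch illustrates why the transformed problem is so easy. It is only an illustration, not the training code: `relu_sum_unit` is a hypothetical hand-built neuron with weights (1, 1) and bias 0, standing in for what a trained network would have to approximate.

```python
import math

def relu_sum_unit(x1, x2):
    """A single ReLU neuron with weights (1, 1) and bias 0."""
    return max(0.0, 1.0 * x1 + 1.0 * x2)

a, b = 1234, 1987

# Transform the inputs with the natural logarithm before "training".
log_a, log_b = math.log(a), math.log(b)

# For factors >= 1 both logarithms are non-negative, so the ReLU never
# clips and the neuron computes the exact sum log(a) + log(b) = log(a * b).
log_product = relu_sum_unit(log_a, log_b)

# Undo the transform to recover the actual product.
product = math.exp(log_product)

assert math.isclose(product, a * b, rel_tol=1e-9)
```

In other words, in log space a single ReLU unit already represents the target function exactly, so the optimizer has very little left to do.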

In summary, training neural networks to multiply large numbers quickly leads to long training times, with a large number of epochs required to reach a good $R^2$ coefficient. The logarithm trick helps because it turns the multiplication into an addition, which is trivial for a neural network to learn. Adding more nodes also helps.

The source code for the experiments can be found here.