I find that the product rule is always forgotten in popular blog posts (see [1] and [2]) discussing RNNs and backpropagation through time (BPTT). Those posts make clear what is happening, but WHY exactly, in a mathematical sense, does the last output depend on all previous states? To answer this, let us look at the product rule [3].
Consider the following unrolled RNN.
Assume the following:
$$h_t = \sigma(W h_{t-1} + U x_t)$$
$$y_t = \mathrm{softmax}(V h_t)$$
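To make the recurrence concrete, here is a minimal sketch of the forward pass in NumPy. The dimensions, the seed, and the `rnn_forward` helper are all hypothetical choices for illustration, and $\sigma$ is taken to be $\tanh$:

```python
# Minimal forward pass for the RNN above: h_t = sigma(W h_{t-1} + U x_t),
# y_t = softmax(V h_t). All names and dimensions are illustrative only.
import numpy as np

def softmax(z):
    z = z - z.max()              # shift by the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def rnn_forward(W, U, V, xs, h0):
    """Unroll the RNN over the input sequence xs; sigma is taken to be tanh."""
    h, ys = h0, []
    for x in xs:
        h = np.tanh(W @ h + U @ x)   # h_t = sigma(W h_{t-1} + U x_t)
        ys.append(softmax(V @ h))    # y_t = softmax(V h_t)
    return ys, h

rng = np.random.default_rng(0)
d_h, d_x, d_y, T = 4, 3, 2, 5        # hidden/input/output sizes, sequence length
W = rng.normal(size=(d_h, d_h))
U = rng.normal(size=(d_h, d_x))
V = rng.normal(size=(d_y, d_h))
ys, _ = rnn_forward(W, U, V, [rng.normal(size=d_x) for _ in range(T)], np.zeros(d_h))
print(ys[-1])                        # the last output, which depends on all previous states
```

Note that the same `W` is applied at every time step; this weight sharing is exactly what makes the derivative below recursive.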
Using a mix of Leibniz' and Lagrange's notation, I now derive:

$$
\begin{aligned}
\frac{\partial h_t}{\partial W} &= \frac{\partial}{\partial W}\,\sigma(W h_{t-1} + U x_t) \\
&= \sigma'(W h_{t-1} + U x_t) \cdot \frac{\partial}{\partial W}\left(W h_{t-1} + U x_t\right) \\
&= \sigma'(W h_{t-1} + U x_t) \cdot \frac{\partial}{\partial W}\left(W h_{t-1}\right) \\
&= \sigma'(W h_{t-1} + U x_t) \cdot \left(\frac{\partial W}{\partial W}\,h_{t-1} + W\,\frac{\partial h_{t-1}}{\partial W}\right) \\
&= \sigma'(W h_{t-1} + U x_t) \cdot \left(h_{t-1} + W\,\frac{\partial h_{t-1}}{\partial W}\right)
\end{aligned}
$$

The chain rule is applied from line 1 to 2, and the product rule from line 3 to 4. Line 3 follows because $U x_t$ does not contain $W$ (which we are differentiating with respect to), and line 5 simply evaluates $\frac{\partial W}{\partial W}$. Now it can be seen immediately that the recursive term $W\,\frac{\partial h_{t-1}}{\partial W}$ expands the same way at every step, so each summand of the fully expanded result references further and further into the past.
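As a sanity check on this recursion, here is a tiny scalar sketch (hypothetical values, with $\sigma = \tanh$) that accumulates $\frac{\partial h_t}{\partial w}$ via $\sigma'(a_t)\,(h_{t-1} + w\,\frac{\partial h_{t-1}}{\partial w})$ and compares it against a finite-difference estimate:

```python
# Scalar RNN h_t = tanh(w*h_{t-1} + u*x_t): verify the product-rule recursion
#   dh_t/dw = tanh'(a_t) * (h_{t-1} + w * dh_{t-1}/dw)
# against a central finite-difference estimate. Values are illustrative only.
import math

def forward(w, u, xs, h0=0.0):
    """Run the scalar RNN and return all hidden states h_0..h_T."""
    hs = [h0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + u * x))
    return hs

def dh_dw(w, u, xs, h0=0.0):
    """Accumulate dh_t/dw step by step with the recursion from the derivation."""
    h, grad = h0, 0.0                        # dh_0/dw = 0: h_0 does not depend on w
    for x in xs:
        a = w * h + u * x
        sig_prime = 1.0 - math.tanh(a) ** 2  # tanh'(a)
        grad = sig_prime * (h + w * grad)    # product rule: h_{t-1} + w * dh_{t-1}/dw
        h = math.tanh(a)
    return grad

w, u, xs = 0.5, 0.3, [1.0, -0.7, 0.2, 0.9]
analytic = dh_dw(w, u, xs)
eps = 1e-6
numeric = (forward(w + eps, u, xs)[-1] - forward(w - eps, u, xs)[-1]) / (2 * eps)
print(analytic, numeric)                     # should agree to roughly 6 decimals
```

Because `grad` at step $t$ folds in `grad` from step $t-1$, fully expanding it reproduces exactly the sum over ever-earlier states described above.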
Lastly, since this post assumes the reader is already familiar with the topic, interested readers can find a really nice further explanation of BPTT here.