I find that the product rule is always forgotten in popular blog posts (see [1] and [2]) discussing RNNs and backpropagation through time (BPTT). It is clear what is happening in those posts, but WHY exactly, in a mathematical sense, does the last output depend on all previous states? For this, let us look at the product rule [3].
Consider the following unrolled RNN.
Assume the following:
Using a mix of Leibniz's and Lagrange's notation, I now derive:
// Chain rule
// Product rule
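As a sketch of how such a derivation can go, assume a scalar RNN with state update s_t = tanh(U x_t + W s_{t-1}), a constant initial state s_0, and the shorthand z_t = U x_t + W s_{t-1}. The tanh activation, the scalar weights, and the symbols s_t and z_t are assumptions made here for illustration and may differ from the setup above.

```latex
\begin{align*}
\frac{\partial s_t}{\partial W}
  &= \frac{\partial}{\partial W} \tanh(U x_t + W s_{t-1}) \\
  &= \tanh'(z_t) \cdot \frac{\partial}{\partial W}\bigl(U x_t + W s_{t-1}\bigr)
     && \text{// Chain rule} \\
  &= \tanh'(z_t) \cdot \frac{\partial}{\partial W}\bigl(W s_{t-1}\bigr) \\
  &= \tanh'(z_t) \cdot \frac{\partial}{\partial W}\bigl(W \cdot s_{t-1}(W)\bigr) \\
  &= \tanh'(z_t) \cdot \Bigl(s_{t-1} + W \frac{\partial s_{t-1}}{\partial W}\Bigr)
     && \text{// Product rule}
\end{align*}

% Unrolling the recursion for t = 3, with s_0 constant:
\begin{align*}
\frac{\partial s_3}{\partial W}
  &= \tanh'(z_3)\, s_2
   + W \tanh'(z_3) \tanh'(z_2)\, s_1
   + W^2 \tanh'(z_3) \tanh'(z_2) \tanh'(z_1)\, s_0
\end{align*}
```

Under these assumptions, the first summand involves s_2, the second s_1, and the third s_0, so every earlier state contributes to the gradient.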
The chain rule is applied going from line 1 to line 2, and the product rule going from line 4 to line 5. Line 3 follows simply because Ux does not contain W (the variable we are differentiating with respect to). Now, it can be immediately seen that each summand of the last result keeps referencing further and further into the past.
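To check this numerically, here is a minimal sketch for a scalar RNN. It assumes the update s_t = tanh(U*x_t + W*s_{t-1}) with s_0 = 0; the tanh activation, the scalar weights, the sample values, and all function names are hypothetical choices for illustration. It compares the product-rule recursion ds_t/dW = tanh'(z_t) * (s_{t-1} + W * ds_{t-1}/dW) against a finite-difference estimate, and against a "naive" gradient that treats s_{t-1} as a constant.

```python
import math


def forward(W, U, xs, s0=0.0):
    """Return [s_0, s_1, ..., s_T] for s_t = tanh(U*x_t + W*s_{t-1})."""
    states = [s0]
    for x in xs:
        states.append(math.tanh(U * x + W * states[-1]))
    return states


def grad_W(W, U, xs, s0=0.0):
    """d s_T / d W using the chain rule plus the product rule at every step."""
    states = forward(W, U, xs, s0)
    ds = 0.0  # d s_0 / d W = 0 because s_0 is a constant
    for t in range(1, len(states)):
        dtanh = 1.0 - states[t] ** 2  # tanh'(z) = 1 - tanh(z)^2
        # Product rule on W * s_{t-1}: s_{t-1} + W * d s_{t-1} / d W
        ds = dtanh * (states[t - 1] + W * ds)
    return ds


def grad_W_naive(W, U, xs, s0=0.0):
    """Same as grad_W for the last step only, but treats s_{T-1} as constant."""
    states = forward(W, U, xs, s0)
    return (1.0 - states[-1] ** 2) * states[-2]


def grad_W_numeric(W, U, xs, eps=1e-6):
    """Central finite-difference estimate of d s_T / d W."""
    hi = forward(W + eps, U, xs)[-1]
    lo = forward(W - eps, U, xs)[-1]
    return (hi - lo) / (2 * eps)


W, U, xs = 0.5, 0.8, [1.0, -0.5, 0.3]
analytic = grad_W(W, U, xs)
naive = grad_W_naive(W, U, xs)
numeric = grad_W_numeric(W, U, xs)
print("with product rule:   ", analytic)
print("without product rule:", naive)
print("finite difference:   ", numeric)
```

In this sketch, the recursion with the product rule should agree with the finite-difference estimate, while the naive version, which ignores the dependence of s_{t-1} on W, should not.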
Lastly, since this assumes the reader is familiar with the topic, a really nice further explanation of BPTT for the interested reader can be found here.