GRU vs LSTM

LSTM & GRU

Unlike the LSTM, the GRU does not maintain a separate memory cell to control the flow of information, and it has only two gates rather than the LSTM's three. Because it has fewer parameters while delivering comparable performance, when the two models are compared under a fixed parameter budget the GRU generally reaches a final performance similar to the LSTM's, but converges faster both in CPU time and in the number of parameter updates.
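As a quick sanity check on the parameter difference, here is a minimal sketch (using PyTorch, which the original does not mention, with hypothetical layer sizes) that counts the parameters of an LSTM and a GRU with the same input and hidden sizes; the 4-vs-3 gate-block ratio shows up directly in the counts.

```python
import torch.nn as nn

input_size, hidden_size = 128, 128  # hypothetical sizes, chosen only for illustration

lstm = nn.LSTM(input_size, hidden_size)
gru = nn.GRU(input_size, hidden_size)

def n_params(module):
    """Total number of trainable parameters in a module."""
    return sum(p.numel() for p in module.parameters())

# The LSTM has 4 weight blocks (input, forget, output gates plus the candidate),
# the GRU has 3 (reset gate, update gate, candidate), so the counts differ by
# roughly a 4:3 ratio.
print("LSTM parameters:", n_params(lstm))  # 132096
print("GRU parameters: ", n_params(gru))   # 99072
```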

How the LSTM Helps Prevent Vanishing Gradients

There are many variations of the LSTM. This answer uses as an example the version described in this paper on arxiv.org. Take a close look at the formula that computes the cell state $c_t^l$:

$$c_t^l = f_t \odot c_{t-1}^l + i_t \odot \tilde{c}_t^l$$

where $f_t$ is the forget gate, $i_t$ is the input gate, and $\tilde{c}_t^l$ is the candidate cell state at layer $l$.

Now suppose that at time step $t$ you have the error $\frac{\partial L}{\partial c_t^l}$. Then, by the chain rule, the term $c_{t-1}^l$ receives the error:

$$\frac{\partial L}{\partial c_{t-1}^l} = f_t \odot \frac{\partial L}{\partial c_t^l}$$
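To make this chain-rule step concrete, here is a small sketch (PyTorch autograd with arbitrary gate values; none of this appears in the original) verifying that the gradient reaching $c_{t-1}^l$ is exactly the forget gate times the gradient at $c_t^l$.

```python
import torch

# One cell-state update: c_t = f * c_{t-1} + i * c_tilde (all elementwise).
# Gate values here are arbitrary constants, used only to check the gradient path.
c_prev = torch.randn(4, requires_grad=True)   # c_{t-1}^l
f = torch.sigmoid(torch.randn(4))             # forget gate f_t
i = torch.sigmoid(torch.randn(4))             # input gate i_t
c_tilde = torch.tanh(torch.randn(4))          # candidate cell state

c_t = f * c_prev + i * c_tilde

# Pretend some loss L sends an error signal dL/dc_t back into the cell state.
grad_c_t = torch.randn(4)
c_t.backward(grad_c_t)

# Chain rule: dL/dc_{t-1} = f ⊙ dL/dc_t -- the error is rescaled only by f.
print(torch.allclose(c_prev.grad, f * grad_c_t))  # True
```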

Given that $f_t$, the forget gate, is the rate at which you want the network to forget its past memory, the error signal described above is propagated to the previous time step scaled only by that gate, with no weight matrix or squashing nonlinearity in the path. In many LSTM papers this is referred to as the linear carousel that prevents the gradient from vanishing across many time steps. Most versions of the LSTM that I am aware of share this property.
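Unrolled over many steps, the same sketch shows what the carousel buys you: the gradient reaching the initial cell state is exactly the elementwise product of the forget gates, so it decays only where the network actually chooses to forget. (The step count and gate values below are hypothetical, chosen purely for illustration.)

```python
import torch

T = 100                                       # number of unrolled time steps
f_t = torch.tensor([1.0, 0.99, 0.9, 0.5])     # per-unit forget gates, held fixed for clarity
i_t = torch.tensor([0.5, 0.5, 0.5, 0.5])      # input gate; irrelevant to the gradient w.r.t. c_0

c0 = torch.ones(4, requires_grad=True)
c = c0
for _ in range(T):
    c = f_t * c + i_t * torch.tanh(torch.randn(4))  # c_t = f ⊙ c_{t-1} + i ⊙ c~_t

c.sum().backward()

# The gradient through the cell path is just the product of the forget gates:
# no weight matrices or tanh derivatives are multiplied in, so it shrinks only
# where f is far from 1.
print(c0.grad)    # ≈ [1.0, 0.366, 2.7e-05, 7.9e-31]
print(f_t ** T)   # identical values
```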
