GRU vs LSTM
LSTM & GRU
Unlike LSTM, GRU does not maintain a separate memory cell to control information flow, and it has only two gates rather than the three in LSTM. Because it has fewer parameters yet comparable performance, when a fixed parameter budget is used for both models, GRU generally reaches a final performance similar to LSTM's, but it converges faster both in CPU time and in number of parameter updates.
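
A quick way to see the parameter difference is to count the weights of off-the-shelf recurrent layers. The sketch below assumes PyTorch's `nn.LSTM` and `nn.GRU`; the layer sizes are arbitrary illustration values, not taken from any particular paper.

```python
import torch.nn as nn

# Same input and hidden sizes for both models (arbitrary illustration values).
lstm = nn.LSTM(input_size=128, hidden_size=128)
gru = nn.GRU(input_size=128, hidden_size=128)

def n_params(module):
    """Total number of trainable parameters in a module."""
    return sum(p.numel() for p in module.parameters())

# LSTM uses 4 weight blocks (input, forget, output gates + candidate),
# GRU uses 3 (reset, update gates + candidate), so GRU is roughly 3/4 the size.
print("LSTM:", n_params(lstm))  # ~132k parameters for these sizes
print("GRU: ", n_params(gru))   # ~99k parameters for these sizes
```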



How LSTM Helps Eliminate Gradient Vanishing
There are many variations of LSTM. This answer takes as an example the version described in this paper (Page on arxiv.org). Take a close look at the formula to compute $c_t^l$ (up to notation, the standard cell-state update):

$$c_t^l = f_t^l \odot c_{t-1}^l + i_t^l \odot \tilde{c}_t^l$$
Now suppose that at time step $t$ you have the error $\frac{\partial L}{\partial c_t^l}$; then, by the chain rule, the term $c_{t-1}^l$ receives through the cell-state path the error:

$$f_t^l \odot \frac{\partial L}{\partial c_t^l}$$
Given that $f$, the forget gate, is the rate at which you want the network to forget its past memory, the error signal described above is propagated essentially intact to the previous time step. In many LSTM papers, this is referred to as the linear carousel that prevents the gradient from vanishing across many time steps. In most versions of LSTM that I am aware of, the formulas share this property.
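
To make the carousel concrete, here is a small sketch (assuming PyTorch autograd; the gate values and hidden size are made up for illustration) that builds one cell-state update $c_t = f \odot c_{t-1} + i \odot \tilde{c}_t$ and checks that the gradient flowing back to $c_{t-1}$ through this path is exactly the incoming error scaled elementwise by the forget gate.

```python
import torch

hidden = 4                                          # arbitrary hidden size
c_prev = torch.randn(hidden, requires_grad=True)    # c_{t-1}
f = torch.rand(hidden)                              # forget gate activations in (0, 1)
i = torch.rand(hidden)                              # input gate activations
c_tilde = torch.tanh(torch.randn(hidden))           # candidate cell content

# One cell-state update: c_t = f * c_{t-1} + i * c_tilde  (the linear carousel)
c_t = f * c_prev + i * c_tilde

# Backpropagate a dummy error dL/dc_t of all ones.
grad_c_prev, = torch.autograd.grad(c_t, c_prev, grad_outputs=torch.ones(hidden))

# The error reaching c_{t-1} through this path is exactly f * dL/dc_t = f.
print(torch.allclose(grad_c_prev, f))  # True
```

Because the update is additive and only scaled by $f$ (no squashing nonlinearity on the cell-state path), the gradient is not repeatedly multiplied by small derivatives the way it is in a plain RNN.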