# GRU vs LSTM

## LSTM & GRU

Unlike the LSTM, the GRU does not maintain a separate memory cell to control information flow, and it has only two gates rather than the LSTM's three. Because it has fewer parameters yet comparable performance, when the two models are compared at a fixed parameter budget the GRU generally reaches a similar final performance to the LSTM but converges faster, both in CPU time and in the number of parameter updates.
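To make the parameter difference concrete, here is a back-of-the-envelope sketch (a minimal count assuming the standard single-layer formulations: four gate blocks for the LSTM, three for the GRU, each block with an input-to-hidden matrix, a hidden-to-hidden matrix, and a bias; real implementations may add peepholes or extra biases):

```python
def lstm_params(input_size, hidden_size):
    """LSTM has 4 gate blocks: input, forget, output, and candidate."""
    per_block = (hidden_size * input_size      # input-to-hidden weights
                 + hidden_size * hidden_size   # hidden-to-hidden weights
                 + hidden_size)                # bias
    return 4 * per_block

def gru_params(input_size, hidden_size):
    """GRU has 3 blocks: reset gate, update gate, and candidate state."""
    per_block = (hidden_size * input_size
                 + hidden_size * hidden_size
                 + hidden_size)
    return 3 * per_block

print(lstm_params(128, 256))  # 394240
print(gru_params(128, 256))   # 295680 -- exactly 3/4 of the LSTM count
```

So at the same hidden size the GRU uses 3/4 of the LSTM's recurrent parameters; equivalently, at a fixed parameter budget the GRU can afford a larger hidden state.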

![LSTM](https://443921002-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LGHUhl6VYqrZm4Re77O%2F-LGWGVF7wn8w4ZmO43iZ%2F-LGWJ5Sc3RM-RX13lIJ9%2FScreen%20Shot%202018-07-03%20at%2011.03.12%20AM.png?alt=media\&token=7bfa482a-35b7-4da5-98c6-20bbbfed5eb4)

<div align="left"><img src="https://443921002-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LGHUhl6VYqrZm4Re77O%2F-LUpfVBxtxnD6fSnHI6h%2F-LUpggnw2w1cHmjQmTgu%2FScreen%20Shot%202018-12-28%20at%208.42.54%20AM.png?alt=media&#x26;token=57f26e33-0bc8-468c-86d0-765f99f56106" alt=""></div>

![GRU](https://443921002-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LGHUhl6VYqrZm4Re77O%2F-LGWGVF7wn8w4ZmO43iZ%2F-LGWJAbIm2vHjMDyvq4j%2FScreen%20Shot%202018-07-03%20at%2011.02.23%20AM.png?alt=media\&token=2dabc766-71ed-465a-baa6-8dad0e752bcd)

## How LSTM Helps Eliminate Gradient Vanishing

There are many variations of the LSTM. This section takes as its example the version described in [this paper](http://arxiv.org/pdf/1409.2329v5.pdf).

Take a close look at the formula for the cell state $c_t^l$ (layer $l$, time step $t$):

<div align="left"><img src="https://443921002-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LGHUhl6VYqrZm4Re77O%2F-LUpfVBxtxnD6fSnHI6h%2F-LUpg5M5NtzMwIObrP3T%2FScreen%20Shot%202018-12-28%20at%208.40.15%20AM.png?alt=media&#x26;token=13ecafe8-51c8-438a-bbba-f9d72b7c8316" alt=""></div>
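As a minimal sketch of this update (scalar values and hypothetical pre-activation inputs, purely for illustration; in a real network the pre-activations are affine functions of $x_t$ and $h_{t-1}$), the cell-state equation $c_t = f \odot c_{t-1} + i \odot g$ looks like:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_cell_state(c_prev, f_pre, i_pre, g_pre):
    """One LSTM cell-state update: c_t = f * c_{t-1} + i * g.

    f_pre, i_pre, g_pre are scalar pre-activations of the forget gate,
    input gate, and candidate memory content, respectively.
    """
    f = sigmoid(f_pre)    # forget gate: how much old memory to keep
    i = sigmoid(i_pre)    # input gate: how much new content to write
    g = math.tanh(g_pre)  # candidate memory content
    return f * c_prev + i * g

# With f saturated near 1 and i near 0, the old memory passes through intact:
print(lstm_cell_state(1.0, 100.0, -100.0, 0.0))  # ~1.0
```

The key point for the gradient argument below is that $c_{t-1}$ enters this update *additively*, multiplied only by the gate value $f$.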

Now suppose that at time step $t$ you have the error $\frac{\partial L}{\partial c_t^l}$. By the chain rule, the term $c_{t-1}^l$ receives the error:

<div align="left"><img src="https://443921002-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LGHUhl6VYqrZm4Re77O%2F-LUpfVBxtxnD6fSnHI6h%2F-LUpgHZEfq7vA9xsmCkn%2FScreen%20Shot%202018-12-28%20at%208.40.20%20AM.png?alt=media&#x26;token=891bbadf-5ecf-4af6-8831-99cc4cb419d7" alt=""></div>

Given that $f$, the forget gate, is the rate at which you want the network to retain its past memory, the error signal above is propagated to the previous time step scaled only by $f$, with no intervening weight matrices or squashing nonlinearities. In many LSTM papers this additive cell-state path is referred to as the **linear carousel**, and it is what prevents the gradient from vanishing across many time steps.

Most versions of the LSTM that I am aware of share this same property, since their cell-state updates have the same additive form.
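This argument can be checked numerically. Since $\partial c_t / \partial c_{t-1} = f_t$ along the cell-state path, the gradient across $T$ steps is just the product of the forget-gate values, whereas a vanilla RNN multiplies by a weight-and-$\tanh'$ factor at every step (the factor 0.5 below is an illustrative assumption, not a derived value):

```python
def carousel_gradient(per_step_factors):
    """Gradient of c_T w.r.t. c_0 along the cell-state path.

    By the chain rule on c_t = f_t * c_{t-1} + i_t * g_t, this is
    simply the product of the per-step factors (the forget gates).
    """
    grad = 1.0
    for f in per_step_factors:
        grad *= f
    return grad

# If the network learns to keep its memory (f close to 1), the error
# signal survives 100 steps almost intact:
print(carousel_gradient([0.99] * 100))  # ~0.37

# A vanilla RNN instead multiplies by |W| * tanh'(.) each step; if that
# factor is, say, 0.5, the gradient has vanished after 100 steps:
print(carousel_gradient([0.5] * 100))   # ~7.9e-31
```

The contrast is the whole point of the carousel: the LSTM can *learn* to set $f_t \approx 1$ wherever memory should persist, while the vanilla RNN's per-step factor is fixed by its weights and activation derivatives.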

