9.3. Deep Recurrent Neural Networks
Open the notebook in Colab
Open the notebook in Colab
Open the notebook in Colab

Up to now, we only discussed recurrent neural networks with a single unidirectional hidden layer. In it the specific functional form of how latent variables and observations interact was rather arbitrary. This is not a big problem as long as we have enough flexibility to model different types of interactions. With a single layer, however, this can be quite challenging. In the case of the perceptron, we fixed this problem by adding more layers. Within RNNs this is a bit trickier, since we first need to decide how and where to add extra nonlinearity. Our discussion below focuses primarily on LSTMs, but it applies to other sequence models, too.

  • We could add extra nonlinearity to the gating mechanisms. That is, instead of using a single perceptron we could use multiple layers. This leaves the mechanism of the LSTM unchanged. Instead it makes it more sophisticated. This would make sense if we were led to believe that the LSTM mechanism describes some form of universal truth of how latent variable autoregressive models work.

  • We could stack multiple layers of LSTMs on top of each other. This results in a mechanism that is more flexible, due to the combination of several simple layers. In particular, data might be relevant at different levels of the stack. For instance, we might want to keep high-level data about financial market conditions (bear or bull market) available, whereas at a lower level we only record shorter-term temporal dynamics.

Beyond all this abstract discussion it is probably easiest to understand the family of models we are interested in by reviewing Fig. 9.3.1. It describes a deep recurrent neural network with \(L\) hidden layers. Each hidden state is continuously passed to both the next timestep of the current layer and the current timestep of the next layer.


Fig. 9.3.1 Architecture of a deep recurrent neural network.

9.3.1. Functional Dependencies

At timestep \(t\) we assume that we have a minibatch \(\mathbf{X}_t \in \mathbb{R}^{n \times d}\) (number of examples: \(n\), number of inputs: \(d\)). The hidden state of hidden layer \(\ell\) (\(\ell=1,\ldots, L\)) is \(\mathbf{H}_t^{(\ell)} \in \mathbb{R}^{n \times h}\) (number of hidden units: \(h\)), the output layer variable is \(\mathbf{O}_t \in \mathbb{R}^{n \times q}\) (number of outputs: \(q\)) and a hidden layer activation function \(f_l\) for layer \(l\). We compute the hidden state of layer \(1\) as before, using \(\mathbf{X}_t\) as input. For all subsequent layers, the hidden state of the previous layer is used in its place.

(9.3.1)\[\begin{split}\begin{aligned} \mathbf{H}_t^{(1)} & = f_1\left(\mathbf{X}_t, \mathbf{H}_{t-1}^{(1)}\right), \\ \mathbf{H}_t^{(l)} & = f_l\left(\mathbf{H}_t^{(l-1)}, \mathbf{H}_{t-1}^{(l)}\right). \end{aligned}\end{split}\]

Finally, the output layer is only based on the hidden state of hidden layer \(L\). We use the output function \(g\) to address this:

(9.3.2)\[\mathbf{O}_t = g \left(\mathbf{H}_t^{(L)}\right).\]

Just as with multilayer perceptrons, the number of hidden layers \(L\) and number of hidden units \(h\) are hyper parameters. In particular, we can pick a regular RNN, a GRU, or an LSTM to implement the model.

9.3.2. Concise Implementation

Fortunately many of the logistical details required to implement multiple layers of an RNN are readily available in Gluon. To keep things simple we only illustrate the implementation using such built-in functionality. The code is very similar to the one we used previously for LSTMs. In fact, the only difference is that we specify the number of layers explicitly rather than picking the default of a single layer. Let us begin by importing the appropriate modules and loading data.

from d2l import mxnet as d2l
from mxnet import npx
from mxnet.gluon import rnn

batch_size, num_steps = 32, 35
train_iter, vocab = d2l.load_data_time_machine(batch_size, num_steps)

The architectural decisions (such as choosing parameters) are very similar to those of previous sections. We pick the same number of inputs and outputs as we have distinct tokens, i.e., vocab_size. The number of hidden units is still 256. The only difference is that we now select a nontrivial number of layers num_layers = 2.

vocab_size, num_hiddens, num_layers = len(vocab), 256, 2
device = d2l.try_gpu()
lstm_layer = rnn.LSTM(num_hiddens, num_layers)
model = d2l.RNNModel(lstm_layer, len(vocab))

9.3.3. Training

The actual invocation logic is identical to before. The only difference is that we now instantiate two layers with LSTMs. This rather more complex architecture and the large number of epochs slow down training considerably.

num_epochs, lr = 500, 2
d2l.train_ch8(model, train_iter, vocab, lr, num_epochs, device)
perplexity 1.1, 76138.9 tokens/sec on gpu(0)
time traveller for so it will be convenient to speak of him was
traveller but now you begin to seethe object of my investig

9.3.4. Summary

  • In deep recurrent neural networks, hidden state information is passed to the next timestep of the current layer and the current timestep of the next layer.

  • There exist many different flavors of deep RNNs, such as LSTMs, GRUs, or regular RNNs. Conveniently these models are all available as parts of the rnn module in Gluon.

  • Initialization of the models requires care. Overall, deep RNNs require considerable amount of work (such as learning rate and clipping) to ensure proper convergence.

9.3.5. Exercises

  1. Try to implement a two-layer RNN from scratch using the single layer implementation we discussed in Section 8.5.

  2. Replace the LSTM by a GRU and compare the accuracy.

  3. Increase the training data to include multiple books. How low can you go on the perplexity scale?

  4. Would you want to combine sources of different authors when modeling text? Why is this a good idea? What could go wrong?