.. _sec_deep_rnn:
Deep Recurrent Neural Networks
==============================
Up to now, we have only discussed recurrent neural networks with a single
unidirectional hidden layer. In such networks, the specific functional form
of how latent variables and observations interact is rather arbitrary. This
is not a big problem as long as we have enough flexibility to model
different types of interactions. With a single layer, however, this can
be quite challenging. In the case of the perceptron, we fixed this
problem by adding more layers. Within RNNs this is a bit trickier, since
we first need to decide how and where to add extra nonlinearity. Our
discussion below focuses primarily on LSTMs, but it applies to other
sequence models, too.
- We could add extra nonlinearity to the gating mechanisms. That is,
instead of using a single perceptron we could use multiple layers.
This leaves the *mechanism* of the LSTM unchanged. Instead it makes
it more sophisticated. This would make sense if we were led to
believe that the LSTM mechanism describes some form of universal
truth of how latent variable autoregressive models work.
- We could stack multiple layers of LSTMs on top of each other. This
results in a mechanism that is more flexible, due to the combination
of several simple layers. In particular, data might be relevant at
different levels of the stack. For instance, we might want to keep
high-level data about financial market conditions (bear or bull
market) available, whereas at a lower level we only record
shorter-term temporal dynamics.
Beyond all this abstract discussion, it is probably easiest to understand
the family of models we are interested in by reviewing
:numref:`fig_deep_rnn`. It describes a deep recurrent neural network
with :math:`L` hidden layers. Each hidden state is passed continuously
to both the next timestep of the current layer and the current timestep
of the next layer.
.. _fig_deep_rnn:

.. figure:: ../img/deep-rnn.svg

   Architecture of a deep recurrent neural network.
Functional Dependencies
-----------------------
At timestep :math:`t` we assume that we have a minibatch
:math:`\mathbf{X}_t \in \mathbb{R}^{n \times d}` (number of examples:
:math:`n`, number of inputs: :math:`d`). The hidden state of hidden
layer :math:`\ell` (:math:`\ell=1,\ldots, L`) is
:math:`\mathbf{H}_t^{(\ell)} \in \mathbb{R}^{n \times h}` (number of
hidden units: :math:`h`), the output layer variable is
:math:`\mathbf{O}_t \in \mathbb{R}^{n \times q}` (number of outputs:
:math:`q`), and the activation function of hidden layer :math:`\ell` is
:math:`f_\ell`. We compute the hidden state of layer :math:`1` as before,
using :math:`\mathbf{X}_t` as input. For all subsequent layers, the
hidden state of the previous layer is used in its place:
.. math::

   \begin{aligned}
   \mathbf{H}_t^{(1)} & = f_1\left(\mathbf{X}_t, \mathbf{H}_{t-1}^{(1)}\right), \\
   \mathbf{H}_t^{(\ell)} & = f_\ell\left(\mathbf{H}_t^{(\ell-1)}, \mathbf{H}_{t-1}^{(\ell)}\right).
   \end{aligned}
Finally, the output layer is based only on the hidden state of the
topmost hidden layer :math:`L`. We use the output function :math:`g` to
express this:

.. math:: \mathbf{O}_t = g\left(\mathbf{H}_t^{(L)}\right).
Just as with multilayer perceptrons, the number of hidden layers
:math:`L` and the number of hidden units :math:`h` are hyperparameters.
Moreover, we can pick a regular RNN, a GRU, or an LSTM to implement
the model.
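To make the recursion above concrete, here is a minimal NumPy sketch of one
timestep of a deep RNN with :math:`\tanh` updates (the helper
``deep_rnn_step`` and the parameter names are hypothetical, chosen for
illustration; the Gluon implementation below handles all of this
internally):

```python
import numpy as np

n, d, h, q, L = 2, 5, 4, 3, 2  # examples, inputs, hidden units, outputs, layers
rng = np.random.default_rng(0)

# One weight set per hidden layer; layer 1 reads the d-dimensional input,
# every later layer reads the h-dimensional state of the layer below.
W_x = [rng.normal(size=(d, h))] + [rng.normal(size=(h, h)) for _ in range(L - 1)]
W_h = [rng.normal(size=(h, h)) for _ in range(L)]
b = [np.zeros(h) for _ in range(L)]
W_q, b_q = rng.normal(size=(h, q)), np.zeros(q)

def deep_rnn_step(X_t, H_prev):
    """One timestep: layer 1 takes X_t, layer l takes layer l-1's new state."""
    H_new, inp = [], X_t
    for l in range(L):
        H = np.tanh(inp @ W_x[l] + H_prev[l] @ W_h[l] + b[l])
        H_new.append(H)
        inp = H  # becomes the input of the next layer up
    return H_new

X_t = rng.normal(size=(n, d))
H_prev = [np.zeros((n, h)) for _ in range(L)]
H_t = deep_rnn_step(X_t, H_prev)
O_t = H_t[-1] @ W_q + b_q  # the output depends only on the top layer's state
```

Repeating this step over timesteps, carrying ``H_t`` forward as the next
``H_prev``, unrolls the full deep RNN.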
Concise Implementation
----------------------
Fortunately, many of the logistical details required to implement
multiple layers of an RNN are readily available in Gluon. To keep things
simple we only illustrate the implementation using this built-in
functionality. The code is very similar to what we used previously
for LSTMs. In fact, the only difference is that we specify the number of
layers explicitly rather than picking the default of a single layer. Let
us begin by importing the appropriate modules and loading the data.
.. code:: python

    from d2l import mxnet as d2l
    from mxnet import npx
    from mxnet.gluon import rnn

    npx.set_np()

    batch_size, num_steps = 32, 35
    train_iter, vocab = d2l.load_data_time_machine(batch_size, num_steps)
The architectural decisions (such as the choice of hyperparameters) are very
similar to those of previous sections. We pick the same number of inputs
and outputs as we have distinct tokens, i.e., ``vocab_size``. The number
of hidden units is still 256. The only difference is that we now select
a nontrivial number of layers, ``num_layers = 2``.
.. code:: python

    vocab_size, num_hiddens, num_layers, ctx = len(vocab), 256, 2, d2l.try_gpu()
    lstm_layer = rnn.LSTM(num_hiddens, num_layers)
    model = d2l.RNNModel(lstm_layer, vocab_size)
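As a rough sketch of what one timestep of this two-layer LSTM computes, the
following NumPy code stacks two plain LSTM cell updates (the function
``lstm_step`` and the weight layout are assumptions made for illustration;
``rnn.LSTM`` fuses and optimizes these operations internally):

```python
import numpy as np

def lstm_step(X, H, C, W, b):
    """One LSTM cell update; W packs the four gates (i, f, o, candidate)."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    Z = np.concatenate([X, H], axis=1) @ W + b  # shape (n, 4h)
    i, f, o, g = np.split(Z, 4, axis=1)
    C_new = sigmoid(f) * C + sigmoid(i) * np.tanh(g)
    H_new = sigmoid(o) * np.tanh(C_new)
    return H_new, C_new

n, d, h = 2, 5, 4
rng = np.random.default_rng(0)
# Layer 1 maps (d + h) inputs to 4h gate activations; layer 2 maps (h + h).
W1, b1 = rng.normal(size=(d + h, 4 * h)), np.zeros(4 * h)
W2, b2 = rng.normal(size=(h + h, 4 * h)), np.zeros(4 * h)

X_t = rng.normal(size=(n, d))
H1 = C1 = H2 = C2 = np.zeros((n, h))
H1, C1 = lstm_step(X_t, H1, C1, W1, b1)  # bottom layer reads the input
H2, C2 = lstm_step(H1, H2, C2, W2, b2)   # top layer reads layer 1's new state
```

Only the top-layer state ``H2`` would feed the output layer; each layer also
carries its own memory cell ``C`` across timesteps.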
Training
--------
The actual invocation logic is identical to before. The only difference
is that we now instantiate two LSTM layers. This more complex
architecture and the large number of epochs slow down training
considerably.
.. code:: python

    num_epochs, lr = 500, 2
    d2l.train_ch8(model, train_iter, vocab, lr, num_epochs, ctx)

.. parsed-literal::
    :class: output

    Perplexity 1.1, 139008 tokens/sec on gpu(0)
    time traveller it s against reason said filby what reason said
    traveller it s against reason said filby what reason said
.. figure:: output_deep-rnn_d70a11_5_1.svg
Summary
-------
- In deep recurrent neural networks, hidden state information is passed
to the next timestep of the current layer and the current timestep of
the next layer.
- There exist many different flavors of deep RNNs, such as LSTMs, GRUs,
  or regular RNNs. Conveniently, these models are all available as part
  of the ``rnn`` module in Gluon.
- Initialization of the models requires care. Overall, deep RNNs
  require a considerable amount of work (such as tuning the learning
  rate and gradient clipping) to ensure proper convergence.
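The gradient clipping mentioned in the last point can be sketched as
follows (a minimal NumPy version of norm-based clipping; the helper name
``clip_gradients`` is hypothetical, for illustration only):

```python
import numpy as np

def clip_gradients(grads, theta):
    """Rescale a list of gradient arrays so their joint L2 norm is <= theta."""
    norm = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    if norm > theta:
        grads = [g * (theta / norm) for g in grads]
    return grads

grads = [np.full(3, 2.0), np.full(2, -2.0)]  # joint norm is sqrt(20), about 4.47
clipped = clip_gradients(grads, theta=1.0)   # rescaled to norm 1, same direction
```

Because the rescaling is uniform across all arrays, the gradient direction
is preserved; only its magnitude is capped.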
Exercises
---------
1. Try to implement a two-layer RNN from scratch using the single-layer
   implementation we discussed in :numref:`sec_rnn_scratch`.
2. Replace the LSTM by a GRU and compare the accuracy.
3. Increase the training data to include multiple books. How low can you
go on the perplexity scale?
4. Would you want to combine sources of different authors when modeling
text? Why is this a good idea? What could go wrong?
`Discussions `__
-------------------------------------------------
|image0|
.. |image0| image:: ../img/qr_deep-rnn.svg