# 7.5. Implementation of Recurrent Neural Networks from Scratch

In this section, we implement a language model from scratch. It is based on a character-level recurrent neural network that is trained on H. G. Wells’ ‘The Time Machine’. As before, we start by reading the dataset.

In [1]:

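import sys
sys.path.insert(0, '..')

import d2l
import math
from mxnet import autograd, nd
from mxnet.gluon import loss as gloss
import time

# Load the corpus; load_data_time_machine is assumed to be the d2l helper
# for this dataset
(corpus_indices, char_to_idx, idx_to_char, vocab_size) = \
    d2l.load_data_time_machine()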


## 7.5.1. One-hot Encoding

One-hot encoding vectors provide an easy way to express characters as vectors so that a deep network can process them. In a nutshell, we map each character to a different unit vector: assume that the number of different characters in the dictionary is $$N$$ (the vocab_size) and that each character corresponds one-to-one to an integer index from 0 to $$N-1$$. If the index of a character is the integer $$i$$, then we create a vector $$\mathbf{e}_i$$ of all 0s with a length of $$N$$ and set the element at position $$i$$ to 1. This vector is the one-hot vector of the original character. The one-hot vectors with indices 0 and 2 are shown below (the length of the vector is equal to the dictionary size).

In [2]:

nd.one_hot(nd.array([0, 2]), vocab_size)

Out[2]:


[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
<NDArray 2x43 @cpu(0)>


The shape of the mini-batch we sample each time is (batch size, number of time steps). The following function transforms such a mini-batch into a list of matrices of shape (batch size, dictionary size) that can be fed into the network, one matrix per time step. That is, the input of time step $$t$$ is $$\boldsymbol{X}_t \in \mathbb{R}^{n \times d}$$, where $$n$$ is the batch size and $$d$$ is the number of inputs, i.e. the one-hot vector length (the dictionary size).

In [3]:

# This function is saved in the d2l package for future use
def to_onehot(X, size):
    return [nd.one_hot(x, size) for x in X.T]

X = nd.arange(10).reshape((2, 5))
inputs = to_onehot(X, vocab_size)
len(inputs), inputs[0].shape

Out[3]:

(5, (2, 43))


The code above generates 5 minibatches containing 2 vectors each. Since we have a total of 43 distinct symbols in “The Time Machine” we get 43-dimensional vectors.
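
To make the shape bookkeeping concrete without MXNet, here is a minimal pure-Python sketch of the same transformation. The helpers `one_hot` and `to_onehot_lists` are hypothetical stand-ins that use nested lists instead of NDArrays, and the vocabulary size of 43 is taken from the output above.

```python
def one_hot(index, size):
    """Return a list of `size` zeros with a 1.0 at position `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def to_onehot_lists(batch, size):
    """Turn a (batch_size, num_steps) list of indices into num_steps
    matrices, each of shape (batch_size, size)."""
    # zip(*batch) transposes the minibatch, so we iterate over time steps
    return [[one_hot(i, size) for i in step] for step in zip(*batch)]


# Same values as nd.arange(10).reshape((2, 5))
batch = [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]
steps = to_onehot_lists(batch, 43)
print(len(steps), len(steps[0]), len(steps[0][0]))  # 5 2 43
```

The first matrix holds the one-hot vectors of the first character of each of the two sequences, which is exactly what the network consumes at time step 1.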

## 7.5.2. Initializing the Model Parameters

Next, we initialize the model parameters. The number of hidden units num_hiddens is a tunable parameter.

In [4]:

num_inputs, num_hiddens, num_outputs = vocab_size, 512, vocab_size
ctx = d2l.try_gpu()
print('Using', ctx)

# Create the parameters of the model, initialize them and attach gradients
def get_params():
    def _one(shape):
        return nd.random.normal(scale=0.01, shape=shape, ctx=ctx)

    # Hidden layer parameters
    W_xh = _one((num_inputs, num_hiddens))
    W_hh = _one((num_hiddens, num_hiddens))
    b_h = nd.zeros(num_hiddens, ctx=ctx)
    # Output layer parameters
    W_hq = _one((num_hiddens, num_outputs))
    b_q = nd.zeros(num_outputs, ctx=ctx)
    params = [W_xh, W_hh, b_h, W_hq, b_q]
    for param in params:
        # Allocate memory for the gradient of each parameter
        param.attach_grad()
    return params

Using gpu(0)


## 7.5.3. Sequence Modeling

### 7.5.3.1. RNN Model

We implement this model based on the definition of an RNN. First, we need an init_rnn_state function to return the hidden state at initialization. It returns a tuple consisting of an NDArray with a value of 0 and a shape of (batch size, number of hidden units). Using tuples makes it easier to handle situations where the hidden state contains multiple NDArrays (e.g. when combining multiple layers in an RNN).

In [5]:

def init_rnn_state(batch_size, num_hiddens, ctx):
    return (nd.zeros(shape=(batch_size, num_hiddens), ctx=ctx), )


The following rnn function defines how to compute the hidden state and the output at each time step. The activation function is tanh. As described in the “Multilayer Perceptron” section, the mean of the tanh outputs is 0 when the inputs are distributed symmetrically around 0.
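
The zero-mean property is easy to check numerically. The sketch below is an illustration only: it evaluates tanh on a grid that is symmetric around 0 and, for contrast, does the same for the sigmoid function, which is not part of this section's model.

```python
import math

# Inputs placed symmetrically around zero: -3.0, -2.9, ..., 2.9, 3.0
xs = [i / 10 for i in range(-30, 31)]

tanh_mean = sum(math.tanh(x) for x in xs) / len(xs)
sigmoid_mean = sum(1 / (1 + math.exp(-x)) for x in xs) / len(xs)

# tanh is an odd function, so positive and negative outputs cancel;
# the sigmoid outputs are centered on 0.5 instead
print(tanh_mean, sigmoid_mean)
```

Keeping the hidden state centered around 0 matters here because the same state is fed back into the network at every time step.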

In [6]:

def rnn(inputs, state, params):
    # Both inputs and outputs are composed of num_steps matrices of the shape
    # (batch_size, vocab_size)
    W_xh, W_hh, b_h, W_hq, b_q = params
    H, = state
    outputs = []
    for X in inputs:
        H = nd.tanh(nd.dot(X, W_xh) + nd.dot(H, W_hh) + b_h)
        Y = nd.dot(H, W_hq) + b_q
        outputs.append(Y)
    return outputs, (H,)


Let’s run a simple test to check that the inputs and outputs have the expected form. In particular, we check the number of outputs, the output dimensions, and that the shape of the hidden state hasn’t changed.

In [7]:

state = init_rnn_state(X.shape[0], num_hiddens, ctx)
inputs = to_onehot(X.as_in_context(ctx), vocab_size)
params = get_params()
outputs, state_new = rnn(inputs, state, params)
len(outputs), outputs[0].shape, state_new[0].shape

Out[7]:

(5, (2, 43), (2, 512))
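
To see what the hidden state contributes, it can help to shrink the recurrence to scalars. The following sketch is a toy under stated assumptions: `rnn_step` and `run` are hypothetical helpers, and the weights are hand-picked rather than learned. It feeds two sequences that differ only in their first input.

```python
import math


def rnn_step(x, h, w_xh, w_hh, b_h, w_hq, b_q):
    """One time step of a scalar RNN: returns (output, new hidden state)."""
    h_new = math.tanh(x * w_xh + h * w_hh + b_h)
    y = h_new * w_hq + b_q
    return y, h_new


def run(sequence, h=0.0):
    """Apply rnn_step along a sequence, starting from hidden state h."""
    outputs = []
    for x in sequence:
        y, h = rnn_step(x, h, w_xh=0.5, w_hh=0.8, b_h=0.0, w_hq=1.0, b_q=0.0)
        outputs.append(y)
    return outputs, h


out_a, _ = run([1.0, 0.0])
out_b, _ = run([-1.0, 0.0])
# The second input is 0.0 in both cases, yet the second outputs differ
# because the hidden state remembers the first input
print(out_a[1], out_b[1])
```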


### 7.5.3.2. Prediction Function

The following function predicts the next num_chars characters based on the prefix (a string containing several characters). This function is a bit more complicated. In it, we set the recurrent neural unit rnn as a function parameter, so that this function can be reused in the other recurrent neural networks described in following sections.

In [8]:

# This function is saved in the d2l package for future use
def predict_rnn(prefix, num_chars, rnn, params, init_rnn_state,
                num_hiddens, vocab_size, ctx, idx_to_char, char_to_idx):
    state = init_rnn_state(1, num_hiddens, ctx)
    output = [char_to_idx[prefix[0]]]
    for t in range(num_chars + len(prefix) - 1):
        # The output of the previous time step is taken as the input of the
        # current time step.
        X = to_onehot(nd.array([output[-1]], ctx=ctx), vocab_size)
        # Calculate the output and update the hidden state
        (Y, state) = rnn(X, state, params)
        # The input to the next time step is the character in the prefix or
        # the current best predicted character
        if t < len(prefix) - 1:
            output.append(char_to_idx[prefix[t + 1]])
        else:
            # This is maximum likelihood decoding, not sampling
            output.append(int(Y[0].argmax(axis=1).asscalar()))
    return ''.join([idx_to_char[i] for i in output])


We test the predict_rnn function first. We generate 10 additional characters (regardless of the prefix length) based on the prefix “traveller”. Because the model parameters are random values, the prediction results are also random.

In [9]:

predict_rnn('traveller', 10, rnn, params, init_rnn_state, num_hiddens,
            vocab_size, ctx, idx_to_char, char_to_idx)

Out[9]:

'traveller,yx],8i-x]'
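
The comment in predict_rnn points out that picking the argmax is maximum likelihood decoding rather than sampling. The sketch below contrasts the two on a made-up vector of three output scores. The helpers `softmax`, `greedy_pick` and `sample_pick` are hypothetical and not part of the d2l package.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits):
    """Index of the largest score (what predict_rnn does)."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_pick(logits, rng):
    """Draw an index at random in proportion to its softmax probability."""
    probs = softmax(logits)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]
rng = random.Random(0)
greedy = [greedy_pick(logits) for _ in range(5)]
sampled = [sample_pick(logits, rng) for _ in range(1000)]
print(greedy)                                # always index 0
print([sampled.count(i) for i in range(3)])  # all indices appear
```

Greedy decoding is deterministic, which is one reason an under-trained model tends to fall into repetitive loops, whereas sampling trades some likelihood for variety.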


## 7.5.4. Gradient Clipping

When solving an optimization problem we take update steps for the weights $$\mathbf{w}$$ in the general direction of the negative gradient $$\mathbf{g}_t$$ on a minibatch, say $$\mathbf{w} - \eta \cdot \mathbf{g}_t$$. Let’s further assume that the objective is well behaved, i.e. it is Lipschitz continuous with constant $$L$$:

$|l(\mathbf{w}) - l(\mathbf{w}')| \leq L \|\mathbf{w} - \mathbf{w}'\|.$

In this case we can safely assume that if we update the weight vector by $$\eta \cdot \mathbf{g}_t$$ we will not observe a change by more than $$L \eta \|\mathbf{g}_t\|$$. This is both a curse and a blessing. A curse since it limits the speed with which we can make progress, a blessing since it limits the extent to which things can go wrong if we move in the wrong direction.
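
As a numeric sanity check of this bound, the sketch below uses the toy objective $$l(\mathbf{w}) = \|\mathbf{w}\|$$, which is Lipschitz continuous with $$L = 1$$. The objective, the step size and the random vectors are assumptions chosen for illustration; they are unrelated to the RNN loss.

```python
import math
import random


def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))


# l(w) = ||w|| is Lipschitz continuous with constant L = 1
L, eta = 1.0, 0.1
rng = random.Random(1)
worst_ratio = 0.0
for _ in range(1000):
    w = [rng.uniform(-1, 1) for _ in range(5)]
    g = [rng.uniform(-10, 10) for _ in range(5)]  # any step direction works
    w_new = [wi - eta * gi for wi, gi in zip(w, g)]
    change = abs(norm(w) - norm(w_new))
    bound = L * eta * norm(g)
    worst_ratio = max(worst_ratio, change / bound)

# The observed change never exceeds L * eta * ||g||
print(worst_ratio)
```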

Sometimes the gradients can be quite large and the optimization algorithm may fail to converge. We could address this by reducing the learning rate $$\eta$$ or by some other higher order trick. But what if we only rarely get large gradients? In this case such an approach may appear entirely unwarranted. One alternative is to clip the gradients by projecting them back to a ball of a given radius, say $$\theta$$ via

$\mathbf{g} \leftarrow \min\left(1, \frac{\theta}{\|\mathbf{g}\|}\right) \mathbf{g}.$

By doing so we know that the gradient norm never exceeds $$\theta$$ and that the updated gradient is entirely aligned with the original direction $$\mathbf{g}$$. Back to the case at hand: optimization in RNNs. One of the issues is that the gradients in an RNN may either explode or vanish. Consider the chain of matrix products involved in backpropagation. If the largest eigenvalue of the matrices is typically larger than $$1$$, then the product of many such matrices can be much larger than $$1$$. As a result, the aggregate gradient might explode. Gradient clipping provides a quick fix. While it doesn’t entirely solve the problem, it is one of the many techniques to alleviate it.
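
The effect of the largest eigenvalue can be illustrated with a toy calculation. The sketch below repeatedly multiplies a vector by one fixed 2×2 matrix, which is a simplifying assumption: in real backpropagation through time the matrices change from step to step and include the derivative of tanh.

```python
import math


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def norm_after(W, v, steps):
    """Norm of v after applying W `steps` times."""
    for _ in range(steps):
        v = matvec(W, v)
    return math.sqrt(sum(x * x for x in v))


v = [1.0, 1.0]
W_big = [[1.1, 0.1], [0.1, 1.1]]    # eigenvalues 1.2 and 1.0
W_small = [[0.8, 0.1], [0.1, 0.8]]  # eigenvalues 0.9 and 0.7

# 50 steps are enough for the norm to blow up or to shrink towards zero
print(norm_after(W_big, v, 50), norm_after(W_small, v, 50))
```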

In [10]:

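# This function is saved in the d2l package for future use
def grad_clipping(params, theta, ctx):
    # Compute the L2 norm of all gradients taken together
    norm = nd.array([0], ctx)
    for param in params:
        norm += (param.grad ** 2).sum()
    norm = norm.sqrt().asscalar()
    # Rescale in place only if the norm exceeds the radius theta
    if norm > theta:
        for param in params:
            param.grad[:] *= theta / norm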
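
The clipping rule can also be checked by hand without MXNet. The following pure-Python sketch uses a hypothetical helper `clip_gradients` on lists of floats and a made-up gradient whose joint norm is 13.

```python
import math


def clip_gradients(grads, theta):
    """Rescale a list of gradient vectors so that their joint L2 norm is at
    most theta. Returns the (possibly rescaled) gradients and the old norm."""
    norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if norm > theta:
        scale = theta / norm
        grads = [[g * scale for g in vec] for vec in grads]
    return grads, norm


grads = [[3.0, 4.0], [12.0]]  # joint norm is sqrt(9 + 16 + 144) = 13
clipped, old_norm = clip_gradients(grads, theta=1.0)
new_norm = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(old_norm, new_norm)
```

Note that the norm is computed over all parameters jointly, so every gradient is scaled by the same factor and the overall direction is preserved.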


## 7.5.5. Perplexity

One way of measuring how well a sequence model works is to check how surprising the text is. A good language model is able to predict with high accuracy what we will see next. Consider the following continuations of the phrase “It is raining”, as proposed by different language models:

1. It is raining outside
2. It is raining banana tree
3. It is raining piouw;kcj pwepoiut

In terms of quality, example 1 is clearly the best. The words are sensible and logically coherent. While it might not predict exactly which word follows (“in San Francisco” and “in winter” would have been perfectly reasonable extensions), the model is able to capture which kind of word follows. Example 2 is considerably worse, producing a nonsensical and borderline ungrammatical extension. Nonetheless, at least the model has learned how to spell words and some degree of correlation between words. Lastly, example 3 indicates a poorly trained model that doesn’t fit the data.

One way of measuring the quality of the model is to compute $$p(w)$$, i.e. the likelihood of the sequence. Unfortunately this is a number that is hard to understand and difficult to compare. After all, shorter sequences are much more likely than long ones, hence evaluating the model on Tolstoy’s magnum opus ‘War and Peace’ will inevitably produce a much smaller likelihood than, say, on Saint-Exupery’s novella ‘The Little Prince’. What is missing is the equivalent of an average.
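
A small numeric sketch shows how strongly the raw likelihood depends on length. It assumes a hypothetical model that assigns probability 0.5 to every observed token, once on a text of 10 tokens and once on a text of 1000 tokens.

```python
import math


def sequence_likelihood(probs):
    """Product of the per-token conditional probabilities."""
    result = 1.0
    for p in probs:
        result *= p
    return result


def average_log_prob(probs):
    """Log-likelihood divided by the number of tokens."""
    return sum(math.log(p) for p in probs) / len(probs)


short_text = [0.5] * 10    # each token predicted with p = 0.5
long_text = [0.5] * 1000   # same per-token quality, longer text

# The likelihoods differ by hundreds of orders of magnitude ...
print(sequence_likelihood(short_text), sequence_likelihood(long_text))
# ... while the per-token averages agree
print(average_log_prob(short_text), average_log_prob(long_text))
```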

Information theory comes in handy here. If we want to compress text we can ask about estimating the next symbol given the current set of symbols. A lower bound on the number of bits is given by $$-\log_2 p(w_t|w_{t-1}, \ldots w_1)$$. A good language model should allow us to predict the next word quite accurately and thus it should allow us to spend very few bits on compressing the sequence. One way of measuring this is the average number of bits that we need to spend:

$\frac{1}{n} \sum_{t=1}^n -\log p(w_t|w_{t-1}, \ldots w_1) = -\frac{1}{|w|} \log p(w)$

This makes the performance on documents of different lengths comparable. For historical reasons scientists in natural language processing prefer to use a quantity called perplexity rather than bitrate. In a nutshell it is the exponential of the above:

$\mathrm{PPL} := \exp\left(-\frac{1}{n} \sum_{t=1}^n \log p(w_t|w_{t-1}, \ldots w_1)\right)$

It can be best understood as the geometric mean of the number of real choices that we have when deciding which word to pick next, where a token predicted with probability $$p$$ counts as $$1/p$$ choices. Note that perplexity naturally generalizes the notion of the cross entropy loss defined when we introduced Softmax Regression. That is, for a single symbol both definitions are identical bar the fact that one is the exponential of the other. Let’s look at a number of cases:

• In the best case scenario, the model always estimates the probability of the next symbol as $$1$$. In this case the perplexity of the model is $$1$$.
• In the worst case scenario, the model always predicts the probability of the label category as 0. In this situation, the perplexity is infinite.
• At the baseline, the model predicts a uniform distribution over all tokens. In this case the perplexity equals the size of the dictionary vocab_size. In fact, if we were to store the sequence without any compression this would be the best we could do to encode it. Hence this provides a nontrivial upper bound that any useful model must beat.
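
The three cases above can be reproduced with a few lines of Python. This is a minimal sketch: the helper `perplexity` is hypothetical and takes the probabilities that a model assigned to the tokens that actually occurred.

```python
import math


def perplexity(probs):
    """exp of the average negative log-probability of the observed tokens."""
    if any(p == 0.0 for p in probs):
        return math.inf  # log(0) is -infinity, so the perplexity is unbounded
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))


vocab_size = 43
best = perplexity([1.0] * 20)                  # perfect predictions
uniform = perplexity([1.0 / vocab_size] * 20)  # uniform guessing
worst = perplexity([0.9, 0.0, 0.8])            # one observed token got p = 0
print(best, uniform, worst)
```

A single token with probability 0 is enough to make the perplexity infinite, which is why language models should never assign exactly zero probability to a symbol of the vocabulary.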

## 7.5.6. Training the Model

Training a sequence model proceeds quite differently from the code in previous sections. In particular we need to take care of the following changes because the tokens appear in order:

1. We use perplexity to evaluate the model. This ensures that different tests are comparable.
2. We clip the gradient before updating the model parameters. This ensures that the model doesn’t diverge even when gradients blow up at some point during the training process (effectively it reduces the stepsize automatically).
3. Different sampling methods for sequential data (independent sampling and sequential partitioning) will result in differences in the initialization of hidden states. We discussed these issues in detail when we covered data processing.
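
The third point hinges on how consecutive minibatches relate to each other. The sketch below illustrates sequential partitioning in pure Python. `consecutive_batches` is a hypothetical helper written for this illustration and is not necessarily how d2l.data_iter_consecutive is implemented.

```python
def consecutive_batches(corpus, batch_size, num_steps):
    """Yield (X, Y) minibatches whose rows continue across minibatches."""
    # Split the corpus into batch_size long rows
    row_len = len(corpus) // batch_size
    rows = [corpus[i * row_len:(i + 1) * row_len] for i in range(batch_size)]
    # Reserve one element per row so that every input has a label
    num_batches = (row_len - 1) // num_steps
    for b in range(num_batches):
        start = b * num_steps
        X = [row[start:start + num_steps] for row in rows]
        Y = [row[start + 1:start + num_steps + 1] for row in rows]
        yield X, Y


corpus = list(range(30))
batches = list(consecutive_batches(corpus, batch_size=2, num_steps=3))
for X, Y in batches[:2]:
    print(X, Y)
```

Row i of each minibatch picks up exactly where row i of the previous minibatch ended, so the final hidden state of one minibatch is a meaningful initial state for the next. With random sampling no such relation exists, and the state has to be reset for every minibatch.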

### 7.5.6.1. Optimization Loop

To allow for more flexibility the call signature and the code are slightly longer. This will allow us to replace the various pieces by a Gluon implementation subsequently without the need to change the training logic.

In [11]:

# This function is saved in the d2l package for future use
def train_and_predict_rnn(rnn, get_params, init_rnn_state, num_hiddens,
                          vocab_size, ctx, corpus_indices, idx_to_char,
                          char_to_idx, is_random_iter, num_epochs, num_steps,
                          lr, clipping_theta, batch_size, pred_period,
                          pred_len, prefixes):
    if is_random_iter:
        data_iter_fn = d2l.data_iter_random
    else:
        data_iter_fn = d2l.data_iter_consecutive
    params = get_params()
    loss = gloss.SoftmaxCrossEntropyLoss()

    for epoch in range(num_epochs):
        if not is_random_iter:
            # If adjacent sampling is used, the hidden state is initialized
            # at the beginning of the epoch
            state = init_rnn_state(batch_size, num_hiddens, ctx)
        l_sum, n, start = 0.0, 0, time.time()
        data_iter = data_iter_fn(corpus_indices, batch_size, num_steps, ctx)
        for X, Y in data_iter:
            if is_random_iter:
                # If random sampling is used, the hidden state is initialized
                # before each mini-batch update
                state = init_rnn_state(batch_size, num_hiddens, ctx)
            else:
                # Otherwise, the detach function needs to be used to separate
                # the hidden state from the computational graph to avoid
                # backpropagation beyond the current sample. detach returns
                # a new NDArray, so we rebind state to the detached copies
                state = tuple(s.detach() for s in state)
            # Record the forward pass so that gradients can be computed
            with autograd.record():
                inputs = to_onehot(X, vocab_size)
                # outputs is num_steps terms of shape (batch_size, vocab_size)
                (outputs, state) = rnn(inputs, state, params)
                # After stitching it is (num_steps * batch_size, vocab_size)
                outputs = nd.concat(*outputs, dim=0)
                # The shape of Y is (batch_size, num_steps), and then becomes
                # a vector with a length of batch * num_steps after
                # transposition. This gives it a one-to-one correspondence
                # with output rows
                y = Y.T.reshape((-1,))
                # Average classification error via cross entropy loss
                l = loss(outputs, y).mean()
            l.backward()
            # Clip the gradient before the parameter update
            grad_clipping(params, clipping_theta, ctx)
            # Since the error is the mean, no need to average gradients here
            d2l.sgd(params, lr, 1)
            l_sum += l.asscalar() * y.size
            n += y.size

        if (epoch + 1) % pred_period == 0:
            print('epoch %d, perplexity %f, time %.2f sec' % (
                epoch + 1, math.exp(l_sum / n), time.time() - start))
            for prefix in prefixes:
                print(' -', predict_rnn(
                    prefix, pred_len, rnn, params, init_rnn_state,
                    num_hiddens, vocab_size, ctx, idx_to_char, char_to_idx))
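
The transposition of the labels in the loop above is easy to get wrong, so here is a pure-Python sketch of the bookkeeping. It tags every label with its batch index b and time step t (made-up string tags, lists instead of NDArrays) and compares the order of the flattened labels with the order of the concatenated output rows.

```python
num_steps, batch_size = 3, 2

# Labels Y with shape (batch_size, num_steps); entry (b, t) is tagged "b{b}t{t}"
Y = [[f"b{b}t{t}" for t in range(num_steps)] for b in range(batch_size)]

# The RNN returns one (batch_size, vocab_size) matrix per time step;
# concatenating them along dim 0 stacks the rows in time-major order
output_rows = [f"b{b}t{t}" for t in range(num_steps) for b in range(batch_size)]

# Equivalent of Y.T.reshape((-1,)): transpose, then flatten row by row
y_flat = [label for step in zip(*Y) for label in step]
# Flattening without the transpose would be batch-major instead
y_wrong = [label for row in Y for label in row]

print(y_flat == output_rows, y_wrong == output_rows)  # True False
```

Without the transpose the loss would compare most predictions with labels from a different time step and sequence.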


### 7.5.6.2. Experiments with a Sequence Model

Now we can train the model. First, we need to set the model hyper-parameters. To allow for some meaningful amount of context we set the sequence length to 64. To get some intuition of how well the model works, we will have it generate 50 characters every 50 epochs of the training phase. In particular, we will see how training with random sampling versus sequential partitioning affects the performance of the model.

In [12]:

num_epochs, num_steps, batch_size, lr, clipping_theta = 500, 64, 32, 1e2, 1e-2
pred_period, pred_len, prefixes = 50, 50, ['traveller', 'time traveller']


Let’s use random sampling to train the model and produce some text.

In [13]:

train_and_predict_rnn(rnn, get_params, init_rnn_state, num_hiddens,
                      vocab_size, ctx, corpus_indices, idx_to_char,
                      char_to_idx, True, num_epochs, num_steps, lr,
                      clipping_theta, batch_size, pred_period, pred_len,
                      prefixes)

epoch 50, perplexity 8.949173, time 0.22 sec
- travellere the the the the the the the the the the the the
- time travellere the the the the the the the the the the the the
epoch 100, perplexity 7.488191, time 0.22 sec
- traveller and an the the the the the the the the the the th
- time traveller and an the the the the the the the the the the th
epoch 150, perplexity 5.495706, time 0.22 sec
- traveller some time trough this ghe thite time trove the ti
- time traveller some time trough this ghe thite time trove the ti
epoch 200, perplexity 3.749710, time 0.22 sec
- traveller.  'ith, and that we mover gootima travel the ofic
- time traveller propeedimension d and why _an?'  'int some thall
epoch 250, perplexity 2.429273, time 0.22 sec
- traveller.  'it would be recardallyo. teecuread the pryshec
- time traveller.  'it would be recardallyo. teecurest as he hedin
epoch 300, perplexity 1.775355, time 0.22 sec
- traveller smonee of the trimenssone phest wo doun thal suph
- time traveller proceeded the time traveller.  'you can show blac
epoch 350, perplexity 1.538032, time 0.22 sec
- traveller.  'it's seaig to the others. but some philosophic
- time traveller cemy incagon, ws ighis or soad the graved that as
epoch 400, perplexity 1.465148, time 0.22 sec
- traveller smiled round at us. then, still smiling faintly,
- time traveller (for so it wallen ofle befremer forethe brespesth
epoch 450, perplexity 1.334505, time 0.22 sec
- traveller smiled. 'are you sure we can move freely in space
- time traveller, with a slig tofic be the this mectime. for insta
epoch 500, perplexity 1.358781, time 0.22 sec
- traveller smiled. 'ard ye_ usmy.'  'one might get one's gre
- time traveller smiled. 'ard ye_ usmy.'  'one might get one's gre


Even though our model was rather primitive, it is nonetheless able to produce text that resembles language. In particular it learns some notion of quotations, punctuation and a basic sense of grammar, at least for frequent words. Now let’s compare this with sequential partitioning.

In [14]:

train_and_predict_rnn(rnn, get_params, init_rnn_state, num_hiddens,
                      vocab_size, ctx, corpus_indices, idx_to_char,
                      char_to_idx, False, num_epochs, num_steps, lr,
                      clipping_theta, batch_size, pred_period, pred_len,
                      prefixes)

epoch 50, perplexity 8.939165, time 0.22 sec
- travellered and and and and and and and and and and and and
- time travellered and and and and and and and and and and and and
epoch 100, perplexity 6.947001, time 0.22 sec
- traveller ard and have tha k and hour at are the trecent on
- time traveller ard and have tha k and hour at are the trecent on
epoch 150, perplexity 4.607737, time 0.22 sec
- traveller.  'o at ment an the greand for and the grimens on
- time traveller.  'icully the merand the perthece as extere the g
epoch 200, perplexity 2.477860, time 0.22 sec
- traveller.  'it't 'esto in the fict, and the ones ofusmidic
- time traveller.  'it' sacavime the gettrecentinted an the fistor
epoch 250, perplexity 1.671189, time 0.22 sec
- traveller.  'it woumarist efoul bat of space exceptis, cown
- time traveller.  'it w aly could cave long to hat at al ugreal a
epoch 300, perplexity 1.259885, time 0.22 sec
- traveller.  'you cug seow they mown hat te the parsed the t
- time traveller.  'you can show black is white by argument,' said
epoch 350, perplexity 1.176128, time 0.22 sec
- traveller.  'you can show black is white by argument,' said
- time traveller caly in ave in the fiment anetofrem the procentig
epoch 400, perplexity 1.146039, time 0.22 sec
- traveller came back, and filby's anecdote collapsed.  the t
- time traveller came back, and filby's anecdote collapsed.  the t
epoch 450, perplexity 1.147105, time 0.22 sec
- traveller. 'but now you begin to see the ofjections of spac
- time traveller cfole ong whis reace, hine in ave dine sove ficel
epoch 500, perplexity 1.095567, time 0.22 sec
- traveller. 'but now you begin to see the object of my inves
- time traveller (for so it will be convenient to speak of him) wa


The perplexity is quite a bit lower. In fact, both models are pretty close to $$1$$. This means that if we were to compress the text using this simple character-based language model we would need less than 1 bit per character to encode a symbol. In the following we will see how to improve significantly on the current model and how to make it faster and easier to implement.
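
The bit count follows directly from the perplexity, since the base-2 logarithm of the perplexity is the average number of bits per character. The sketch below applies this conversion to the two final perplexities printed in the training logs above. The helper name `bits_per_character` is hypothetical, and the numbers are training-set values, so they say nothing about unseen text.

```python
import math


def bits_per_character(perplexity):
    """Convert a perplexity (exp of the mean natural-log loss) to bits."""
    return math.log2(perplexity)


final_perplexity = {
    "random sampling": 1.358781,          # epoch 500, first run
    "sequential partitioning": 1.095567,  # epoch 500, second run
}
for name, ppl in final_perplexity.items():
    print(name, round(bits_per_character(ppl), 3))
```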

## 7.5.7. Summary

• Sequence models need state initialization for training.
• Under sequential partitioning, detach the hidden state between consecutive minibatches so that automatic differentiation does not propagate gradients beyond the current sample.
• A simple RNN language model consists of an encoder, an RNN model and a decoder.
• Perplexity calibrates model performance across variable sequence length. It is the exponentiated average of the cross-entropy loss.
• Sequential partitioning typically leads to better models.

## 7.5.8. Exercises

1. Show that one-hot encoding is equivalent to picking a different embedding for each object.
2. Adjust the hyperparameters to improve the perplexity.
   • How low can you go? Adjust embeddings, hidden units, learning rate, etc.
   • How well will it work on other books by H. G. Wells, e.g. The War of the Worlds?
3. Run the code in this section without clipping the gradient. What happens?
4. Set the pred_period variable to 1 to observe how the under-trained model (high perplexity) writes text. What can you learn from this?
5. Change sequential partitioning (adjacent sampling) so that it does not separate hidden states from the computational graph. Does the running time change? How about the accuracy?
6. Replace the activation function used in this section with ReLU and repeat the experiments in this section.
7. Prove that the perplexity is the inverse of the geometric mean of the conditional word probabilities.