# 9.3. Language Models¶ Open the notebook in SageMaker Studio Lab

In Section 9.2, we see how to map text sequences into
tokens, where these tokens can be viewed as a sequence of discrete
observations, such as words or characters. Assume that the tokens in a
text sequence of length \(T\) are in turn
\(x_1, x_2, \ldots, x_T\). The goal of *language models* is to
estimate the joint probability of the whole sequence:

where statistical tools in Section 9.1 can be applied.

Language models are incredibly useful. For instance, an ideal language
model would be able to generate natural text just on its own, simply by
drawing one token at a time
\(x_t \sim P(x_t \mid x_{t-1}, \ldots, x_1)\). Quite unlike the
monkey using a typewriter, all text emerging from such a model would
pass as natural language, e.g., English text. Furthermore, it would be
sufficient for generating a meaningful dialog, simply by conditioning
the text on previous dialog fragments. Clearly we are still very far
from designing such a system, since it would need to *understand* the
text rather than just generate grammatically sensible content.

Nonetheless, language models are of great service even in their limited form. For instance, the phrases “to recognize speech” and “to wreck a nice beach” sound very similar. This can cause ambiguity in speech recognition, which is easily resolved through a language model that rejects the second translation as outlandish. Likewise, in a document summarization algorithm it is worthwhile knowing that “dog bites man” is much more frequent than “man bites dog”, or that “I want to eat grandma” is a rather disturbing statement, whereas “I want to eat, grandma” is much more benign.

## 9.3.1. Learning Language Models¶

The obvious question is how we should model a document, or even a sequence of tokens. Suppose that we tokenize text data at the word level. Let’s start by applying basic probability rules:

For example, the probability of a text sequence containing four words would be given as:

### 9.3.1.1. Markov Models and \(n\)-grams¶

Among those sequence model analysis in Section 9.1, let’s apply Markov models to language modeling. A distribution over sequences satisfies the Markov property of first order if \(P(x_{t+1} \mid x_t, \ldots, x_1) = P(x_{t+1} \mid x_t)\). Higher orders correspond to longer dependencies. This leads to a number of approximations that we could apply to model a sequence:

The probability formulae that involve one, two, and three variables are
typically referred to as *unigram*, *bigram*, and *trigram* models,
respectively. In order to compute the language model, we need to
calculate the probability of words and the conditional probability of a
word given the previous few words. Note that such probabilities are
language model parameters.

### 9.3.1.2. Word Frequency¶

Here, we assume that the training dataset is a large text corpus, such as all Wikipedia entries, Project Gutenberg, and all text posted on the Web. The probability of words can be calculated from the relative word frequency of a given word in the training dataset. For example, the estimate \(\hat{P}(\text{deep})\) can be calculated as the probability of any sentence starting with the word “deep”. A slightly less accurate approach would be to count all occurrences of the word “deep” and divide it by the total number of words in the corpus. This works fairly well, particularly for frequent words. Moving on, we could attempt to estimate

where \(n(x)\) and \(n(x, x')\) are the number of occurrences of singletons and consecutive word pairs, respectively. Unfortunately, estimating the probability of a word pair is somewhat more difficult, since the occurrences of “deep learning” are a lot less frequent. In particular, for some unusual word combinations it may be tricky to find enough occurrences to get accurate estimates. As suggested by the empirical results in Section 9.2.5, things take a turn for the worse for three-word combinations and beyond. There will be many plausible three-word combinations that we likely will not see in our dataset. Unless we provide some solution to assign such word combinations nonzero count, we will not be able to use them in a language model. If the dataset is small or if the words are very rare, we might not find even a single one of them.

### 9.3.1.3. Laplace Smoothing¶

A common strategy is to perform some form of *Laplace smoothing*. The
solution is to add a small constant to all counts. Denote by \(n\)
the total number of words in the training set and \(m\) the number
of unique words. This solution helps with singletons, e.g., via

Here \(\epsilon_1,\epsilon_2\), and \(\epsilon_3\) are hyperparameters. Take \(\epsilon_1\) as an example: when \(\epsilon_1 = 0\), no smoothing is applied; when \(\epsilon_1\) approaches positive infinity, \(\hat{P}(x)\) approaches the uniform probability \(1/m\). The above is a rather primitive variant of what other techniques can accomplish [Wood et al., 2011].

Unfortunately, models like this get unwieldy rather quickly for the following reasons. First, as discussed in Section 9.2.5, many \(n\)-grams occur very rarely, making Laplace smoothing rather unsuitable for language modeling. Second, we need to store all counts. Third, this entirely ignores the meaning of the words. For instance, “cat” and “feline” should occur in related contexts. It is quite difficult to adjust such models to additional contexts, whereas, deep learning based language models are well suited to take this into account. Last, long word sequences are almost certain to be novel, hence a model that simply counts the frequency of previously seen word sequences is bound to perform poorly there. Therefore, we focus on using neural networks for language modeling in the rest of the chapter.

## 9.3.2. Perplexity¶

Next, let’s discuss about how to measure the language model quality, which will be used to evaluate our models in the subsequent sections. One way is to check how surprising the text is. A good language model is able to predict with high-accuracy tokens that what we will see next. Consider the following continuations of the phrase “It is raining”, as proposed by different language models:

“It is raining outside”

“It is raining banana tree”

“It is raining piouw;kcj pwepoiut”

In terms of quality, example 1 is clearly the best. The words are sensible and logically coherent. While it might not quite accurately reflect which word follows semantically (“in San Francisco” and “in winter” would have been perfectly reasonable extensions), the model is able to capture which kind of word follows. Example 2 is considerably worse by producing a nonsensical extension. Nonetheless, at least the model has learned how to spell words and some degree of correlation between words. Last, example 3 indicates a poorly trained model that does not fit data properly.

We might measure the quality of the model by computing the likelihood of
the sequence. Unfortunately this is a number that is hard to understand
and difficult to compare. After all, shorter sequences are much more
likely to occur than the longer ones, hence evaluating the model on
Tolstoy’s magnum opus *War and Peace* will inevitably produce a much
smaller likelihood than, say, on Saint-Exupery’s novella *The Little
Prince*. What is missing is the equivalent of an average.

Information theory comes handy here. We have defined entropy, surprisal, and cross-entropy when we introduced the softmax regression (Section 4.1.3). If we want to compress text, we can ask about predicting the next token given the current set of tokens. A better language model should allow us to predict the next token more accurately. Thus, it should allow us to spend fewer bits in compressing the sequence. So we can measure it by the cross-entropy loss averaged over all the \(n\) tokens of a sequence:

where \(P\) is given by a language model and \(x_t\) is the
actual token observed at time step \(t\) from the sequence. This
makes the performance on documents of different lengths comparable. For
historical reasons, scientists in natural language processing prefer to
use a quantity called *perplexity*. In a nutshell, it is the exponential
of (9.3.7):

Perplexity can be best understood as the geometric mean of the number of real choices that we have when deciding which token to pick next. Let’s look at a number of cases:

In the best case scenario, the model always perfectly estimates the probability of the target token as 1. In this case the perplexity of the model is 1.

In the worst case scenario, the model always predicts the probability of the target token as 0. In this situation, the perplexity is positive infinity.

At the baseline, the model predicts a uniform distribution over all the available tokens of the vocabulary. In this case, the perplexity equals the number of unique tokens of the vocabulary. In fact, if we were to store the sequence without any compression, this would be the best we could do to encode it. Hence, this provides a nontrivial upper bound that any useful model must beat.

```
import torch
from d2l import torch as d2l
```

```
from mxnet import np, npx
from d2l import mxnet as d2l
npx.set_np()
```

```
import tensorflow as tf
from d2l import tensorflow as d2l
```

## 9.3.3. Partitioning Sequences¶

We will design language models using neural networks and use perplexity to evaluate how good the model is at predicting the next token given the current set of tokens in text sequences. Before introducing the model, let’s assume that it processes a minibatch of sequences with predefined length at a time. Now the question is how to read minibatches of input sequences and target sequences at random.

Suppose that the dataset takes the form of a sequence of \(T\) token
indices in `corpus`

. We will partition it into subsequences, where
each subsequence has \(n\) tokens (time steps). To iterate over
(almost) all the tokens of the entire dataset for each epoch and obtain
all possible length-\(n\) subsequences, we can introduce
randomness. More concretely, at the beginning of each epoch, discard the
first \(d\) tokens, where \(d\in [0,n)\) is uniformly sampled at
random. The rest of the sequence is then partitioned into
\(m=\lfloor (T-d)/n \rfloor\) subsequences. Denote by
\(\mathbf x_t = [x_t, \ldots, x_{t+n-1}]\) the length-\(n\)
subsequence starting from token \(x_t\) at time step \(t\). The
resulting \(m\) partitioned subsequences are
\(\mathbf x_d, \mathbf x_{d+n}, \ldots, \mathbf x_{d+n(m-1)}.\) Each
subsequence will be used as an input sequence into the language model.

For language modeling, the goal is to predict the next token based on what tokens we have seen so far, hence the targets (labels) are the original sequence, shifted by one token. The target sequence for any input sequence \(\mathbf x_t\) is \(\mathbf x_{t+1}\) with length \(n\).

Fig. 9.3.1 shows an example of obtaining 5 pairs of input sequences and target sequences with \(n=5\) and \(d=2\).

```
@d2l.add_to_class(d2l.TimeMachine) #@save
def __init__(self, batch_size, num_steps, num_train=10000, num_val=5000):
super(d2l.TimeMachine, self).__init__()
self.save_hyperparameters()
corpus, self.vocab = self.build(self._download())
array = torch.tensor([corpus[i:i+num_steps+1]
for i in range(0, len(corpus)-num_steps-1)])
self.X, self.Y = array[:,:-1], array[:,1:]
```

```
@d2l.add_to_class(d2l.TimeMachine) #@save
def __init__(self, batch_size, num_steps, num_train=10000, num_val=5000):
super(d2l.TimeMachine, self).__init__()
self.save_hyperparameters()
corpus, self.vocab = self.build(self._download())
array = np.array([corpus[i:i+num_steps+1]
for i in range(0, len(corpus)-num_steps-1)])
self.X, self.Y = array[:,:-1], array[:,1:]
```

```
@d2l.add_to_class(d2l.TimeMachine) #@save
def __init__(self, batch_size, num_steps, num_train=10000, num_val=5000):
super(d2l.TimeMachine, self).__init__()
self.save_hyperparameters()
corpus, self.vocab = self.build(self._download())
array = tf.constant([corpus[i:i+num_steps+1]
for i in range(0, len(corpus)-num_steps-1)])
self.X, self.Y = array[:,:-1], array[:,1:]
```

To train language models, we will randomly sample pairs of input
sequences and target sequences in minibatches. The following data loader
randomly generates a minibatch from the dataset each time. The argument
`batch_size`

specifies the number of subsequence examples (`self.b`

)
in each minibatch and `num_steps`

is the subsequence length in tokens
(`self.n`

).

```
@d2l.add_to_class(d2l.TimeMachine) #@save
def get_dataloader(self, train):
idx = slice(0, self.num_train) if train else slice(
self.num_train, self.num_train + self.num_val)
return self.get_tensorloader([self.X, self.Y], train, idx)
```

As we can see in the following, a minibatch of target sequences can be obtained by shifting the input sequences by one token.

```
data = d2l.TimeMachine(batch_size=2, num_steps=10)
for X, Y in data.train_dataloader():
print('X:', X, '\nY:', Y)
break
```

```
X: tensor([[20, 0, 21, 9, 6, 19, 6, 0, 15, 16],
[ 3, 6, 10, 15, 8, 0, 24, 9, 10, 4]])
Y: tensor([[ 0, 21, 9, 6, 19, 6, 0, 15, 16, 21],
[ 6, 10, 15, 8, 0, 24, 9, 10, 4, 9]])
```

```
data = d2l.TimeMachine(batch_size=2, num_steps=10)
for X, Y in data.train_dataloader():
print('X:', X, '\nY:', Y)
break
```

```
X: [[ 9. 0. 3. 19. 6. 2. 5. 21. 9. 0.]
[ 0. 9. 6. 13. 5. 0. 10. 15. 0. 9.]]
Y: [[ 0. 3. 19. 6. 2. 5. 21. 9. 0. 2.]
[ 9. 6. 13. 5. 0. 10. 15. 0. 9. 10.]]
```

```
data = d2l.TimeMachine(batch_size=2, num_steps=10)
for X, Y in data.train_dataloader():
print('X:', X, '\nY:', Y)
break
```

```
X: tf.Tensor(
[[19 13 26 0 24 9 26 0 15 16]
[21 9 0 16 7 0 21 10 14 6]], shape=(2, 10), dtype=int32)
Y: tf.Tensor(
[[13 26 0 24 9 26 0 15 16 21]
[ 9 0 16 7 0 21 10 14 6 0]], shape=(2, 10), dtype=int32)
```

## 9.3.4. Summary¶

Language models estimate the joint probability of a text sequence.

\(n\)-grams provide a convenient model for dealing with long sequences by truncating the dependence.

There is a lot of structure but not enough frequency to deal with infrequent word combinations efficiently via Laplace smoothing.

To train language models, we can randomly sample pairs of input sequences and target sequences in minibatches.

## 9.3.5. Exercises¶

Suppose there are \(100,000\) words in the training dataset. How much word frequency and multi-word adjacent frequency does a four-gram need to store?

How would you model a dialogue?

What other methods can you think of for reading long sequence data?

Consider our method for discarding a uniformly random number of the first few tokens at the beginning of each epoch.

Does it really lead to a perfectly uniform distribution over the sequences on the document?

What would you have to do to make things even more uniform?

If we want a sequence example to be a complete sentence, what kind of problem does this introduce in minibatch sampling? How can we fix the problem?