9.3. Language Models
Open the notebook in Colab
Open the notebook in Colab
Open the notebook in Colab
Open the notebook in SageMaker Studio Lab

In Section 9.2, we see how to map text sequences into tokens, where these tokens can be viewed as a sequence of discrete observations, such as words or characters. Assume that the tokens in a text sequence of length \(T\) are in turn \(x_1, x_2, \ldots, x_T\). The goal of language models is to estimate the joint probability of the whole sequence:

(9.3.1)\[P(x_1, x_2, \ldots, x_T),\]

where statistical tools in Section 9.1 can be applied.

Language models are incredibly useful. For instance, an ideal language model would be able to generate natural text just on its own, simply by drawing one token at a time \(x_t \sim P(x_t \mid x_{t-1}, \ldots, x_1)\). Quite unlike the monkey using a typewriter, all text emerging from such a model would pass as natural language, e.g., English text. Furthermore, it would be sufficient for generating a meaningful dialog, simply by conditioning the text on previous dialog fragments. Clearly we are still very far from designing such a system, since it would need to understand the text rather than just generate grammatically sensible content.

Nonetheless, language models are of great service even in their limited form. For instance, the phrases “to recognize speech” and “to wreck a nice beach” sound very similar. This can cause ambiguity in speech recognition, which is easily resolved through a language model that rejects the second translation as outlandish. Likewise, in a document summarization algorithm it is worthwhile knowing that “dog bites man” is much more frequent than “man bites dog”, or that “I want to eat grandma” is a rather disturbing statement, whereas “I want to eat, grandma” is much more benign.

9.3.1. Learning Language Models

The obvious question is how we should model a document, or even a sequence of tokens. Suppose that we tokenize text data at the word level. Let’s start by applying basic probability rules:

(9.3.2)\[P(x_1, x_2, \ldots, x_T) = \prod_{t=1}^T P(x_t \mid x_1, \ldots, x_{t-1}).\]

For example, the probability of a text sequence containing four words would be given as:

(9.3.3)\[P(\text{deep}, \text{learning}, \text{is}, \text{fun}) = P(\text{deep}) P(\text{learning} \mid \text{deep}) P(\text{is} \mid \text{deep}, \text{learning}) P(\text{fun} \mid \text{deep}, \text{learning}, \text{is}).\]

9.3.1.1. Markov Models and \(n\)-grams

Among those sequence model analysis in Section 9.1, let’s apply Markov models to language modeling. A distribution over sequences satisfies the Markov property of first order if \(P(x_{t+1} \mid x_t, \ldots, x_1) = P(x_{t+1} \mid x_t)\). Higher orders correspond to longer dependencies. This leads to a number of approximations that we could apply to model a sequence:

(9.3.4)\[\begin{split}\begin{aligned} P(x_1, x_2, x_3, x_4) &= P(x_1) P(x_2) P(x_3) P(x_4),\\ P(x_1, x_2, x_3, x_4) &= P(x_1) P(x_2 \mid x_1) P(x_3 \mid x_2) P(x_4 \mid x_3),\\ P(x_1, x_2, x_3, x_4) &= P(x_1) P(x_2 \mid x_1) P(x_3 \mid x_1, x_2) P(x_4 \mid x_2, x_3). \end{aligned}\end{split}\]

The probability formulae that involve one, two, and three variables are typically referred to as unigram, bigram, and trigram models, respectively. In order to compute the language model, we need to calculate the probability of words and the conditional probability of a word given the previous few words. Note that such probabilities are language model parameters.

9.3.1.2. Word Frequency

Here, we assume that the training dataset is a large text corpus, such as all Wikipedia entries, Project Gutenberg, and all text posted on the Web. The probability of words can be calculated from the relative word frequency of a given word in the training dataset. For example, the estimate \(\hat{P}(\text{deep})\) can be calculated as the probability of any sentence starting with the word “deep”. A slightly less accurate approach would be to count all occurrences of the word “deep” and divide it by the total number of words in the corpus. This works fairly well, particularly for frequent words. Moving on, we could attempt to estimate

(9.3.5)\[\hat{P}(\text{learning} \mid \text{deep}) = \frac{n(\text{deep, learning})}{n(\text{deep})},\]

where \(n(x)\) and \(n(x, x')\) are the number of occurrences of singletons and consecutive word pairs, respectively. Unfortunately, estimating the probability of a word pair is somewhat more difficult, since the occurrences of “deep learning” are a lot less frequent. In particular, for some unusual word combinations it may be tricky to find enough occurrences to get accurate estimates. As suggested by the empirical results in Section 9.2.5, things take a turn for the worse for three-word combinations and beyond. There will be many plausible three-word combinations that we likely will not see in our dataset. Unless we provide some solution to assign such word combinations nonzero count, we will not be able to use them in a language model. If the dataset is small or if the words are very rare, we might not find even a single one of them.

9.3.1.3. Laplace Smoothing

A common strategy is to perform some form of Laplace smoothing. The solution is to add a small constant to all counts. Denote by \(n\) the total number of words in the training set and \(m\) the number of unique words. This solution helps with singletons, e.g., via

(9.3.6)\[\begin{split}\begin{aligned} \hat{P}(x) & = \frac{n(x) + \epsilon_1/m}{n + \epsilon_1}, \\ \hat{P}(x' \mid x) & = \frac{n(x, x') + \epsilon_2 \hat{P}(x')}{n(x) + \epsilon_2}, \\ \hat{P}(x'' \mid x,x') & = \frac{n(x, x',x'') + \epsilon_3 \hat{P}(x'')}{n(x, x') + \epsilon_3}. \end{aligned}\end{split}\]

Here \(\epsilon_1,\epsilon_2\), and \(\epsilon_3\) are hyperparameters. Take \(\epsilon_1\) as an example: when \(\epsilon_1 = 0\), no smoothing is applied; when \(\epsilon_1\) approaches positive infinity, \(\hat{P}(x)\) approaches the uniform probability \(1/m\). The above is a rather primitive variant of what other techniques can accomplish (Wood et al., 2011).

Unfortunately, models like this get unwieldy rather quickly for the following reasons. First, as discussed in Section 9.2.5, many \(n\)-grams occur very rarely, making Laplace smoothing rather unsuitable for language modeling. Second, we need to store all counts. Third, this entirely ignores the meaning of the words. For instance, “cat” and “feline” should occur in related contexts. It is quite difficult to adjust such models to additional contexts, whereas, deep learning based language models are well suited to take this into account. Last, long word sequences are almost certain to be novel, hence a model that simply counts the frequency of previously seen word sequences is bound to perform poorly there. Therefore, we focus on using neural networks for language modeling in the rest of the chapter.

9.3.2. Perplexity

Next, let’s discuss about how to measure the language model quality, which will be used to evaluate our models in the subsequent sections. One way is to check how surprising the text is. A good language model is able to predict with high-accuracy tokens that what we will see next. Consider the following continuations of the phrase “It is raining”, as proposed by different language models:

  1. “It is raining outside”

  2. “It is raining banana tree”

  3. “It is raining piouw;kcj pwepoiut”

In terms of quality, example 1 is clearly the best. The words are sensible and logically coherent. While it might not quite accurately reflect which word follows semantically (“in San Francisco” and “in winter” would have been perfectly reasonable extensions), the model is able to capture which kind of word follows. Example 2 is considerably worse by producing a nonsensical extension. Nonetheless, at least the model has learned how to spell words and some degree of correlation between words. Last, example 3 indicates a poorly trained model that does not fit data properly.

We might measure the quality of the model by computing the likelihood of the sequence. Unfortunately this is a number that is hard to understand and difficult to compare. After all, shorter sequences are much more likely to occur than the longer ones, hence evaluating the model on Tolstoy’s magnum opus War and Peace will inevitably produce a much smaller likelihood than, say, on Saint-Exupery’s novella The Little Prince. What is missing is the equivalent of an average.

Information theory comes handy here. We have defined entropy, surprisal, and cross-entropy when we introduced the softmax regression (Section 4.1.3). If we want to compress text, we can ask about predicting the next token given the current set of tokens. A better language model should allow us to predict the next token more accurately. Thus, it should allow us to spend fewer bits in compressing the sequence. So we can measure it by the cross-entropy loss averaged over all the \(n\) tokens of a sequence:

(9.3.7)\[\frac{1}{n} \sum_{t=1}^n -\log P(x_t \mid x_{t-1}, \ldots, x_1),\]

where \(P\) is given by a language model and \(x_t\) is the actual token observed at time step \(t\) from the sequence. This makes the performance on documents of different lengths comparable. For historical reasons, scientists in natural language processing prefer to use a quantity called perplexity. In a nutshell, it is the exponential of (9.3.7):

(9.3.8)\[\exp\left(-\frac{1}{n} \sum_{t=1}^n \log P(x_t \mid x_{t-1}, \ldots, x_1)\right).\]

Perplexity can be best understood as the geometric mean of the number of real choices that we have when deciding which token to pick next. Let’s look at a number of cases:

  • In the best case scenario, the model always perfectly estimates the probability of the target token as 1. In this case the perplexity of the model is 1.

  • In the worst case scenario, the model always predicts the probability of the target token as 0. In this situation, the perplexity is positive infinity.

  • At the baseline, the model predicts a uniform distribution over all the available tokens of the vocabulary. In this case, the perplexity equals the number of unique tokens of the vocabulary. In fact, if we were to store the sequence without any compression, this would be the best we could do to encode it. Hence, this provides a nontrivial upper bound that any useful model must beat.

import torch
from d2l import torch as d2l
from mxnet import np, npx
from d2l import mxnet as d2l

npx.set_np()
import tensorflow as tf
from d2l import tensorflow as d2l

9.3.3. Partitioning Sequences

We will design language models using neural networks and use perplexity to evaluate how good the model is at predicting the next token given the current set of tokens in text sequences. Before introducing the model, let’s assume that it processes a minibatch of sequences with predefined length at a time. Now the question is how to read minibatches of input sequences and target sequences at random.

Suppose that the dataset takes the form of a sequence of \(T\) token indices in corpus. We will partition it into subsequences, where each subsequence has \(n\) tokens (time steps). To iterate over (almost) all the tokens of the entire dataset for each epoch and obtain all possible length-\(n\) subsequences, we can introduce randomness. More concretely, at the beginning of each epoch, discard the first \(d\) tokens, where \(d\in [0,n)\) is uniformly sampled at random. The rest of the sequence is then partitioned into \(m=\lfloor (T-d)/n \rfloor\) subsequences. Denote by \(\mathbf x_t = [x_t, \ldots, x_{t+n-1}]\) the length-\(n\) subsequence starting from token \(x_t\) at time step \(t\). The resulting \(m\) partitioned subsequences are \(\mathbf x_d, \mathbf x_{d+n}, \ldots, \mathbf x_{d+n(m-1)}.\) Each subsequence will be used as an input sequence into the language model.

For language modeling, the goal is to predict the next token based on what tokens we have seen so far, hence the targets (labels) are the original sequence, shifted by one token. The target sequence for any input sequence \(\mathbf x_t\) is \(\mathbf x_{t+1}\) with length \(n\).

../_images/lang-model-data.svg

Fig. 9.3.1 Obtaining 5 pairs of input sequences and target sequences from partitioned length-5 subsequences.

Fig. 9.3.1 shows an example of obtaining 5 pairs of input sequences and target sequences with \(n=5\) and \(d=2\).

@d2l.add_to_class(d2l.TimeMachine)  #@save
def __init__(self, batch_size, num_steps, num_train=10000, num_val=5000):
    super(d2l.TimeMachine, self).__init__()
    self.save_hyperparameters()
    corpus, self.vocab = self.build(self._download())
    array = torch.tensor([corpus[i:i+num_steps+1]
                        for i in range(0, len(corpus)-num_steps-1)])
    self.X, self.Y = array[:,:-1], array[:,1:]
@d2l.add_to_class(d2l.TimeMachine)  #@save
def __init__(self, batch_size, num_steps, num_train=10000, num_val=5000):
    super(d2l.TimeMachine, self).__init__()
    self.save_hyperparameters()
    corpus, self.vocab = self.build(self._download())
    array = np.array([corpus[i:i+num_steps+1]
                        for i in range(0, len(corpus)-num_steps-1)])
    self.X, self.Y = array[:,:-1], array[:,1:]
@d2l.add_to_class(d2l.TimeMachine)  #@save
def __init__(self, batch_size, num_steps, num_train=10000, num_val=5000):
    super(d2l.TimeMachine, self).__init__()
    self.save_hyperparameters()
    corpus, self.vocab = self.build(self._download())
    array = tf.constant([corpus[i:i+num_steps+1]
                        for i in range(0, len(corpus)-num_steps-1)])
    self.X, self.Y = array[:,:-1], array[:,1:]

To train language models, we will randomly sample pairs of input sequences and target sequences in minibatches. The following data loader randomly generates a minibatch from the dataset each time. The argument batch_size specifies the number of subsequence examples (self.b) in each minibatch and num_steps is the subsequence length in tokens (self.n).

@d2l.add_to_class(d2l.TimeMachine)  #@save
def get_dataloader(self, train):
    idx = slice(0, self.num_train) if train else slice(
        self.num_train, self.num_train + self.num_val)
    return self.get_tensorloader([self.X, self.Y], train, idx)

As we can see in the following, a minibatch of target sequences can be obtained by shifting the input sequences by one token.

data = d2l.TimeMachine(batch_size=2, num_steps=10)
for X, Y in data.train_dataloader():
    print('X:', X, '\nY:', Y)
    break
Downloading ../data/timemachine.txt from http://d2l-data.s3-accelerate.amazonaws.com/timemachine.txt...
X: tensor([[19,  2, 23,  6, 13, 13,  6, 19,  0, 10],
        [ 2, 23,  6, 13,  0, 21,  9, 19, 16, 22]])
Y: tensor([[ 2, 23,  6, 13, 13,  6, 19,  0, 10, 21],
        [23,  6, 13,  0, 21,  9, 19, 16, 22,  8]])
data = d2l.TimeMachine(batch_size=2, num_steps=10)
for X, Y in data.train_dataloader():
    print('X:', X, '\nY:', Y)
    break
Downloading ../data/timemachine.txt from http://d2l-data.s3-accelerate.amazonaws.com/timemachine.txt...
X: [[ 6.  5.  0.  2. 21.  0. 22. 20.  0. 10.]
 [ 9.  6.  0. 13.  2. 21. 21.  6. 19.  0.]]
Y: [[ 5.  0.  2. 21.  0. 22. 20.  0. 10.  0.]
 [ 6.  0. 13.  2. 21. 21.  6. 19.  0.  7.]]
data = d2l.TimeMachine(batch_size=2, num_steps=10)
for X, Y in data.train_dataloader():
    print('X:', X, '\nY:', Y)
    break
X: tf.Tensor(
[[22  3  6  0 21  9  2 21  0  5]
 [21  9  0  5 10 14  6 15 20 10]], shape=(2, 10), dtype=int32)
Y: tf.Tensor(
[[ 3  6  0 21  9  2 21  0  5 16]
 [ 9  0  5 10 14  6 15 20 10 16]], shape=(2, 10), dtype=int32)

9.3.4. Summary

Language models estimate the joint probability of a text sequence. For long sequences, \(n\)-grams provide a convenient model by truncating the dependence. However, there is a lot of structure but not enough frequency to deal with infrequent word combinations efficiently via Laplace smoothing. Thus, we will focus on neural language modeling in subsequent sections. To train language models, we can randomly sample pairs of input sequences and target sequences in minibatches. After training, we will use perplexity to measure the language model quality.

9.3.5. Exercises

  1. Suppose there are \(100,000\) words in the training dataset. How much word frequency and multi-word adjacent frequency does a four-gram need to store?

  2. How would you model a dialogue?

  3. What other methods can you think of for reading long sequence data?

  4. Consider our method for discarding a uniformly random number of the first few tokens at the beginning of each epoch.

    1. Does it really lead to a perfectly uniform distribution over the sequences on the document?

    2. What would you have to do to make things even more uniform?

  5. If we want a sequence example to be a complete sentence, what kind of problem does this introduce in minibatch sampling? How can we fix the problem?