.. _sec_synthetic-regression-data:
Synthetic Regression Data
=========================
Machine learning is all about extracting information from data. So you
might wonder, what could we possibly learn from synthetic data? While we
might not care intrinsically about the patterns that we ourselves baked
into an artificial data generating model, such datasets are nevertheless
useful for didactic purposes, helping us to evaluate the properties of
our learning algorithms and to confirm that our implementations work as
expected. For example, if we create data for which the correct
parameters are known *a priori*, then we can check that our model can in
fact recover them.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
%matplotlib inline
import random
import torch
from d2l import torch as d2l
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
%matplotlib inline
import random
from mxnet import gluon, np, npx
from d2l import mxnet as d2l
npx.set_np()
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
%matplotlib inline
import random
import jax
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds
from jax import numpy as jnp
from d2l import jax as d2l
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
%matplotlib inline
import random
import tensorflow as tf
from d2l import tensorflow as d2l
.. raw:: html
.. raw:: html
Generating the Dataset
----------------------
For this example, we will work in low dimension for succinctness. The
following code snippet generates 1000 examples with 2-dimensional
features drawn from a standard normal distribution. The resulting design
matrix :math:`\mathbf{X}` belongs to :math:`\mathbb{R}^{1000 \times 2}`.
We generate each label by applying a *ground truth* linear function,
corrupting them via additive noise :math:`\boldsymbol{\epsilon}`, drawn
independently and identically for each example:
.. math:: \mathbf{y}= \mathbf{X} \mathbf{w} + b + \boldsymbol{\epsilon}.
For convenience we assume that :math:`\boldsymbol{\epsilon}` is drawn
from a normal distribution with mean :math:`\mu= 0` and standard
deviation :math:`\sigma = 0.01`. Note that for object-oriented design we
add the code to the ``__init__`` method of a subclass of
``d2l.DataModule`` (introduced in :numref:`oo-design-data`). It is
good practice to allow the setting of any additional hyperparameters. We
accomplish this with ``save_hyperparameters()``. The ``batch_size`` will
be determined later.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
class SyntheticRegressionData(d2l.DataModule): #@save
"""Synthetic data for linear regression."""
def __init__(self, w, b, noise=0.01, num_train=1000, num_val=1000,
batch_size=32):
super().__init__()
self.save_hyperparameters()
n = num_train + num_val
self.X = torch.randn(n, len(w))
noise = torch.randn(n, 1) * noise
self.y = torch.matmul(self.X, w.reshape((-1, 1))) + b + noise
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
class SyntheticRegressionData(d2l.DataModule): #@save
"""Synthetic data for linear regression."""
def __init__(self, w, b, noise=0.01, num_train=1000, num_val=1000,
batch_size=32):
super().__init__()
self.save_hyperparameters()
n = num_train + num_val
self.X = np.random.randn(n, len(w))
noise = np.random.randn(n, 1) * noise
self.y = np.dot(self.X, w.reshape((-1, 1))) + b + noise
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
class SyntheticRegressionData(d2l.DataModule): #@save
"""Synthetic data for linear regression."""
def __init__(self, w, b, noise=0.01, num_train=1000, num_val=1000,
batch_size=32):
super().__init__()
self.save_hyperparameters()
n = num_train + num_val
key = jax.random.PRNGKey(0)
key1, key2 = jax.random.split(key)
self.X = jax.random.normal(key1, (n, w.shape[0]))
noise = jax.random.normal(key2, (n, 1)) * noise
self.y = jnp.matmul(self.X, w.reshape((-1, 1))) + b + noise
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
class SyntheticRegressionData(d2l.DataModule): #@save
"""Synthetic data for linear regression."""
def __init__(self, w, b, noise=0.01, num_train=1000, num_val=1000,
batch_size=32):
super().__init__()
self.save_hyperparameters()
n = num_train + num_val
self.X = tf.random.normal((n, w.shape[0]))
noise = tf.random.normal((n, 1)) * noise
self.y = tf.matmul(self.X, tf.reshape(w, (-1, 1))) + b + noise
.. raw:: html
.. raw:: html
Below, we set the true parameters to :math:`\mathbf{w} = [2, -3.4]^\top`
and :math:`b = 4.2`. Later, we can check our estimated parameters
against these *ground truth* values.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
data = SyntheticRegressionData(w=torch.tensor([2, -3.4]), b=4.2)
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
data = SyntheticRegressionData(w=np.array([2, -3.4]), b=4.2)
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
[22:03:54] ../src/storage/storage.cc:196: Using Pooled (Naive) StorageManager for CPU
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
data = SyntheticRegressionData(w=jnp.array([2, -3.4]), b=4.2)
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
data = SyntheticRegressionData(w=tf.constant([2, -3.4]), b=4.2)
.. raw:: html
.. raw:: html
Each row in ``features`` consists of a vector in :math:`\mathbb{R}^2`
and each row in ``labels`` is a scalar. Let’s have a look at the first
entry.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
print('features:', data.X[0],'\nlabel:', data.y[0])
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
features: tensor([0.9026, 1.0264])
label: tensor([2.5148])
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
print('features:', data.X[0],'\nlabel:', data.y[0])
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
features: [2.2122064 1.1630787]
label: [4.684836]
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
print('features:', data.X[0],'\nlabel:', data.y[0])
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
features: [-0.86997527 -3.2320356 ]
label: [13.438176]
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
print('features:', data.X[0],'\nlabel:', data.y[0])
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
features: tf.Tensor([ 1.0864811 -0.40848374], shape=(2,), dtype=float32)
label: tf.Tensor([7.7656293], shape=(1,), dtype=float32)
.. raw:: html
.. raw:: html
Reading the Dataset
-------------------
Training machine learning models often requires multiple passes over a
dataset, grabbing one minibatch of examples at a time. This data is then
used to update the model. To illustrate how this works, we implement the
``get_dataloader`` method, registering it in the
``SyntheticRegressionData`` class via ``add_to_class`` (introduced in
:numref:`oo-design-utilities`). It takes a batch size, a matrix of
features, and a vector of labels, and generates minibatches of size
``batch_size``. As such, each minibatch consists of a tuple of features
and labels. Note that we need to be mindful of whether we’re in training
or validation mode: in the former, we will want to read the data in
random order, whereas for the latter, being able to read data in a
pre-defined order may be important for debugging purposes.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
@d2l.add_to_class(SyntheticRegressionData)
def get_dataloader(self, train):
if train:
indices = list(range(0, self.num_train))
# The examples are read in random order
random.shuffle(indices)
else:
indices = list(range(self.num_train, self.num_train+self.num_val))
for i in range(0, len(indices), self.batch_size):
batch_indices = torch.tensor(indices[i: i+self.batch_size])
yield self.X[batch_indices], self.y[batch_indices]
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
@d2l.add_to_class(SyntheticRegressionData)
def get_dataloader(self, train):
if train:
indices = list(range(0, self.num_train))
# The examples are read in random order
random.shuffle(indices)
else:
indices = list(range(self.num_train, self.num_train+self.num_val))
for i in range(0, len(indices), self.batch_size):
batch_indices = np.array(indices[i: i+self.batch_size])
yield self.X[batch_indices], self.y[batch_indices]
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
@d2l.add_to_class(SyntheticRegressionData)
def get_dataloader(self, train):
if train:
indices = list(range(0, self.num_train))
# The examples are read in random order
random.shuffle(indices)
else:
indices = list(range(self.num_train, self.num_train+self.num_val))
for i in range(0, len(indices), self.batch_size):
batch_indices = jnp.array(indices[i: i+self.batch_size])
yield self.X[batch_indices], self.y[batch_indices]
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
@d2l.add_to_class(SyntheticRegressionData)
def get_dataloader(self, train):
if train:
indices = list(range(0, self.num_train))
# The examples are read in random order
random.shuffle(indices)
else:
indices = list(range(self.num_train, self.num_train+self.num_val))
for i in range(0, len(indices), self.batch_size):
j = tf.constant(indices[i : i+self.batch_size])
yield tf.gather(self.X, j), tf.gather(self.y, j)
.. raw:: html
.. raw:: html
To build some intuition, let’s inspect the first minibatch of data. Each
minibatch of features provides us with both its size and the
dimensionality of input features. Likewise, our minibatch of labels will
have a matching shape given by ``batch_size``.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
X, y = next(iter(data.train_dataloader()))
print('X shape:', X.shape, '\ny shape:', y.shape)
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
X shape: torch.Size([32, 2])
y shape: torch.Size([32, 1])
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
X, y = next(iter(data.train_dataloader()))
print('X shape:', X.shape, '\ny shape:', y.shape)
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
X shape: (32, 2)
y shape: (32, 1)
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
X, y = next(iter(data.train_dataloader()))
print('X shape:', X.shape, '\ny shape:', y.shape)
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
X shape: (32, 2)
y shape: (32, 1)
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
X, y = next(iter(data.train_dataloader()))
print('X shape:', X.shape, '\ny shape:', y.shape)
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
X shape: (32, 2)
y shape: (32, 1)
.. raw:: html
.. raw:: html
While seemingly innocuous, the invocation of
``iter(data.train_dataloader())`` illustrates the power of Python’s
object-oriented design. Note that we added a method to the
``SyntheticRegressionData`` class *after* creating the ``data`` object.
Nonetheless, the object benefits from the *ex post facto* addition of
functionality to the class.
Throughout the iteration we obtain distinct minibatches until the entire
dataset has been exhausted (try this). While the iteration implemented
above is good for didactic purposes, it is inefficient in ways that
might get us into trouble with real problems. For example, it requires
that we load all the data in memory and that we perform lots of random
memory access. The built-in iterators implemented in a deep learning
framework are considerably more efficient and they can deal with sources
such as data stored in files, data received via a stream, and data
generated or processed on the fly. Next let’s try to implement the same
method using built-in iterators.
Concise Implementation of the Data Loader
-----------------------------------------
Rather than writing our own iterator, we can call the existing API in a
framework to load data. As before, we need a dataset with features ``X``
and labels ``y``. Beyond that, we set ``batch_size`` in the built-in
data loader and let it take care of shuffling examples efficiently.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
@d2l.add_to_class(d2l.DataModule) #@save
def get_tensorloader(self, tensors, train, indices=slice(0, None)):
tensors = tuple(a[indices] for a in tensors)
dataset = torch.utils.data.TensorDataset(*tensors)
return torch.utils.data.DataLoader(dataset, self.batch_size,
shuffle=train)
@d2l.add_to_class(SyntheticRegressionData) #@save
def get_dataloader(self, train):
i = slice(0, self.num_train) if train else slice(self.num_train, None)
return self.get_tensorloader((self.X, self.y), train, i)
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
@d2l.add_to_class(d2l.DataModule) #@save
def get_tensorloader(self, tensors, train, indices=slice(0, None)):
tensors = tuple(a[indices] for a in tensors)
dataset = gluon.data.ArrayDataset(*tensors)
return gluon.data.DataLoader(dataset, self.batch_size,
shuffle=train)
@d2l.add_to_class(SyntheticRegressionData) #@save
def get_dataloader(self, train):
i = slice(0, self.num_train) if train else slice(self.num_train, None)
return self.get_tensorloader((self.X, self.y), train, i)
.. raw:: html
.. raw:: html
JAX is all about NumPy like API with device acceleration and the
functional transformations, so at least the current version doesn’t
include data loading methods. With other libraries we already have great
data loaders out there, and JAX suggests using them instead. Here we
will grab TensorFlow’s data loader, and modify it slightly to make it
work with JAX.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
@d2l.add_to_class(d2l.DataModule) #@save
def get_tensorloader(self, tensors, train, indices=slice(0, None)):
tensors = tuple(a[indices] for a in tensors)
# Use Tensorflow Datasets & Dataloader. JAX or Flax do not provide
# any dataloading functionality
shuffle_buffer = tensors[0].shape[0] if train else 1
return tfds.as_numpy(
tf.data.Dataset.from_tensor_slices(tensors).shuffle(
buffer_size=shuffle_buffer).batch(self.batch_size))
@d2l.add_to_class(SyntheticRegressionData) #@save
def get_dataloader(self, train):
i = slice(0, self.num_train) if train else slice(self.num_train, None)
return self.get_tensorloader((self.X, self.y), train, i)
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
@d2l.add_to_class(d2l.DataModule) #@save
def get_tensorloader(self, tensors, train, indices=slice(0, None)):
tensors = tuple(a[indices] for a in tensors)
shuffle_buffer = tensors[0].shape[0] if train else 1
return tf.data.Dataset.from_tensor_slices(tensors).shuffle(
buffer_size=shuffle_buffer).batch(self.batch_size)
@d2l.add_to_class(SyntheticRegressionData) #@save
def get_dataloader(self, train):
i = slice(0, self.num_train) if train else slice(self.num_train, None)
return self.get_tensorloader((self.X, self.y), train, i)
.. raw:: html
.. raw:: html
The new data loader behaves just like the previous one, except that it
is more efficient and has some added functionality.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
X, y = next(iter(data.train_dataloader()))
print('X shape:', X.shape, '\ny shape:', y.shape)
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
X shape: torch.Size([32, 2])
y shape: torch.Size([32, 1])
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
X, y = next(iter(data.train_dataloader()))
print('X shape:', X.shape, '\ny shape:', y.shape)
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
X shape: (32, 2)
y shape: (32, 1)
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
X, y = next(iter(data.train_dataloader()))
print('X shape:', X.shape, '\ny shape:', y.shape)
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
X shape: (32, 2)
y shape: (32, 1)
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
X, y = next(iter(data.train_dataloader()))
print('X shape:', X.shape, '\ny shape:', y.shape)
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
X shape: (32, 2)
y shape: (32, 1)
.. raw:: html
.. raw:: html
For instance, the data loader provided by the framework API supports the
built-in ``__len__`` method, so we can query its length, i.e., the
number of batches.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
len(data.train_dataloader())
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
32
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
len(data.train_dataloader())
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
32
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
len(data.train_dataloader())
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
32
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
len(data.train_dataloader())
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
32
.. raw:: html
.. raw:: html
Summary
-------
Data loaders are a convenient way of abstracting out the process of
loading and manipulating data. This way the same machine learning
*algorithm* is capable of processing many different types and sources of
data without the need for modification. One of the nice things about
data loaders is that they can be composed. For instance, we might be
loading images and then have a postprocessing filter that crops them or
modifies them in other ways. As such, data loaders can be used to
describe an entire data processing pipeline.
As for the model itself, the two-dimensional linear model is about the
simplest we might encounter. It lets us test out the accuracy of
regression models without worrying about having insufficient amounts of
data or an underdetermined system of equations. We will put this to good
use in the next section.
Exercises
---------
1. What will happen if the number of examples cannot be divided by the
batch size. How would you change this behavior by specifying a
different argument by using the framework’s API?
2. Suppose that we want to generate a huge dataset, where both the size
of the parameter vector ``w`` and the number of examples
``num_examples`` are large.
1. What happens if we cannot hold all data in memory?
2. How would you shuffle the data if it is held on disk? Your task is
to design an *efficient* algorithm that does not require too many
random reads or writes. Hint: `pseudorandom permutation
generators `__
allow you to design a reshuffle without the need to store the
permutation table explicitly :cite:`Naor.Reingold.1999`.
3. Implement a data generator that produces new data on the fly, every
time the iterator is called.
4. How would you design a random data generator that generates *the
same* data each time it is called?
.. raw:: html
.. raw:: html
`Discussions `__
.. raw:: html
.. raw:: html
`Discussions `__
.. raw:: html
.. raw:: html
`Discussions `__
.. raw:: html
.. raw:: html
`Discussions `__
.. raw:: html
.. raw:: html