.. _sec_mlp-implementation:
Implementation of Multilayer Perceptrons
========================================
Multilayer perceptrons (MLPs) are not much more complex to implement
than simple linear models. The key conceptual difference is that we now
concatenate multiple layers.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
import torch
from torch import nn
from d2l import torch as d2l
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
from mxnet import np, npx
from mxnet.gluon import nn
from d2l import mxnet as d2l
npx.set_np()
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
import jax
from flax import linen as nn
from jax import numpy as jnp
from d2l import jax as d2l
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
import tensorflow as tf
from d2l import tensorflow as d2l
.. raw:: html
.. raw:: html
Implementation from Scratch
---------------------------
Let’s begin again by implementing such a network from scratch.
Initializing Model Parameters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Recall that Fashion-MNIST contains 10 classes, and that each image
consists of a :math:`28 \times 28 = 784` grid of grayscale pixel values.
As before we will disregard the spatial structure among the pixels for
now, so we can think of this as a classification dataset with 784 input
features and 10 classes. To begin, we will implement an MLP with one
hidden layer and 256 hidden units. Both the number of layers and their
width are adjustable (they are considered hyperparameters). Typically,
we choose the layer widths to be divisible by larger powers of 2. This
is computationally efficient due to the way memory is allocated and
addressed in hardware.
Again, we will represent our parameters with several tensors. Note that
*for every layer*, we must keep track of one weight matrix and one bias
vector. As always, we allocate memory for the gradients of the loss with
respect to these parameters.
.. raw:: html
.. raw:: html
In the code below we use ``nn.Parameter`` to automatically register a
class attribute as a parameter to be tracked by ``autograd``
(:numref:`sec_autograd`).
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
class MLPScratch(d2l.Classifier):
def __init__(self, num_inputs, num_outputs, num_hiddens, lr, sigma=0.01):
super().__init__()
self.save_hyperparameters()
self.W1 = nn.Parameter(torch.randn(num_inputs, num_hiddens) * sigma)
self.b1 = nn.Parameter(torch.zeros(num_hiddens))
self.W2 = nn.Parameter(torch.randn(num_hiddens, num_outputs) * sigma)
self.b2 = nn.Parameter(torch.zeros(num_outputs))
.. raw:: html
.. raw:: html
In the code below, we first define and initialize the parameters and
then enable gradient tracking.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
class MLPScratch(d2l.Classifier):
def __init__(self, num_inputs, num_outputs, num_hiddens, lr, sigma=0.01):
super().__init__()
self.save_hyperparameters()
self.W1 = np.random.randn(num_inputs, num_hiddens) * sigma
self.b1 = np.zeros(num_hiddens)
self.W2 = np.random.randn(num_hiddens, num_outputs) * sigma
self.b2 = np.zeros(num_outputs)
for param in self.get_scratch_params():
param.attach_grad()
.. raw:: html
.. raw:: html
In the code below we use ``flax.linen.Module.param`` to define the model
parameter.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
class MLPScratch(d2l.Classifier):
num_inputs: int
num_outputs: int
num_hiddens: int
lr: float
sigma: float = 0.01
def setup(self):
self.W1 = self.param('W1', nn.initializers.normal(self.sigma),
(self.num_inputs, self.num_hiddens))
self.b1 = self.param('b1', nn.initializers.zeros, self.num_hiddens)
self.W2 = self.param('W2', nn.initializers.normal(self.sigma),
(self.num_hiddens, self.num_outputs))
self.b2 = self.param('b2', nn.initializers.zeros, self.num_outputs)
.. raw:: html
.. raw:: html
In the code below we use ``tf.Variable`` to define the model parameter.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
class MLPScratch(d2l.Classifier):
def __init__(self, num_inputs, num_outputs, num_hiddens, lr, sigma=0.01):
super().__init__()
self.save_hyperparameters()
self.W1 = tf.Variable(
tf.random.normal((num_inputs, num_hiddens)) * sigma)
self.b1 = tf.Variable(tf.zeros(num_hiddens))
self.W2 = tf.Variable(
tf.random.normal((num_hiddens, num_outputs)) * sigma)
self.b2 = tf.Variable(tf.zeros(num_outputs))
.. raw:: html
.. raw:: html
Model
~~~~~
To make sure we know how everything works, we will implement the ReLU
activation ourselves rather than invoking the built-in ``relu`` function
directly.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
def relu(X):
a = torch.zeros_like(X)
return torch.max(X, a)
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
def relu(X):
return np.maximum(X, 0)
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
def relu(X):
return jnp.maximum(X, 0)
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
def relu(X):
return tf.math.maximum(X, 0)
.. raw:: html
.. raw:: html
Since we are disregarding spatial structure, we ``reshape`` each
two-dimensional image into a flat vector of length ``num_inputs``.
Finally, we implement our model with just a few lines of code. Since we
use the framework built-in autograd this is all that it takes.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
@d2l.add_to_class(MLPScratch)
def forward(self, X):
X = X.reshape((-1, self.num_inputs))
H = relu(torch.matmul(X, self.W1) + self.b1)
return torch.matmul(H, self.W2) + self.b2
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
@d2l.add_to_class(MLPScratch)
def forward(self, X):
X = X.reshape((-1, self.num_inputs))
H = relu(np.dot(X, self.W1) + self.b1)
return np.dot(H, self.W2) + self.b2
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
@d2l.add_to_class(MLPScratch)
def forward(self, X):
X = X.reshape((-1, self.num_inputs))
H = relu(jnp.matmul(X, self.W1) + self.b1)
return jnp.matmul(H, self.W2) + self.b2
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
@d2l.add_to_class(MLPScratch)
def forward(self, X):
X = tf.reshape(X, (-1, self.num_inputs))
H = relu(tf.matmul(X, self.W1) + self.b1)
return tf.matmul(H, self.W2) + self.b2
.. raw:: html
.. raw:: html
Training
~~~~~~~~
Fortunately, the training loop for MLPs is exactly the same as for
softmax regression. We define the model, data, and trainer, then finally
invoke the ``fit`` method on model and data.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
model = MLPScratch(num_inputs=784, num_outputs=10, num_hiddens=256, lr=0.1)
data = d2l.FashionMNIST(batch_size=256)
trainer = d2l.Trainer(max_epochs=10)
trainer.fit(model, data)
.. figure:: output_mlp-implementation_d1b2f2_67_0.svg
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
model = MLPScratch(num_inputs=784, num_outputs=10, num_hiddens=256, lr=0.1)
data = d2l.FashionMNIST(batch_size=256)
trainer = d2l.Trainer(max_epochs=10)
trainer.fit(model, data)
.. figure:: output_mlp-implementation_d1b2f2_70_0.svg
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
model = MLPScratch(num_inputs=784, num_outputs=10, num_hiddens=256, lr=0.1)
data = d2l.FashionMNIST(batch_size=256)
trainer = d2l.Trainer(max_epochs=10)
trainer.fit(model, data)
.. figure:: output_mlp-implementation_d1b2f2_73_0.svg
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
model = MLPScratch(num_inputs=784, num_outputs=10, num_hiddens=256, lr=0.1)
data = d2l.FashionMNIST(batch_size=256)
trainer = d2l.Trainer(max_epochs=10)
trainer.fit(model, data)
.. figure:: output_mlp-implementation_d1b2f2_76_0.svg
.. raw:: html
.. raw:: html
Concise Implementation
----------------------
As you might expect, by relying on the high-level APIs, we can implement
MLPs even more concisely.
Model
~~~~~
Compared with our concise implementation of softmax regression
implementation (:numref:`sec_softmax_concise`), the only difference is
that we add *two* fully connected layers where we previously added only
*one*. The first is the hidden layer, the second is the output layer.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
class MLP(d2l.Classifier):
def __init__(self, num_outputs, num_hiddens, lr):
super().__init__()
self.save_hyperparameters()
self.net = nn.Sequential(nn.Flatten(), nn.LazyLinear(num_hiddens),
nn.ReLU(), nn.LazyLinear(num_outputs))
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
class MLP(d2l.Classifier):
def __init__(self, num_outputs, num_hiddens, lr):
super().__init__()
self.save_hyperparameters()
self.net = nn.Sequential()
self.net.add(nn.Dense(num_hiddens, activation='relu'),
nn.Dense(num_outputs))
self.net.initialize()
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
class MLP(d2l.Classifier):
num_outputs: int
num_hiddens: int
lr: float
@nn.compact
def __call__(self, X):
X = X.reshape((X.shape[0], -1)) # Flatten
X = nn.Dense(self.num_hiddens)(X)
X = nn.relu(X)
X = nn.Dense(self.num_outputs)(X)
return X
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
class MLP(d2l.Classifier):
def __init__(self, num_outputs, num_hiddens, lr):
super().__init__()
self.save_hyperparameters()
self.net = tf.keras.models.Sequential([
tf.keras.layers.Flatten(),
tf.keras.layers.Dense(num_hiddens, activation='relu'),
tf.keras.layers.Dense(num_outputs)])
.. raw:: html
.. raw:: html
Previously, we defined ``forward`` methods for models to transform input
using the model parameters. These operations are essentially a pipeline:
you take an input and apply a transformation (e.g., matrix
multiplication with weights followed by bias addition), then
repetitively use the output of the current transformation as input to
the next transformation. However, you may have noticed that no
``forward`` method is defined here. In fact, ``MLP`` inherits the
``forward`` method from the ``Module`` class
(:numref:`subsec_oo-design-models`) to simply invoke ``self.net(X)``
(``X`` is input), which is now defined as a sequence of transformations
via the ``Sequential`` class. The ``Sequential`` class abstracts the
forward process enabling us to focus on the transformations. We will
further discuss how the ``Sequential`` class works in
:numref:`subsec_model-construction-sequential`.
Training
~~~~~~~~
The training loop is exactly the same as when we implemented softmax
regression. This modularity enables us to separate matters concerning
the model architecture from orthogonal considerations.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
model = MLP(num_outputs=10, num_hiddens=256, lr=0.1)
trainer.fit(model, data)
.. figure:: output_mlp-implementation_d1b2f2_97_0.svg
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
model = MLP(num_outputs=10, num_hiddens=256, lr=0.1)
trainer.fit(model, data)
.. figure:: output_mlp-implementation_d1b2f2_100_0.svg
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
model = MLP(num_outputs=10, num_hiddens=256, lr=0.1)
trainer.fit(model, data)
.. figure:: output_mlp-implementation_d1b2f2_103_0.svg
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
model = MLP(num_outputs=10, num_hiddens=256, lr=0.1)
trainer.fit(model, data)
.. figure:: output_mlp-implementation_d1b2f2_106_0.svg
.. raw:: html
.. raw:: html
Summary
-------
Now that we have more practice in designing deep networks, the step from
a single to multiple layers of deep networks does not pose such a
significant challenge any longer. In particular, we can reuse the
training algorithm and data loader. Note, though, that implementing MLPs
from scratch is nonetheless messy: naming and keeping track of the model
parameters makes it difficult to extend models. For instance, imagine
wanting to insert another layer between layers 42 and 43. This might now
be layer 42b, unless we are willing to perform sequential renaming.
Moreover, if we implement the network from scratch, it is much more
difficult for the framework to perform meaningful performance
optimizations.
Nonetheless, you have now reached the state of the art of the late 1980s
when fully connected deep networks were the method of choice for neural
network modeling. Our next conceptual step will be to consider images.
Before we do so, we need to review a number of statistical basics and
details on how to compute models efficiently.
Exercises
---------
1. Change the number of hidden units ``num_hiddens`` and plot how its
number affects the accuracy of the model. What is the best value of
this hyperparameter?
2. Try adding a hidden layer to see how it affects the results.
3. Why is it a bad idea to insert a hidden layer with a single neuron?
What could go wrong?
4. How does changing the learning rate alter your results? With all
other parameters fixed, which learning rate gives you the best
results? How does this relate to the number of epochs?
5. Let’s optimize over all hyperparameters jointly, i.e., learning rate,
number of epochs, number of hidden layers, and number of hidden units
per layer.
1. What is the best result you can get by optimizing over all of
them?
2. Why it is much more challenging to deal with multiple
hyperparameters?
3. Describe an efficient strategy for optimizing over multiple
parameters jointly.
6. Compare the speed of the framework and the from-scratch
implementation for a challenging problem. How does it change with the
complexity of the network?
7. Measure the speed of tensor–matrix multiplications for well-aligned
and misaligned matrices. For instance, test for matrices with
dimension 1024, 1025, 1026, 1028, and 1032.
1. How does this change between GPUs and CPUs?
2. Determine the memory bus width of your CPU and GPU.
8. Try out different activation functions. Which one works best?
9. Is there a difference between weight initializations of the network?
Does it matter?
.. raw:: html
.. raw:: html
`Discussions `__
.. raw:: html
.. raw:: html
`Discussions `__
.. raw:: html
.. raw:: html
`Discussions `__
.. raw:: html
.. raw:: html
`Discussions `__
.. raw:: html
.. raw:: html