4.3. Concise Implementation of Multilayer Perceptrons

As you might expect, by relying on the high-level interface, we can implement MLPs even more concisely.

# MXNet
from d2l import mxnet as d2l
from mxnet import gluon, init, npx
from mxnet.gluon import nn
npx.set_np()

# PyTorch
from d2l import torch as d2l
import torch
from torch import nn

4.3.1. The Model

As compared with our concise implementation of softmax regression (Section 3.7), the only difference is that we add two fully-connected layers (previously, we added one). The first is our hidden layer, which contains 256 hidden units and applies the ReLU activation function. The second is our output layer.

# MXNet
net = nn.Sequential()
net.add(nn.Dense(256, activation='relu'),
        nn.Dense(10))
net.initialize(init.Normal(sigma=0.01))

# PyTorch
class Reshape(torch.nn.Module):
    """Flatten each 28x28 image into a vector of 784 pixel values."""
    def forward(self, x):
        return x.view(-1, 784)

net = nn.Sequential(Reshape(),
                    nn.Linear(784, 256),
                    nn.ReLU(),
                    nn.Linear(256, 10))

def init_weights(m):
    """Initialize the weights of every linear layer from N(0, 0.01^2)."""
    if type(m) == nn.Linear:
        torch.nn.init.normal_(m.weight, std=0.01)

net.apply(init_weights)
Sequential(
  (0): Reshape()
  (1): Linear(in_features=784, out_features=256, bias=True)
  (2): ReLU()
  (3): Linear(in_features=256, out_features=10, bias=True)
)
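
Before training, it is worth confirming that the network produces outputs of the expected shape. The following quick check is not part of the original notebook; it is a small sketch that passes a dummy minibatch of two Fashion-MNIST-sized images through the PyTorch net defined above.

# Sanity check (sketch): two fake grayscale 28x28 images in,
# one row of 10 class logits out per image.
X = torch.randn(2, 1, 28, 28)
print(net(X).shape)  # torch.Size([2, 10])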

The training loop is exactly the same as when we implemented softmax regression. This modularity enables us to separate matters concerning the model architecture from orthogonal considerations such as how the model is trained.

# MXNet
batch_size, num_epochs = 256, 10
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
loss = gluon.loss.SoftmaxCrossEntropyLoss()
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.5})
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer)

(Figure: training loss, training accuracy, and test accuracy plotted over the 10 epochs.)

# PyTorch
num_epochs, lr, batch_size = 10, 0.5, 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
loss = nn.CrossEntropyLoss()
trainer = torch.optim.SGD(net.parameters(), lr=lr)
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer)

(Figure: training loss, training accuracy, and test accuracy plotted over the 10 epochs.)
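
The call to d2l.train_ch3 hides the actual loop over epochs and minibatches. As a rough sketch (not the library's exact implementation, and with the accuracy tracking and plotting omitted), the PyTorch version boils down to something like this:

# Training-loop sketch (assumption: approximates what d2l.train_ch3 does,
# minus metric tracking and the live plot).
for epoch in range(num_epochs):
    for X, y in train_iter:
        y_hat = net(X)         # forward pass
        l = loss(y_hat, y)     # mean cross-entropy on the minibatch
        trainer.zero_grad()    # clear gradients from the previous step
        l.backward()           # backpropagate
        trainer.step()         # SGD parameter update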

4.3.2. Exercises

  1. Try adding different numbers of hidden layers. What setting (keeping other parameters and hyperparameters constant) works best? (A configurable model builder that can help here is sketched after this list.)

  2. Try out different activation functions. Which ones work best?

  3. Try different schemes for initializing the weights. What method works best?
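
For the first two exercises, one possible starting point is a hypothetical helper (not part of d2l) that builds a PyTorch MLP with a configurable depth and activation function, reusing the Reshape and init_weights definitions from above.

# Hypothetical helper for Exercises 1 and 2: vary the number of hidden
# layers and the activation function while keeping everything else fixed.
def make_mlp(num_hidden_layers=1, num_hiddens=256, activation=nn.ReLU):
    layers = [Reshape()]
    in_features = 784
    for _ in range(num_hidden_layers):
        layers += [nn.Linear(in_features, num_hiddens), activation()]
        in_features = num_hiddens
    layers.append(nn.Linear(in_features, 10))
    net = nn.Sequential(*layers)
    net.apply(init_weights)  # same normal(0, 0.01) initialization as above
    return net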