.. _sec_fcn:
Fully Convolutional Networks
============================
As discussed in :numref:`sec_semantic_segmentation`, semantic
segmentation classifies images at the pixel level. A fully convolutional
network (FCN) uses a convolutional neural network to transform image
pixels to pixel classes :cite:`Long.Shelhamer.Darrell.2015`. Unlike
the CNNs that we encountered earlier for image classification or object
detection, a fully convolutional network transforms the height and width
of intermediate feature maps back to those of the input image: this is
achieved by the transposed convolutional layer introduced in
:numref:`sec_transposed_conv`. As a result, the classification output
and the input image have a one-to-one correspondence at the pixel level: the
channel dimension at any output pixel holds the classification results
for the input pixel at the same spatial position.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
%matplotlib inline
import torch
import torchvision
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
%matplotlib inline
from mxnet import gluon, image, init, np, npx
from mxnet.gluon import nn
from d2l import mxnet as d2l
npx.set_np()
The Model
---------
Here we describe the basic design of the fully convolutional network
model. As shown in :numref:`fig_fcn`, this model first uses a CNN to
extract image features, then transforms the number of channels into the
number of classes via a :math:`1\times 1` convolutional layer, and
finally transforms the height and width of the feature maps to those of
the input image via the transposed convolution introduced in
:numref:`sec_transposed_conv`. As a result, the model output has the
same height and width as the input image, and the channel dimension at each
output pixel holds the class predictions for the input pixel at the same
spatial position.
.. _fig_fcn:
.. figure:: ../img/fcn.svg
Fully convolutional network.
Below, we use a ResNet-18 model pretrained on the ImageNet dataset to
extract image features and denote the model instance as
``pretrained_net``. The last few layers of this model include a global
average pooling layer and a fully connected layer: they are not needed
in the fully convolutional network.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
pretrained_net = torchvision.models.resnet18(pretrained=True)
list(pretrained_net.children())[-3:]
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
Downloading: "https://download.pytorch.org/models/resnet18-f37072fd.pth" to /home/ci/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth
100%|██████████| 44.7M/44.7M [00:00<00:00, 56.3MB/s]
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
[Sequential(
(0): BasicBlock(
(conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(downsample): Sequential(
(0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(1): BasicBlock(
(conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
),
AdaptiveAvgPool2d(output_size=(1, 1)),
Linear(in_features=512, out_features=1000, bias=True)]
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
pretrained_net = gluon.model_zoo.vision.resnet18_v2(pretrained=True)
pretrained_net.features[-3:], pretrained_net.output
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
[22:23:49] ../src/storage/storage.cc:196: Using Pooled (Naive) StorageManager for CPU
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
(HybridSequential(
(0): Activation(relu)
(1): GlobalAvgPool2D(size=(1, 1), stride=(1, 1), padding=(0, 0), ceil_mode=True, global_pool=True, pool_type=avg, layout=NCHW)
(2): Flatten
),
Dense(512 -> 1000, linear))
Next, we create the fully convolutional network instance ``net``. It
copies all the pretrained layers of ResNet-18 except for the final global
average pooling layer and the fully connected layer closest to the output.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
net = nn.Sequential(*list(pretrained_net.children())[:-2])
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
net = nn.HybridSequential()
for layer in pretrained_net.features[:-2]:
net.add(layer)
Given an input with height and width of 320 and 480 respectively, the
forward propagation of ``net`` reduces the input height and width to
1/32 of the original, namely 10 and 15.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
X = torch.rand(size=(1, 3, 320, 480))
net(X).shape
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
torch.Size([1, 512, 10, 15])
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
X = np.random.uniform(size=(1, 3, 320, 480))
net(X).shape
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
(1, 512, 10, 15)
Next, we use a :math:`1\times 1` convolutional layer to transform the
number of output channels into the number of classes (21) of the Pascal
VOC2012 dataset. Finally, we need to increase the height and width of
the feature maps by 32 times to change them back to the height and width
of the input image. Recall how to calculate the output shape of a
convolutional layer in :numref:`sec_padding`. Since
:math:`(320-64+16\times2+32)/32=10` and
:math:`(480-64+16\times2+32)/32=15`, we construct a transposed
convolutional layer with a stride of :math:`32`, setting the kernel height
and width to :math:`64` and the padding to :math:`16`. In general, for a
stride of :math:`s`, a padding of :math:`s/2` (assuming :math:`s/2` is an
integer), and a kernel height and width of :math:`2s`, the transposed
convolution increases the height and width of the input by :math:`s` times.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
num_classes = 21
net.add_module('final_conv', nn.Conv2d(512, num_classes, kernel_size=1))
net.add_module('transpose_conv', nn.ConvTranspose2d(num_classes, num_classes,
kernel_size=64, padding=16, stride=32))
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
num_classes = 21
net.add(nn.Conv2D(num_classes, kernel_size=1),
nn.Conv2DTranspose(
num_classes, kernel_size=64, padding=16, strides=32))
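As a quick check, we can feed the same :math:`320\times 480` dummy input
through ``net`` again: since the appended transposed convolution upsamples
the :math:`10\times 15` feature maps by :math:`32` times, the output should
have 21 channels and the original input resolution. The snippet below is a
sketch for the PyTorch version (the MXNet check is analogous).

.. code:: python

   # Sanity check: one score per class at every input pixel position
   X = torch.rand(size=(1, 3, 320, 480))
   net(X).shape  # expected: torch.Size([1, 21, 320, 480])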
Initializing Transposed Convolutional Layers
--------------------------------------------
We already know that transposed convolutional layers can increase the
height and width of feature maps. In image processing, we may need to
scale up an image, i.e., perform *upsampling*. *Bilinear interpolation* is one
of the commonly used upsampling techniques. It is also often used for
initializing transposed convolutional layers.
To explain bilinear interpolation, say that given an input image we want
to calculate each pixel of the upsampled output image. In order to
calculate the pixel of the output image at coordinate :math:`(x, y)`,
first map :math:`(x, y)` to coordinate :math:`(x', y')` on the input
image, for example, according to the ratio of the input size to the
output size. Note that the mapped :math:`x'` and :math:`y'` are real
numbers. Then, find the four pixels closest to coordinate
:math:`(x', y')` on the input image. Finally, the pixel of the output
image at coordinate :math:`(x, y)` is calculated based on these four
closest pixels on the input image and their relative distances from
:math:`(x', y')`.
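For example, a minimal sketch of this computation at a single output
coordinate is given below; the helper name ``bilinear_at`` and the convention
that the first index is the row are our own choices, and boundary handling is
omitted.

.. code:: python

   import math

   def bilinear_at(img, x, y):
       # img: a 2D tensor (single channel); (x, y): a real-valued coordinate
       # strictly inside the image, with x indexing rows and y indexing columns
       x0, y0 = math.floor(x), math.floor(y)  # upper-left neighbor
       x1, y1 = x0 + 1, y0 + 1                # lower-right neighbor
       dx, dy = x - x0, y - y0                # fractional offsets
       # Each neighbor is weighted by its closeness to (x, y) along both axes
       return ((1 - dx) * (1 - dy) * img[x0, y0] + dx * (1 - dy) * img[x1, y0] +
               (1 - dx) * dy * img[x0, y1] + dx * dy * img[x1, y1])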
Upsampling via bilinear interpolation can be implemented by a transposed
convolutional layer whose kernel is constructed by the following
``bilinear_kernel`` function. Due to space limitations, we only provide the
implementation of ``bilinear_kernel`` below without discussing its algorithm
design.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
def bilinear_kernel(in_channels, out_channels, kernel_size):
factor = (kernel_size + 1) // 2
if kernel_size % 2 == 1:
center = factor - 1
else:
center = factor - 0.5
og = (torch.arange(kernel_size).reshape(-1, 1),
torch.arange(kernel_size).reshape(1, -1))
filt = (1 - torch.abs(og[0] - center) / factor) * \
(1 - torch.abs(og[1] - center) / factor)
weight = torch.zeros((in_channels, out_channels,
kernel_size, kernel_size))
weight[range(in_channels), range(out_channels), :, :] = filt
return weight
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
def bilinear_kernel(in_channels, out_channels, kernel_size):
factor = (kernel_size + 1) // 2
if kernel_size % 2 == 1:
center = factor - 1
else:
center = factor - 0.5
og = (np.arange(kernel_size).reshape(-1, 1),
np.arange(kernel_size).reshape(1, -1))
filt = (1 - np.abs(og[0] - center) / factor) * \
(1 - np.abs(og[1] - center) / factor)
weight = np.zeros((in_channels, out_channels, kernel_size, kernel_size))
weight[range(in_channels), range(out_channels), :, :] = filt
return np.array(weight)
Let’s experiment with bilinear interpolation upsampling implemented by a
transposed convolutional layer. We construct a transposed convolutional layer
that doubles the height and width, and initialize its kernel with the
``bilinear_kernel`` function.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
conv_trans = nn.ConvTranspose2d(3, 3, kernel_size=4, padding=1, stride=2,
bias=False)
conv_trans.weight.data.copy_(bilinear_kernel(3, 3, 4));
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
conv_trans = nn.Conv2DTranspose(3, kernel_size=4, padding=1, strides=2)
conv_trans.initialize(init.Constant(bilinear_kernel(3, 3, 4)))
Read the image into ``X`` and assign the upsampling output to ``Y``. In order
to print the image, we need to adjust the position of the channel
dimension.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
img = torchvision.transforms.ToTensor()(d2l.Image.open('../img/catdog.jpg'))
X = img.unsqueeze(0)
Y = conv_trans(X)
out_img = Y[0].permute(1, 2, 0).detach()
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
img = image.imread('../img/catdog.jpg')
X = np.expand_dims(img.astype('float32').transpose(2, 0, 1), axis=0) / 255
Y = conv_trans(X)
out_img = Y[0].transpose(1, 2, 0)
As we can see, the transposed convolutional layer increases both the
height and width of the image by a factor of two. Except for the
different scales in coordinates, the image scaled up by bilinear
interpolation and the original image printed in :numref:`sec_bbox`
look the same.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
d2l.set_figsize()
print('input image shape:', img.permute(1, 2, 0).shape)
d2l.plt.imshow(img.permute(1, 2, 0));
print('output image shape:', out_img.shape)
d2l.plt.imshow(out_img);
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
input image shape: torch.Size([561, 728, 3])
output image shape: torch.Size([1122, 1456, 3])
.. figure:: output_fcn_ce3435_75_1.svg
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
d2l.set_figsize()
print('input image shape:', img.shape)
d2l.plt.imshow(img.asnumpy());
print('output image shape:', out_img.shape)
d2l.plt.imshow(out_img.asnumpy());
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
input image shape: (561, 728, 3)
output image shape: (1122, 1456, 3)
.. figure:: output_fcn_ce3435_78_1.svg
In a fully convolutional network, we initialize the transposed convolutional
layer with the bilinear interpolation upsampling kernel. For the
:math:`1\times 1` convolutional layer, we use Xavier initialization.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
W = bilinear_kernel(num_classes, num_classes, 64)
net.transpose_conv.weight.data.copy_(W);
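Note that the PyTorch code above only initializes the transposed
convolutional layer, so the :math:`1\times 1` convolutional layer keeps its
default initialization. To apply the Xavier initialization mentioned above,
one option (a sketch) is:

.. code:: python

   # Optional: Xavier-initialize the 1x1 convolutional layer as well
   nn.init.xavier_uniform_(net.final_conv.weight);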
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
W = bilinear_kernel(num_classes, num_classes, 64)
net[-1].initialize(init.Constant(W))
net[-2].initialize(init=init.Xavier())
Reading the Dataset
-------------------
We read the semantic segmentation dataset as introduced in
:numref:`sec_semantic_segmentation`. The output image shape of random
cropping is specified as :math:`320\times 480`: both the height and
width are divisible by :math:`32`.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
batch_size, crop_size = 32, (320, 480)
train_iter, test_iter = d2l.load_data_voc(batch_size, crop_size)
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
read 1114 examples
read 1078 examples
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
batch_size, crop_size = 32, (320, 480)
train_iter, test_iter = d2l.load_data_voc(batch_size, crop_size)
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
read 1114 examples
read 1078 examples
Training
--------
Now we can train our constructed fully convolutional network. The loss
function and accuracy calculation here are not fundamentally different from
those used for image classification in earlier chapters. Because we use the
output channels of the transposed convolutional layer to predict the class of
each pixel, the channel dimension is specified in the loss calculation. In
addition, accuracy is calculated based on the correctness of the predicted
class over all pixels.
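Concretely, per-pixel accuracy amounts to taking the argmax over the channel
dimension and comparing it with the label map at every pixel. A minimal
sketch (the function name is ours; the training utilities below compute this
with their own helpers) is:

.. code:: python

   def pixel_accuracy(output, label):
       # output: (batch, num_classes, height, width); label: (batch, height, width)
       pred = output.argmax(dim=1)                   # predicted class per pixel
       return (pred == label).float().mean().item()  # fraction of correct pixels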
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
def loss(inputs, targets):
return F.cross_entropy(inputs, targets, reduction='none').mean(1).mean(1)
num_epochs, lr, wd, devices = 5, 0.001, 1e-3, d2l.try_all_gpus()
trainer = torch.optim.SGD(net.parameters(), lr=lr, weight_decay=wd)
d2l.train_ch13(net, train_iter, test_iter, loss, trainer, num_epochs, devices)
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
loss 0.449, train acc 0.861, test acc 0.852
226.7 examples/sec on [device(type='cuda', index=0), device(type='cuda', index=1)]
.. figure:: output_fcn_ce3435_102_1.svg
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
num_epochs, lr, wd, devices = 5, 0.1, 1e-3, d2l.try_all_gpus()
loss = gluon.loss.SoftmaxCrossEntropyLoss(axis=1)
net.collect_params().reset_ctx(devices)
trainer = gluon.Trainer(net.collect_params(), 'sgd',
{'learning_rate': lr, 'wd': wd})
d2l.train_ch13(net, train_iter, test_iter, loss, trainer, num_epochs, devices)
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
loss 0.322, train acc 0.894, test acc 0.851
132.1 examples/sec on [gpu(0), gpu(1)]
.. figure:: output_fcn_ce3435_105_1.svg
Prediction
----------
When predicting, we need to standardize the input image in each channel
and transform the image into the four-dimensional input format required
by the CNN.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
def predict(img):
X = test_iter.dataset.normalize_image(img).unsqueeze(0)
pred = net(X.to(devices[0])).argmax(dim=1)
return pred.reshape(pred.shape[1], pred.shape[2])
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
def predict(img):
X = test_iter._dataset.normalize_image(img)
X = np.expand_dims(X.transpose(2, 0, 1), axis=0)
pred = net(X.as_in_ctx(devices[0])).argmax(axis=1)
return pred.reshape(pred.shape[1], pred.shape[2])
To visualize the predicted class of each pixel, we map the predicted
class back to its label color in the dataset.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
def label2image(pred):
colormap = torch.tensor(d2l.VOC_COLORMAP, device=devices[0])
X = pred.long()
return colormap[X, :]
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
def label2image(pred):
colormap = np.array(d2l.VOC_COLORMAP, ctx=devices[0], dtype='uint8')
X = pred.astype('int32')
return colormap[X, :]
Images in the test dataset vary in size and shape. Since the model uses
a transposed convolutional layer with a stride of 32, when the height or
width of an input image is not divisible by 32, the output height or width
of the transposed convolutional layer will deviate from the shape of the
input image. In order to address this issue, we can crop multiple
rectangular areas with height and width that are integer multiples of 32
in the image, and perform forward propagation on the pixels in these
areas separately. Note that the union of these rectangular areas needs
to completely cover the input image. When a pixel is covered by multiple
rectangular areas, the average of the transposed convolution outputs in
separate areas for this same pixel can be input to the softmax operation
to predict the class.
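A rough PyTorch sketch of this covering-and-averaging strategy is shown
below; the helper name ``predict_covered`` and its details are ours, and it
assumes the input has already been normalized, is at least
:math:`320\times 480`, and reuses ``net``, ``num_classes``, and ``devices``
from above.

.. code:: python

   def predict_covered(img, crop_h=320, crop_w=480):
       # img: normalized tensor of shape (3, height, width) with
       # height >= crop_h and width >= crop_w
       _, height, width = img.shape
       scores = torch.zeros((num_classes, height, width), device=devices[0])
       counts = torch.zeros((1, height, width), device=devices[0])
       # Crop positions whose union covers the whole image
       tops = list(range(0, height - crop_h, crop_h)) + [height - crop_h]
       lefts = list(range(0, width - crop_w, crop_w)) + [width - crop_w]
       for top in tops:
           for left in lefts:
               crop = img[:, top:top + crop_h, left:left + crop_w]
               out = net(crop.unsqueeze(0).to(devices[0]))[0]
               scores[:, top:top + crop_h, left:left + crop_w] += out
               counts[:, top:top + crop_h, left:left + crop_w] += 1
       # Softmax is monotonic, so the argmax of the averaged outputs equals
       # the argmax after applying softmax to the average
       return (scores / counts).argmax(dim=0)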
For simplicity, we only read a few larger test images, and crop a
:math:`320\times480` area for prediction starting from the upper-left
corner of an image. For these test images, we print their cropped areas,
prediction results, and ground truth row by row.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
voc_dir = d2l.download_extract('voc2012', 'VOCdevkit/VOC2012')
test_images, test_labels = d2l.read_voc_images(voc_dir, False)
n, imgs = 4, []
for i in range(n):
crop_rect = (0, 0, 320, 480)
X = torchvision.transforms.functional.crop(test_images[i], *crop_rect)
pred = label2image(predict(X))
imgs += [X.permute(1,2,0), pred.cpu(),
torchvision.transforms.functional.crop(
test_labels[i], *crop_rect).permute(1,2,0)]
d2l.show_images(imgs[::3] + imgs[1::3] + imgs[2::3], 3, n, scale=2);
.. figure:: output_fcn_ce3435_129_0.svg
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
voc_dir = d2l.download_extract('voc2012', 'VOCdevkit/VOC2012')
test_images, test_labels = d2l.read_voc_images(voc_dir, False)
n, imgs = 4, []
for i in range(n):
crop_rect = (0, 0, 480, 320)
X = image.fixed_crop(test_images[i], *crop_rect)
pred = label2image(predict(X))
imgs += [X, pred, image.fixed_crop(test_labels[i], *crop_rect)]
d2l.show_images(imgs[::3] + imgs[1::3] + imgs[2::3], 3, n, scale=2);
.. figure:: output_fcn_ce3435_132_0.svg
Summary
-------
- The fully convolutional network first uses a CNN to extract image
features, then transforms the number of channels into the number of
classes via a :math:`1\times 1` convolutional layer, and finally
transforms the height and width of the feature maps to those of the
input image via the transposed convolution.
- In a fully convolutional network, we can use upsampling of bilinear
interpolation to initialize the transposed convolutional layer.
Exercises
---------
1. If we use Xavier initialization for the transposed convolutional
layer in the experiment, how does the result change?
2. Can you further improve the accuracy of the model by tuning the
hyperparameters?
3. Predict the classes of all pixels in test images.
4. The original fully convolutional network paper also uses outputs of
some intermediate CNN layers :cite:`Long.Shelhamer.Darrell.2015`.
Try to implement this idea.