14.7. Single Shot Multibox Detection

In Sections 14.3 through 14.6, we introduced bounding boxes, anchor boxes, multiscale object detection, and the dataset for object detection. We are now ready to use this background to design an object detection model: single shot multibox detection (SSD) (Liu et al., 2016). This model is simple, fast, and widely used. Although it is just one of a vast number of object detection models, some of the design principles and implementation details in this section also apply to other models.

14.7.1. Model

Fig. 14.7.1 provides an overview of the design of single shot multibox detection. The model consists mainly of a base network followed by several multiscale feature map blocks. The base network extracts features from the input image, so it can be a deep CNN. For example, the original single shot multibox detection paper adopts a VGG network truncated before the classification layer (Liu et al., 2016), while ResNet has also been commonly used. We can design the base network so that it outputs larger feature maps, which generate more anchor boxes for detecting smaller objects. Each subsequent multiscale feature map block then reduces (e.g., by half) the height and width of the feature maps from the previous block, so that each unit of the feature maps has a larger receptive field on the input image.

Recall the design of multiscale object detection through layerwise representations of images by deep neural networks in Section 14.5. Since multiscale feature maps closer to the top of Fig. 14.7.1 are smaller but have larger receptive fields, they are suitable for detecting fewer but larger objects.
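As a rough illustration of this trade-off, the sketch below tabulates feature map sizes and anchor box counts across scales. The 32 x 32 base output, the three halving blocks, and the 4 anchor boxes per position are hypothetical values chosen for illustration; they are not settings stated in this section.

```python
def multiscale_anchor_counts(base_size, num_blocks, anchors_per_position):
    """Return (height, width, num_anchors) for each scale.

    Assumes square feature maps and that every multiscale block
    halves the height and width of its input (illustrative only).
    """
    scales = []
    size = base_size
    for _ in range(num_blocks + 1):  # base network output, then each block
        scales.append((size, size, size * size * anchors_per_position))
        size = max(size // 2, 1)
    return scales


# Hypothetical settings: 32 x 32 base feature map, 3 halving blocks, 4 anchors.
scales = multiscale_anchor_counts(base_size=32, num_blocks=3,
                                  anchors_per_position=4)
for h, w, n in scales:
    print(f"{h:>2} x {w:<2} feature map -> {n} anchor boxes")

total_anchors = sum(n for _, _, n in scales)
print("total anchor boxes:", total_anchors)
```

Under these assumptions, the smallest feature map contributes far fewer anchor boxes than the largest one, which matches the idea that the top scales handle fewer but larger objects.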

In a nutshell, via its base network and several multiscale feature map blocks, single shot multibox detection generates a varying number of anchor boxes of different sizes, and detects objects of varying sizes by predicting the classes and offsets of these anchor boxes (and hence the bounding boxes). It is therefore a multiscale object detection model.

Fig. 14.7.1 As a multiscale object detection model, single shot multibox detection mainly consists of a base network followed by several multiscale feature map blocks.

In the following, we will describe the implementation details of different blocks in Fig. 14.7.1. To begin with, we discuss how to implement the class and bounding box prediction.

14.7.1.1. Class Prediction Layer

Let the number of object classes be \(q\). Then anchor boxes have \(q+1\) classes, where class 0 is background. At some scale, suppose that the height and width of the feature maps are \(h\) and \(w\), respectively. When \(a\) anchor boxes are generated with each spatial position of these feature maps as their center, a total of \(hwa\) anchor boxes need to be classified. This often makes classification with fully connected layers infeasible because of the heavy parameterization cost. Recall how we used channels of convolutional layers to predict classes in Section 8.3. Single shot multibox detection uses the same technique to reduce model complexity.
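To make the cost argument concrete, here is a minimal sketch that counts parameters for the two options. The number of input channels `c` and all numeric values are hypothetical, and the \(3\times3\) kernel is just one possible choice of convolution.

```python
def num_anchor_boxes(h, w, a):
    """Total anchor boxes when a boxes are centered on each of h * w positions."""
    return h * w * a


def fc_param_count(c, h, w, a, q):
    """Parameters of one dense layer mapping the flattened c * h * w input
    to a score for each of the q + 1 classes of every anchor box."""
    in_features = c * h * w
    out_features = num_anchor_boxes(h, w, a) * (q + 1)
    return in_features * out_features + out_features  # weights + biases


def conv_param_count(c, a, q, kernel_size=3):
    """Parameters of one convolutional layer with a * (q + 1) output channels."""
    out_channels = a * (q + 1)
    return c * out_channels * kernel_size ** 2 + out_channels  # weights + biases


# Hypothetical setting: 64 input channels, 32 x 32 feature map,
# 4 anchor boxes per position, 20 object classes.
c, h, w, a, q = 64, 32, 32, 4, 20
anchors = num_anchor_boxes(h, w, a)
fc_params = fc_param_count(c, h, w, a, q)
conv_params = conv_param_count(c, a, q)
print("anchor boxes to classify:", anchors)
print("fully connected parameters:", fc_params)
print("convolutional parameters:", conv_params)
```

The convolutional count does not depend on \(h\) or \(w\), because the same kernel is shared across all spatial positions.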

Specifically, the class prediction layer uses a convolutional layer without altering width or height of feature maps. In this way, there can be a one-to-one correspondence between outputs and inputs at the same spatial dimensions (width and height) of feature maps. More concretely, channels of the output feature maps at any spatial position (\(x\), \(y\)) represent class predictions for all the anchor boxes centered on (\(x\), \(y\)) of the input feature maps. To produce valid predictions, there must be \(a(q+1)\) output channels, where for the same spatial position the output channel with index \(i(q+1) + j\) represents the prediction of the class \(j\) (\(0 \leq j \leq q\)) for the anchor box \(i\) (\(0 \leq i < a\)).
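The channel layout described above can be sketched in a few lines. The helper names `channel_index` and `anchor_and_class` are hypothetical, introduced only to illustrate the indexing rule \(i(q+1) + j\) and its inverse.

```python
def channel_index(i, j, q):
    """Output channel holding the class-j score of anchor box i,
    where 0 <= j <= q and anchor boxes are numbered from 0."""
    return i * (q + 1) + j


def anchor_and_class(channel, q):
    """Inverse mapping: recover (anchor box i, class j) from a channel index."""
    return divmod(channel, q + 1)


# Hypothetical setting: a = 3 anchor boxes per position, q = 2 object classes
# (so 3 classes including background, and a * (q + 1) = 9 output channels).
a, q = 3, 2
layout = [[channel_index(i, j, q) for j in range(q + 1)] for i in range(a)]
for i, row in enumerate(layout):
    print(f"anchor box {i}: channels {row}")
```

Each anchor box thus owns a contiguous group of \(q+1\) channels, with the background score first in each group.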

Below we define such a class prediction layer, specifying \(a\) and \(q\) via arguments num_anchors and num_classes, respectively. This layer uses a \(3\times3\) convolutional layer with a padding of 1. The width and height of the input and output of this convolutional layer remain unchanged.
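A minimal sketch of such a layer, assuming PyTorch, is shown here. The extra argument `num_inputs` (the number of input channels) and the example tensor shape are assumptions added for illustration.

```python
import torch
from torch import nn


def cls_predictor(num_inputs, num_anchors, num_classes):
    """Class prediction layer with a * (q + 1) output channels.

    A 3 x 3 kernel with padding 1 keeps the height and width unchanged.
    """
    return nn.Conv2d(num_inputs, num_anchors * (num_classes + 1),
                     kernel_size=3, padding=1)


# Hypothetical input: a batch of 2 feature maps with 8 channels, size 20 x 20.
X = torch.zeros((2, 8, 20, 20))
Y = cls_predictor(num_inputs=8, num_anchors=5, num_classes=10)(X)
print(Y.shape)
```

With 5 anchor boxes and 10 object classes, the output has \(5 \times (10 + 1) = 55\) channels, while the batch size, height, and width match those of the input.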
