.. _sec_bbox:
Object Detection and Bounding Boxes
===================================
In earlier sections (e.g.,
:numref:`sec_alexnet`–:numref:`sec_googlenet`), we introduced
various models for image classification. In image classification tasks,
we assume that there is only *one* major object in the image and we only
focus on how to recognize its category. However, there are often
*multiple* objects in the image of interest. We not only want to know
their categories, but also their specific positions in the image. In
computer vision, we refer to such tasks as *object detection* (or
*object recognition*).
Object detection has been widely applied in many fields. For example,
self-driving needs to plan traveling routes by detecting the positions
of vehicles, pedestrians, roads, and obstacles in the captured video
images. Besides, robots may use this technique to detect and localize
objects of interest throughout its navigation of an environment.
Moreover, security systems may need to detect abnormal objects, such as
intruders or bombs.
In the next few sections, we will introduce several deep learning
methods for object detection. We will begin with an introduction to
*positions* (or *locations*) of objects.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
%matplotlib inline
import torch
from d2l import torch as d2l
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
%matplotlib inline
from mxnet import image, np, npx
from d2l import mxnet as d2l
npx.set_np()
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
%matplotlib inline
import tensorflow as tf
from d2l import tensorflow as d2l
.. raw:: html
.. raw:: html
We will load the sample image to be used in this section. We can see
that there is a dog on the left side of the image and a cat on the
right. They are the two major objects in this image.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
d2l.set_figsize()
img = d2l.plt.imread('../img/catdog.jpg')
d2l.plt.imshow(img);
.. figure:: output_bounding-box_d6b70e_15_0.svg
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
d2l.set_figsize()
img = image.imread('../img/catdog.jpg').asnumpy()
d2l.plt.imshow(img);
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
[21:49:41] ../src/storage/storage.cc:196: Using Pooled (Naive) StorageManager for CPU
.. figure:: output_bounding-box_d6b70e_18_1.svg
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
d2l.set_figsize()
img = d2l.plt.imread('../img/catdog.jpg')
d2l.plt.imshow(img);
.. figure:: output_bounding-box_d6b70e_21_0.svg
.. raw:: html
.. raw:: html
Bounding Boxes
--------------
In object detection, we usually use a *bounding box* to describe the
spatial location of an object. The bounding box is rectangular, which is
determined by the :math:`x` and :math:`y` coordinates of the upper-left
corner of the rectangle and the such coordinates of the lower-right
corner. Another commonly used bounding box representation is the
:math:`(x, y)`-axis coordinates of the bounding box center, and the
width and height of the box.
Here we define functions to convert between these two representations:
``box_corner_to_center`` converts from the two-corner representation to
the center-width-height presentation, and ``box_center_to_corner`` vice
versa. The input argument ``boxes`` should be a two-dimensional tensor
of shape (:math:`n`, 4), where :math:`n` is the number of bounding
boxes.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
#@save
def box_corner_to_center(boxes):
"""Convert from (upper-left, lower-right) to (center, width, height)."""
x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
cx = (x1 + x2) / 2
cy = (y1 + y2) / 2
w = x2 - x1
h = y2 - y1
boxes = torch.stack((cx, cy, w, h), axis=-1)
return boxes
#@save
def box_center_to_corner(boxes):
"""Convert from (center, width, height) to (upper-left, lower-right)."""
cx, cy, w, h = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
x1 = cx - 0.5 * w
y1 = cy - 0.5 * h
x2 = cx + 0.5 * w
y2 = cy + 0.5 * h
boxes = torch.stack((x1, y1, x2, y2), axis=-1)
return boxes
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
#@save
def box_corner_to_center(boxes):
"""Convert from (upper-left, lower-right) to (center, width, height)."""
x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
cx = (x1 + x2) / 2
cy = (y1 + y2) / 2
w = x2 - x1
h = y2 - y1
boxes = np.stack((cx, cy, w, h), axis=-1)
return boxes
#@save
def box_center_to_corner(boxes):
"""Convert from (center, width, height) to (upper-left, lower-right)."""
cx, cy, w, h = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
x1 = cx - 0.5 * w
y1 = cy - 0.5 * h
x2 = cx + 0.5 * w
y2 = cy + 0.5 * h
boxes = np.stack((x1, y1, x2, y2), axis=-1)
return boxes
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
#@save
def box_corner_to_center(boxes):
"""Convert from (upper-left, lower-right) to (center, width, height)."""
x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
cx = (x1 + x2) / 2
cy = (y1 + y2) / 2
w = x2 - x1
h = y2 - y1
boxes = tf.stack((cx, cy, w, h), axis=-1)
return boxes
#@save
def box_center_to_corner(boxes):
"""Convert from (center, width, height) to (upper-left, lower-right)."""
cx, cy, w, h = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
x1 = cx - 0.5 * w
y1 = cy - 0.5 * h
x2 = cx + 0.5 * w
y2 = cy + 0.5 * h
boxes = tf.stack((x1, y1, x2, y2), axis=-1)
return boxes
.. raw:: html
.. raw:: html
We will define the bounding boxes of the dog and the cat in the image
based on the coordinate information. The origin of the coordinates in
the image is the upper-left corner of the image, and to the right and
down are the positive directions of the :math:`x` and :math:`y` axes,
respectively.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
# Here `bbox` is the abbreviation for bounding box
dog_bbox, cat_bbox = [60.0, 45.0, 378.0, 516.0], [400.0, 112.0, 655.0, 493.0]
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
# Here `bbox` is the abbreviation for bounding box
dog_bbox, cat_bbox = [60.0, 45.0, 378.0, 516.0], [400.0, 112.0, 655.0, 493.0]
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
# Here `bbox` is the abbreviation for bounding box
dog_bbox, cat_bbox = [60.0, 45.0, 378.0, 516.0], [400.0, 112.0, 655.0, 493.0]
.. raw:: html
.. raw:: html
We can verify the correctness of the two bounding box conversion
functions by converting twice.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
boxes = torch.tensor((dog_bbox, cat_bbox))
box_center_to_corner(box_corner_to_center(boxes)) == boxes
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
tensor([[True, True, True, True],
[True, True, True, True]])
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
boxes = np.array((dog_bbox, cat_bbox))
box_center_to_corner(box_corner_to_center(boxes)) == boxes
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
array([[ True, True, True, True],
[ True, True, True, True]])
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
boxes = tf.constant((dog_bbox, cat_bbox))
box_center_to_corner(box_corner_to_center(boxes)) == boxes
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
.. raw:: html
.. raw:: html
Let’s draw the bounding boxes in the image to check if they are
accurate. Before drawing, we will define a helper function
``bbox_to_rect``. It represents the bounding box in the bounding box
format of the ``matplotlib`` package.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
#@save
def bbox_to_rect(bbox, color):
"""Convert bounding box to matplotlib format."""
# Convert the bounding box (upper-left x, upper-left y, lower-right x,
# lower-right y) format to the matplotlib format: ((upper-left x,
# upper-left y), width, height)
return d2l.plt.Rectangle(
xy=(bbox[0], bbox[1]), width=bbox[2]-bbox[0], height=bbox[3]-bbox[1],
fill=False, edgecolor=color, linewidth=2)
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
#@save
def bbox_to_rect(bbox, color):
"""Convert bounding box to matplotlib format."""
# Convert the bounding box (upper-left x, upper-left y, lower-right x,
# lower-right y) format to the matplotlib format: ((upper-left x,
# upper-left y), width, height)
return d2l.plt.Rectangle(
xy=(bbox[0], bbox[1]), width=bbox[2]-bbox[0], height=bbox[3]-bbox[1],
fill=False, edgecolor=color, linewidth=2)
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
#@save
def bbox_to_rect(bbox, color):
"""Convert bounding box to matplotlib format."""
# Convert the bounding box (upper-left x, upper-left y, lower-right x,
# lower-right y) format to the matplotlib format: ((upper-left x,
# upper-left y), width, height)
return d2l.plt.Rectangle(
xy=(bbox[0], bbox[1]), width=bbox[2]-bbox[0], height=bbox[3]-bbox[1],
fill=False, edgecolor=color, linewidth=2)
.. raw:: html
.. raw:: html
After adding the bounding boxes on the image, we can see that the main
outline of the two objects are basically inside the two boxes.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
fig = d2l.plt.imshow(img)
fig.axes.add_patch(bbox_to_rect(dog_bbox, 'blue'))
fig.axes.add_patch(bbox_to_rect(cat_bbox, 'red'));
.. figure:: output_bounding-box_d6b70e_75_0.svg
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
fig = d2l.plt.imshow(img)
fig.axes.add_patch(bbox_to_rect(dog_bbox, 'blue'))
fig.axes.add_patch(bbox_to_rect(cat_bbox, 'red'));
.. figure:: output_bounding-box_d6b70e_78_0.svg
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
fig = d2l.plt.imshow(img)
fig.axes.add_patch(bbox_to_rect(dog_bbox, 'blue'))
fig.axes.add_patch(bbox_to_rect(cat_bbox, 'red'));
.. figure:: output_bounding-box_d6b70e_81_0.svg
.. raw:: html
.. raw:: html
Summary
-------
- Object detection not only recognizes all the objects of interest in
the image, but also their positions. The position is generally
represented by a rectangular bounding box.
- We can convert between two commonly used bounding box
representations.
Exercises
---------
1. Find another image and try to label a bounding box that contains the
object. Compare labeling bounding boxes and categories: which usually
takes longer?
2. Why is the innermost dimension of the input argument ``boxes`` of
``box_corner_to_center`` and ``box_center_to_corner`` always 4?
.. raw:: html
.. raw:: html
`Discussions `__
.. raw:: html
.. raw:: html
`Discussions `__
.. raw:: html
.. raw:: html