.. raw:: html

For a warmup consider the following toy problem: we want to generate a random matrix and multiply it by itself. Let’s do that both in NumPy and in PyTorch to see the difference. Note that the PyTorch ``tensor`` is defined on a GPU.

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    # Warmup for GPU computation
    device = d2l.try_gpu()
    a = torch.randn(size=(1000, 1000), device=device)
    b = torch.mm(a, a)

    with d2l.Benchmark('numpy'):
        for _ in range(10):
            a = numpy.random.normal(size=(1000, 1000))
            b = numpy.dot(a, a)

    with d2l.Benchmark('torch'):
        for _ in range(10):
            a = torch.randn(size=(1000, 1000), device=device)
            b = torch.mm(a, a)

.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    numpy: 1.4693 sec
    torch: 0.0022 sec

The PyTorch benchmark is orders of magnitude faster. The NumPy dot product is executed on the CPU while the PyTorch matrix multiplication is executed on the GPU, so the latter is expected to be much faster. But the huge time difference suggests that something else must be going on. By default, GPU operations are asynchronous in PyTorch. Forcing PyTorch to finish all computation prior to returning shows what happened previously: computation is being executed by the backend while the frontend returns control to Python.

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    with d2l.Benchmark():
        for _ in range(10):
            a = torch.randn(size=(1000, 1000), device=device)
            b = torch.mm(a, a)
        torch.cuda.synchronize(device)

.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    Done: 0.0058 sec

Broadly speaking, PyTorch has a frontend for direct interaction with users, e.g., via Python, as well as a backend used by the system to perform the computation. As shown in :numref:`fig_frontends`, users can write PyTorch programs in various frontend languages, such as Python and C++. Regardless of the frontend programming language used, the execution of PyTorch programs occurs primarily in the C++ backend. Operations issued by the frontend language are passed on to the backend for execution. The backend manages its own threads that continuously collect and execute queued tasks. Note that for this to work the backend must be able to keep track of the dependencies between the various steps in the computational graph. Hence, it is not possible to parallelize operations that depend on each other.

.. raw:: html

.. raw:: html

For a warmup consider the following toy problem: we want to generate a random matrix and multiply it by itself. Let’s do that both in NumPy and in ``mxnet.np`` to see the difference.

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    with d2l.Benchmark('numpy'):
        for _ in range(10):
            a = numpy.random.normal(size=(1000, 1000))
            b = numpy.dot(a, a)

    with d2l.Benchmark('mxnet.np'):
        for _ in range(10):
            a = np.random.normal(size=(1000, 1000))
            b = np.dot(a, a)

.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    numpy: 0.8850 sec
    mxnet.np: 0.0164 sec
    [21:49:14] ../src/storage/storage.cc:196: Using Pooled (Naive) StorageManager for CPU

The MXNet benchmark is orders of magnitude faster. Since both are executed on the same processor, something else must be going on. Forcing MXNet to finish all the backend computation prior to returning shows what happened previously: computation is executed by the backend while the frontend returns control to Python.

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    with d2l.Benchmark():
        for _ in range(10):
            a = np.random.normal(size=(1000, 1000))
            b = np.dot(a, a)
        npx.waitall()

.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    Done: 1.4073 sec

Broadly speaking, MXNet has a frontend for direct interaction with users, e.g., via Python, as well as a backend used by the system to perform the computation. As shown in :numref:`fig_frontends`, users can write MXNet programs in various frontend languages, such as Python, R, Scala, and C++. Regardless of the frontend programming language used, the execution of MXNet programs occurs primarily in the C++ backend. Operations issued by the frontend language are passed on to the backend for execution. The backend manages its own threads that continuously collect and execute queued tasks. Note that for this to work the backend must be able to keep track of the dependencies between the various steps in the computational graph. Hence, it is not possible to parallelize operations that depend on each other.

.. raw:: html
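
The frontend/backend split described above can be imitated with nothing but the Python standard library. The following is a minimal sketch, not how PyTorch or MXNet are implemented: ``ToyBackend``, ``submit``, ``wait_all``, and ``slow_square`` are hypothetical names introduced only for illustration. A single worker thread executes queued tasks in the order they were issued, which is the simplest way to respect dependencies between them.

.. code:: python

    import queue
    import threading
    import time


    class ToyBackend:
        """Hypothetical stand-in for an asynchronous backend (illustration only)."""

        def __init__(self):
            self.tasks = queue.Queue()
            self.results = {}
            worker = threading.Thread(target=self._run, daemon=True)
            worker.start()

        def _run(self):
            # Backend thread: continuously collect and execute queued tasks.
            while True:
                name, fn = self.tasks.get()
                self.results[name] = fn(self.results)
                self.tasks.task_done()

        def submit(self, name, fn):
            # Frontend call: enqueue the task and return immediately.
            self.tasks.put((name, fn))

        def wait_all(self):
            # Block until every queued task has been executed.
            self.tasks.join()


    def slow_square(value, delay=0.05):
        time.sleep(delay)  # pretend this is an expensive computation
        return value * value


    backend = ToyBackend()
    start = time.perf_counter()
    backend.submit('a', lambda results: slow_square(3))
    backend.submit('b', lambda results: results['a'] + 1)  # depends on 'a'
    issue_time = time.perf_counter() - start
    backend.wait_all()
    total_time = time.perf_counter() - start

    print(f'issued in {issue_time:.4f} sec, finished in {total_time:.4f} sec')
    print(backend.results['b'])

Issuing the two tasks returns almost immediately, whereas the total time includes the simulated computation. This is the same effect that made the unsynchronized benchmark look implausibly fast.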

.. raw:: html


.. raw:: html

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    x = torch.ones((1, 2), device=device)
    y = torch.ones((1, 2), device=device)
    z = x * y + 2
    z

.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    tensor([[3., 3.]], device='cuda:0')

.. raw:: html

.. raw:: html

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    x = np.ones((1, 2))
    y = np.ones((1, 2))
    z = x * y + 2
    z

.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    array([[3., 3.]])

.. raw:: html
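
To make the dependency tracking concrete, the toy computation ``z = x * y + 2`` can be written down as a small graph and grouped into stages that could run in parallel. This is only a sketch using Python's standard ``graphlib`` module (Python 3.9 or later); the node names ``prod`` and ``z`` are labels chosen for illustration, and neither framework's scheduler is claimed to work this way.

.. code:: python

    from graphlib import TopologicalSorter

    # Map each node to the set of nodes it depends on.
    dependencies = {
        'x': set(),
        'y': set(),
        'prod': {'x', 'y'},  # prod = x * y
        'z': {'prod'},       # z = prod + 2
    }

    sorter = TopologicalSorter(dependencies)
    sorter.prepare()

    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # nodes whose inputs are all available
        stages.append(ready)
        sorter.done(*ready)

    print(stages)  # [['x', 'y'], ['prod'], ['z']]

Only ``x`` and ``y`` share a stage. The product and the final addition each depend on the previous step, so they cannot be parallelized with it.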

.. raw:: html


.. raw:: html

There are a number of operations that will force Python to wait for completion:

- Most obviously, ``npx.waitall()`` waits until all computation has completed, regardless of when the compute instructions were issued. In practice it is a bad idea to use this operator unless absolutely necessary, since it can lead to poor performance.
- If we just want to wait until a specific variable is available, we can call ``z.wait_to_read()``. In this case MXNet blocks the return to Python until the variable ``z`` has been computed. Other computation may well continue afterwards.

Let’s see how this works in practice.

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    with d2l.Benchmark('waitall'):
        b = np.dot(a, a)
        npx.waitall()

    with d2l.Benchmark('wait_to_read'):
        b = np.dot(a, a)
        b.wait_to_read()

.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    waitall: 0.0180 sec
    wait_to_read: 0.0189 sec

Both operations take approximately the same time to complete. Besides the obvious blocking operations, we recommend that you be aware of *implicit* blockers. Printing a variable clearly requires the variable to be available and is thus a blocker. Lastly, conversions to NumPy via ``z.asnumpy()`` and conversions to scalars via ``z.item()`` are blocking, since NumPy has no notion of asynchrony. It needs access to the values just like the ``print`` function.

Frequently copying small amounts of data from MXNet’s scope to NumPy and back can destroy the performance of otherwise efficient code, since each such operation requires the computational graph to evaluate all intermediate results needed to get the relevant term *before* anything else can be done.

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    with d2l.Benchmark('numpy conversion'):
        b = np.dot(a, a)
        b.asnumpy()

    with d2l.Benchmark('scalar conversion'):
        b = np.dot(a, a)
        b.sum().item()

.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    numpy conversion: 0.0340 sec
    scalar conversion: 0.0445 sec

.. raw:: html
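
The difference between explicit and implicit blockers can be mimicked with ``concurrent.futures`` from the standard library. The sketch below is an analogy rather than MXNet code: ``LazyValue`` and ``slow_square`` are hypothetical names, and the methods ``wait_to_read`` and ``item`` merely borrow the MXNet names to show where a wait would happen. Note that building a printable representation has to wait for the result, just as printing an array does.

.. code:: python

    import time
    from concurrent.futures import ThreadPoolExecutor


    class LazyValue:
        """Hypothetical handle to a result that may not be computed yet."""

        def __init__(self, future):
            self._future = future

        def wait_to_read(self):
            # Explicit blocker: wait for this one value only.
            self._future.result()
            return self

        def item(self):
            # Implicit blocker: the caller needs the actual number.
            return self._future.result()

        def __repr__(self):
            # Implicit blocker: printing needs the value as well.
            return f'LazyValue({self._future.result()})'


    def slow_square(value, delay=0.05):
        time.sleep(delay)  # pretend this is an expensive computation
        return value * value


    with ThreadPoolExecutor(max_workers=1) as pool:
        start = time.perf_counter()
        z = LazyValue(pool.submit(slow_square, 7))  # returns immediately
        issued = time.perf_counter() - start
        text = repr(z)  # blocks until the value exists
        blocked = time.perf_counter() - start

    print(f'issued after {issued:.4f} sec, printable after {blocked:.4f} sec')
    print(text, z.item())

Leaving the ``with`` block waits for all outstanding work, which plays a role loosely comparable to ``npx.waitall()``.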

.. raw:: html


.. raw:: html

On a heavily multithreaded system (even regular laptops have 4 threads or more, and on multi-socket servers this number can exceed 256) the overhead of scheduling operations can become significant. This is why it is highly desirable to have computation and scheduling occur asynchronously and in parallel. To illustrate the benefit of doing so, let’s see what happens if we increment a variable by 1 multiple times, either synchronously or asynchronously. We simulate synchronous execution by inserting a ``wait_to_read`` barrier after each addition.

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    with d2l.Benchmark('synchronous'):
        for _ in range(10000):
            y = x + 1
            y.wait_to_read()

    with d2l.Benchmark('asynchronous'):
        for _ in range(10000):
            y = x + 1
        npx.waitall()

.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    synchronous: 3.1623 sec
    asynchronous: 0.9288 sec

A slightly simplified interaction between the Python frontend thread and the C++ backend thread can be summarized as follows:

1. The frontend orders the backend to insert the computation task ``y = x + 1`` into the queue.
2. The backend then receives the computation tasks from the queue and performs the actual computations.
3. The backend then returns the computation results to the frontend.

Assume that the durations of these three stages are :math:`t_1, t_2` and :math:`t_3`, respectively. If we do not use asynchronous programming, the total time taken to perform 10000 computations is approximately :math:`10000 (t_1+ t_2 + t_3)`. If asynchronous programming is used, the total time taken to perform 10000 computations can be reduced to :math:`t_1 + 10000 t_2 + t_3` (assuming :math:`10000 t_2 > 9999t_1`), since the frontend does not have to wait for the backend to return computation results for each loop.

.. raw:: html
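
The two cost formulas above are easy to evaluate for concrete numbers. The stage durations used below (10, 90, and 10 microseconds) are made-up values chosen only so that the stated assumption holds; they are not measurements, and the function names are hypothetical.

.. code:: python

    def synchronous_total(n, t1, t2, t3):
        # Every iteration pays for all three stages in turn.
        return n * (t1 + t2 + t3)


    def asynchronous_total(n, t1, t2, t3):
        # The formula from the text is only valid when the backend is the bottleneck.
        if not n * t2 > (n - 1) * t1:
            raise ValueError('assumption n * t2 > (n - 1) * t1 does not hold')
        return t1 + n * t2 + t3


    n = 10000
    t1, t2, t3 = 10, 90, 10  # hypothetical stage durations in microseconds

    sync_us = synchronous_total(n, t1, t2, t3)
    async_us = asynchronous_total(n, t1, t2, t3)
    print(sync_us, async_us)  # 1100000 900020

With these numbers the asynchronous schedule saves roughly the time the frontend would otherwise spend issuing tasks and waiting for results, i.e. close to :math:`9999 (t_1 + t_3)`.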

.. raw:: html