.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    @d2l.add_to_class(d2l.Classifier)  #@save
    def loss(self, Y_hat, Y, averaged=True):
        Y_hat = Y_hat.reshape((-1, Y_hat.shape[-1]))
        Y = Y.reshape((-1,))
        return F.cross_entropy(
            Y_hat, Y, reduction='mean' if averaged else 'none')
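To see what the ``averaged`` flag controls, it can help to call the
underlying loss directly on a dummy batch. The snippet below is a
minimal sketch, independent of ``d2l``, assuming only that PyTorch is
installed:

.. code:: python

    import torch
    from torch.nn import functional as F

    # Hypothetical dummy batch: 4 examples, 10 classes
    Y_hat = torch.randn(4, 10)      # logits
    Y = torch.tensor([0, 3, 9, 1])  # integer class labels
    print(F.cross_entropy(Y_hat, Y, reduction='mean'))  # scalar average
    print(F.cross_entropy(Y_hat, Y, reduction='none'))  # one loss per example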
.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    @d2l.add_to_class(d2l.Classifier)  #@save
    def loss(self, Y_hat, Y, averaged=True):
        Y_hat = Y_hat.reshape((-1, Y_hat.shape[-1]))
        Y = Y.reshape((-1,))
        fn = gluon.loss.SoftmaxCrossEntropyLoss()
        l = fn(Y_hat, Y)
        return l.mean() if averaged else l
.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    @d2l.add_to_class(d2l.Classifier)  #@save
    @partial(jax.jit, static_argnums=(0, 5))
    def loss(self, params, X, Y, state, averaged=True):
        # To be used later (e.g., for batch norm)
        Y_hat = state.apply_fn({'params': params}, *X,
                               mutable=False, rngs=None)
        Y_hat = Y_hat.reshape((-1, Y_hat.shape[-1]))
        Y = Y.reshape((-1,))
        fn = optax.softmax_cross_entropy_with_integer_labels
        # The returned empty dictionary is a placeholder for auxiliary data,
        # which will be used later (e.g., for batch norm)
        return (fn(Y_hat, Y).mean(), {}) if averaged else (fn(Y_hat, Y), {})
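The Optax loss used above can also be exercised on its own, without
constructing a Flax train state. The snippet below is a minimal sketch,
assuming only that ``jax`` and ``optax`` are installed:

.. code:: python

    import jax.numpy as jnp
    import optax

    # Hypothetical dummy batch: 2 examples, 3 classes
    Y_hat = jnp.array([[2.0, 0.5, -1.0], [0.1, 1.2, 0.3]])  # logits
    Y = jnp.array([0, 1])                                    # integer labels
    per_example = optax.softmax_cross_entropy_with_integer_labels(Y_hat, Y)
    print(per_example)         # shape (2,), as with averaged=False
    print(per_example.mean())  # scalar, as with averaged=True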
.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    @d2l.add_to_class(d2l.Classifier)  #@save
    def loss(self, Y_hat, Y, averaged=True):
        Y_hat = tf.reshape(Y_hat, (-1, Y_hat.shape[-1]))
        Y = tf.reshape(Y, (-1,))
        fn = tf.keras.losses.SparseCategoricalCrossentropy(
            from_logits=True, reduction='none')
        l = fn(Y, Y_hat)
        # Honor the averaged flag, matching the other frameworks
        return tf.reduce_mean(l) if averaged else l
Training
--------
Next we train our model. We use Fashion-MNIST images, flattened to
784-dimensional feature vectors.
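Before fitting, it can help to confirm the shapes coming out of the data
loader. The snippet below is a minimal sketch for the PyTorch variant,
assuming that ``d2l.FashionMNIST`` exposes a ``train_dataloader`` method
as in the book's data module:

.. code:: python

    data = d2l.FashionMNIST(batch_size=256)
    X, y = next(iter(data.train_dataloader()))
    # Images arrive as (batch, channels, height, width); the model's
    # Flatten layer turns each image into a 784-dimensional vector
    print(X.shape, y.shape)  # expected: (256, 1, 28, 28) and (256,)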
.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    data = d2l.FashionMNIST(batch_size=256)
    model = SoftmaxRegression(num_outputs=10, lr=0.1)
    trainer = d2l.Trainer(max_epochs=10)
    trainer.fit(model, data)
.. figure:: output_softmax-regression-concise_0b22ca_52_0.svg
.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    data = d2l.FashionMNIST(batch_size=256)
    model = SoftmaxRegression(num_outputs=10, lr=0.1)
    trainer = d2l.Trainer(max_epochs=10)
    trainer.fit(model, data)
.. figure:: output_softmax-regression-concise_0b22ca_55_0.svg
.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    data = d2l.FashionMNIST(batch_size=256)
    model = SoftmaxRegression(num_outputs=10, lr=0.1)
    trainer = d2l.Trainer(max_epochs=10)
    trainer.fit(model, data)
.. figure:: output_softmax-regression-concise_0b22ca_58_0.svg
.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    data = d2l.FashionMNIST(batch_size=256)
    model = SoftmaxRegression(num_outputs=10, lr=0.1)
    trainer = d2l.Trainer(max_epochs=10)
    trainer.fit(model, data)
.. figure:: output_softmax-regression-concise_0b22ca_61_0.svg
As before, this algorithm converges to a reasonably accurate solution,
albeit this time with fewer lines of code.
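To sanity-check that claim, one can evaluate the trained model on a
held-out batch. The snippet below is a minimal sketch for the PyTorch
variant, assuming that the data module exposes a ``val_dataloader``
method as in the book:

.. code:: python

    X, y = next(iter(data.val_dataloader()))
    preds = model(X).argmax(axis=1)  # most likely class per image
    acc = (preds.type(y.dtype) == y).float().mean()
    print(f'validation accuracy on one batch: {float(acc):.3f}')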
Summary
-------
High-level APIs are very convenient at hiding potentially dangerous
aspects, such as numerical stability, from their users. Moreover, they
allow users to design models concisely, in very few lines of code.
This is both a blessing and a curse. The obvious benefit is that it
makes things highly accessible, even to engineers who never took a
single class in statistics in their life (in fact, they are part of the
target audience of this book). But hiding the sharp edges also comes at
a price: a disincentive to add new and different components on your own,
since there is little muscle memory for doing so. Moreover, it makes it
more difficult to *fix* things whenever the protective padding of a
framework fails to cover all the corner cases entirely. Again, this is
due to a lack of familiarity.
As such, we strongly urge you to review *both* the bare-bones and the
elegant versions of many of the implementations that follow. While we
emphasize ease of understanding, the implementations are nonetheless
usually quite performant (convolutions are the big exception here). It
is our intention to allow you to build on these when you invent
something new that no framework can give you.
Exercises
---------
1. Deep learning uses many different number formats, including FP64
   double precision (used extremely rarely), FP32 single precision,
   BFLOAT16 (good for compressed representations), FP16 (very unstable),
   TF32 (a new format from NVIDIA), and INT8. Compute the smallest and
   largest argument of the exponential function for which the result
   does not lead to numerical underflow or overflow. A starting-point
   sketch appears after this list.
2. INT8 is a very limited format consisting of nonzero numbers from
:math:`1` to :math:`255`. How could you extend its dynamic range
without using more bits? Do standard multiplication and addition
still work?
3. Increase the number of epochs for training. Why might the validation
accuracy decrease after a while? How could we fix this?
4. What happens as you increase the learning rate? Compare the loss
curves for several learning rates. Which one works better? When?
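For Exercise 1, the sketch below is one possible starting point. It
covers only the formats whose limits NumPy and PyTorch expose directly
(FP64, FP32, FP16, and BFLOAT16 via ``torch.finfo``); TF32 and INT8
would need hardware- or framework-specific handling.

.. code:: python

    import numpy as np
    import torch

    for dtype in (np.float64, np.float32, np.float16):
        fi = np.finfo(dtype)
        # exp(x) overflows once x exceeds log(max) and underflows to zero
        # (ignoring subnormals) once x drops below log(tiny)
        print(np.dtype(dtype).name,
              float(np.log(fi.max)), float(np.log(fi.tiny)))

    bf = torch.finfo(torch.bfloat16)
    print('bfloat16', float(np.log(bf.max)), float(np.log(bf.tiny)))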