‘Hello World’ of Deep Learning for Computer Vision

A shift in perspective makes all the difference

Apr 6, 2019

Dense Layer Neural Network on MNIST Dataset

The MNIST dataset consists of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image of a handwritten digit, paired with a label between 0 and 9.

MNIST is often the dataset on which people build their first deep learning model when entering this field. The data science community commonly uses it as a benchmark to validate algorithms. As the saying goes,

If it doesn't work on MNIST, it won't work at all

Here, we will build two types of deep learning models using the Keras API of TensorFlow and aim to reach 99% training accuracy in fewer than 10 epochs. The first approach builds a model using only dense layers, whereas the second builds a model using only convolution layers.

Besides, instead of waiting for all 10 epochs to finish, we shall add a callback to our models so that when training accuracy reaches 99% or higher at the end of an epoch, it prints the string “Reached 99% accuracy so cancelling training!” and stops training.

Based on the number of parameters, how fast each model reaches 99% training accuracy, and its test accuracy, we can infer which type of model performs better overall on image recognition tasks.

After importing the tensorflow library and downloading the MNIST dataset (available in the tf.keras datasets API) as training and testing sets, with images as x and their labels as y, we visualize an image and print out its pixel values for a clear understanding of our data.
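
The loading and inspection step described above can be sketched roughly as follows. The helper names load_mnist and describe are hypothetical and only for illustration; the one TensorFlow call used is tf.keras.datasets.mnist.load_data, which downloads the data on first use.

```python
import numpy as np


def load_mnist():
    """Download MNIST via tf.keras; returns (x_train, y_train), (x_test, y_test)."""
    # Imported here so that describe() below can be used without TensorFlow.
    import tensorflow as tf
    return tf.keras.datasets.mnist.load_data()


def describe(images, labels):
    """Summarise an image array and its labels (hypothetical helper)."""
    return {
        "images_shape": images.shape,
        "dtype": str(images.dtype),
        "pixel_range": (int(images.min()), int(images.max())),
        "classes": sorted(int(c) for c in np.unique(labels)),
    }
```

Calling `(x_train, y_train), (x_test, y_test) = load_mnist()` followed by `print(describe(x_train, y_train))` and `print(x_train[0])` shows the shapes and the raw pixel values. To view the image itself, one option (an assumption, since the article does not name a plotting library) is matplotlib's `plt.imshow(x_train[0], cmap="gray")`.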

We are now ready to build our models.

Using only Dense Layers

Before feeding our dataset to the model, we normalize every image so that pixel values, originally integers from 0 to 255, lie between 0 and 1.
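
A minimal sketch of this normalization, assuming the images arrive as 8-bit integer arrays (the function name normalize is ours, not from the notebook):

```python
import numpy as np


def normalize(images):
    """Scale pixel values from [0, 255] to floats in [0, 1]."""
    return images.astype("float32") / 255.0
```

The same function would be applied to both the training and the test images, so that the model sees inputs on the same scale at training and evaluation time.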

We define our model as:

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),        # 28x28 image -> 784-vector
    tf.keras.layers.Dense(128, activation=tf.nn.relu),    # hidden layer
    tf.keras.layers.Dense(10, activation=tf.nn.softmax),  # one probability per digit
])

Here we also define a custom callback whose job is to stop training if we reach 99% accuracy before completing 10 epochs.
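
One way such a callback might look is sketched below. The class name StopAtAccuracy and its target argument are hypothetical; the sketch relies only on the standard tf.keras.callbacks.Callback hooks. The metric key is looked up under both 'accuracy' and 'acc', because the name depends on the TensorFlow version.

```python
import tensorflow as tf


class StopAtAccuracy(tf.keras.callbacks.Callback):
    """Stop training once training accuracy reaches `target` at the end of an epoch."""

    def __init__(self, target=0.99):
        super().__init__()
        self.target = target

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        # The key is 'accuracy' in TF 2.x and 'acc' in older releases.
        accuracy = logs.get("accuracy", logs.get("acc"))
        if accuracy is not None and accuracy >= self.target:
            print("\nReached 99% accuracy so cancelling training!")
            # Keras checks this flag after each epoch and ends the fit loop.
            self.model.stop_training = True
```

An instance is passed to training through the callbacks argument, for example `model.fit(x_train, y_train, epochs=10, callbacks=[StopAtAccuracy()])`.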

Now we compile the model with the Adam optimizer and the sparse categorical cross-entropy loss function, and then train it by calling model.fit, asking it to fit our training data to our training labels. This way it learns the relationship between the training images and their labels, so that when it later sees data resembling the training data, it can make a prediction for it.
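
The compile and fit step can be sketched as below. The wrapper compile_and_fit is a hypothetical convenience function; the optimizer and loss names are the ones mentioned above, and the 'accuracy' metric is an assumption needed so that accuracy is reported at all.

```python
import tensorflow as tf


def compile_and_fit(model, x_train, y_train, epochs=10, callbacks=None, verbose=1):
    """Compile with Adam and sparse categorical cross-entropy, then train."""
    model.compile(
        optimizer="adam",
        # "sparse" means the labels are plain integers 0..9, not one-hot vectors.
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    # fit() returns a History object with per-epoch loss and metrics.
    return model.fit(x_train, y_train, epochs=epochs,
                     callbacks=callbacks, verbose=verbose)
```

After training, `model.evaluate(x_test, y_test)` reports loss and accuracy on the held-out test images, which is the test accuracy quoted in this article.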

We achieve more than 99% accuracy in 7 epochs. This tells us that our neural network is about 99% accurate in classifying the training data, i.e. it found a mapping between the images and the labels that worked 99% of the time. On the unseen test images, this model achieves 97.79% accuracy.

Finally, we note the number of parameters, which is about 101k in just two layers. That’s huge. If we add a few more layers or increase the neurons per layer, the model will easily contain more than a million trainable parameters. This suggests that it is not the right approach for larger and more complex image datasets.
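
The parameter count can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The short sketch below reproduces the figure for the model above, and then counts a wider network whose layer sizes (1024 and 512) are purely hypothetical, chosen to illustrate how quickly the count grows.

```python
def dense_params(inputs, units):
    """Weights (inputs * units) plus one bias per unit."""
    return inputs * units + units


hidden = dense_params(28 * 28, 128)  # Flatten has no parameters
output = dense_params(128, 10)
total = hidden + output
print(total)  # 101770

# Hypothetical wider network: 784 -> 1024 -> 512 -> 10
wider = (dense_params(784, 1024)
         + dense_params(1024, 512)
         + dense_params(512, 10))
print(wider)  # 1333770
```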

In our next approach, we shall build a model with nearly the same number of trainable parameters.

Using only Convolution Layers

Before feeding in our dataset, apart from normalizing every image, we reshape our dataset into a 4D tensor so that the last dimension represents the number of channels, which is 1 for grayscale images. We also convert the pixel values from int to float.
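
A sketch of this preprocessing, assuming the images are a NumPy array of shape (N, 28, 28); the function name prepare_for_conv is hypothetical:

```python
import numpy as np


def prepare_for_conv(images):
    """Reshape (N, 28, 28) integer images to (N, 28, 28, 1) floats in [0, 1]."""
    # -1 lets NumPy infer the number of images; the trailing 1 is the channel axis.
    images = images.reshape(-1, 28, 28, 1)
    return images.astype("float32") / 255.0
```

The resulting shape matches the input_shape of (28, 28, 1) expected by the first convolution layer.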

For this model, we one-hot encode our labels, which changes our loss function from sparse categorical cross-entropy to categorical cross-entropy. We define our model as:
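
For illustration, one-hot encoding turns each integer label into a length-10 vector with a single 1 at the label's index. The sketch below does this with plain NumPy; the function name one_hot is ours, and Keras also provides tf.keras.utils.to_categorical for the same purpose.

```python
import numpy as np


def one_hot(labels, num_classes=10):
    """Map integer labels to rows of the identity matrix."""
    return np.eye(num_classes, dtype="float32")[labels]


encoded = one_hot(np.array([5, 0, 4]))
print(encoded.shape)  # (3, 10)
print(encoded[0])     # a 1 at index 5, zeros elsewhere
```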

model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),  # 28x28 -> 26x26
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),  # -> 24x24
    tf.keras.layers.MaxPool2D((2, 2)),                      # -> 12x12
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),  # -> 10x10
    tf.keras.layers.MaxPool2D((2, 2)),                      # -> 5x5
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),  # -> 3x3
    tf.keras.layers.Conv2D(10, (3, 3)),                     # -> 1x1, one channel per digit
    tf.keras.layers.Flatten(),                              # -> 10 values
    tf.keras.layers.Activation('softmax'),                  # -> class probabilities
])

After compiling the model (containing 98k parameters) with the Adam optimizer and the categorical cross-entropy loss function, and then fitting our training data to our training labels, we get more than 99% accuracy in just three epochs. In addition, our model achieves 99.11% accuracy (higher than the previous model) on the unseen test images.
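
The 98k figure can be verified by hand. A 3x3 convolution layer has 3 * 3 * input channels * filters weights plus one bias per filter, and pooling layers have no parameters. The sketch below also traces the spatial size through the network, assuming the Keras defaults of no padding and stride 1 for the convolutions.

```python
def conv_params(in_channels, filters, kernel=3):
    """Weights (kernel * kernel * in_channels * filters) plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters


# (input channels, filters) for the five convolution layers of the model above
conv_layers = [(1, 32), (32, 64), (64, 64), (64, 64), (64, 10)]
total = sum(conv_params(c_in, c_out) for c_in, c_out in conv_layers)
print(total)  # 98442

# A 3x3 convolution without padding shrinks each side by 2; 2x2 pooling halves it.
size = 28
for op in ["conv", "conv", "pool", "conv", "pool", "conv", "conv"]:
    size = size - 2 if op == "conv" else size // 2
print(size)  # 1
```

The final 1x1 spatial size is why the last convolution, with its 10 filters, can be flattened directly into 10 class scores without any dense layer.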

Quite distinctly, this model, which contains nearly the same number of trainable parameters, trains in fewer epochs, and achieves higher test accuracy, performs better than our previous model.

Figure: transformations of the first four images of the training dataset through the top 6 layers in the network, starting with the first convolution operation.

As we can see above, the layers go from the raw pixels of the images on the left to increasingly abstract and compact representations of those images on the right. These representations carry progressively fewer details about the original pixels of the image, but more refined information about its class. You might argue that this is a general property of neural networks, but the advantage convolution layers have over dense layers is that they learn local patterns in their input feature space, in contrast to the global patterns learned by dense layers.
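
The local versus global distinction can be made concrete with a small NumPy sketch. This is not taken from the notebook: the kernel, the dense weights and the function conv2d_valid are made up for illustration. Changing one pixel affects only the nine outputs of a 3x3 convolution whose window covers it, whereas a dense unit, which is connected to every pixel, changes too.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """Slide the kernel over the image without padding (single filter, no bias)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Each output value depends only on a kh x kw patch of the input.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


rng = np.random.default_rng(0)
image = rng.random((28, 28))
kernel = np.arange(1, 10).reshape(3, 3) / 10.0  # made-up 3x3 filter
dense_weights = np.linspace(-1.0, 1.0, 28 * 28)  # made-up weights of one dense unit

changed = image.copy()
changed[20, 20] += 1.0  # perturb a single pixel

conv_before = conv2d_valid(image, kernel)
conv_after = conv2d_valid(changed, kernel)
affected = int(np.sum(conv_before != conv_after))
dense_changed = bool(image.ravel() @ dense_weights != changed.ravel() @ dense_weights)

print(affected)       # 9 of the 676 convolution outputs changed
print(dense_changed)  # True: the dense unit sees every pixel
```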

Thus we understand the reason behind the extensive use of the CNN architecture for image recognition. It’s indeed a powerful approach!

Take a Challenge

If you want to dive deeper into this, here is a small challenge for you.

Get more than 99.25% test accuracy using the convolution layers approach with a model that has fewer than 25k parameters.

For the complete Jupyter notebook of this article, with valuable comments and references, click here.

Written by Zoheb Abai
