Short Notes On AI

MNIST

The little dataset that could

Jan 4

The MNIST (Modified National Institute Of Standards And Technology) dataset is a small but well-known dataset of handwritten digits used for benchmarking machine learning models.

In 1996, the US National Institute Of Standards And Technology (NIST) published several datasets of black and white handwritten characters (digits as well as letters such as "a" and "b") called Special Databases. To construct these datasets, NIST asked people to fill out a handwriting sample form.

In 1998, Yann LeCun, Corinna Cortes, and Christopher J.C. Burges released MNIST as a new version of the original NIST data. Specifically, LeCun, Cortes, and Burges made two improvements:

  1. NIST had initially designated Special Database 3 as the training set and Special Database 1 as the test set. However, Special Database 3 contained handwriting samples from US Census Bureau employees, while Special Database 1 contained samples from US high school students. This made Special Database 3 "much cleaner and easier to recognize" than Special Database 1, so a model was trained on one kind of handwriting and tested on another. MNIST fixed this by merging the two databases and drawing new training and test sets from the combined pool. The resulting training set contained 30,000 samples from Special Database 1 and 30,000 from Special Database 3. The test set contained 5,000 samples from Special Database 1 and 5,000 from Special Database 3.

  2. Each handwriting sample was resized and centered in a 28x28 pixel image. The anti-aliasing used during resizing introduced intermediate grey levels, which converted the images from black and white to greyscale.
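To see why resizing turns a black and white image into a greyscale one, here is a minimal sketch using NumPy. It is not the procedure LeCun, Cortes, and Burges actually used: the function name `downsample_and_center`, the block-averaging resize, the geometric centering, and the 40x40 input are all assumptions chosen to keep the example short.

```python
import numpy as np

def downsample_and_center(binary_img, factor=2, canvas_size=28):
    """Shrink a 0/1 image by block averaging, then center it on a blank canvas.

    Illustrative only: this is an assumed stand-in, not the original MNIST pipeline.
    """
    h, w = binary_img.shape
    # Crop so both sides divide evenly by the factor
    h, w = h - h % factor, w - w % factor
    blocks = binary_img[:h, :w].reshape(h // factor, factor, w // factor, factor)
    # Averaging each block turns 0/1 pixels into fractional grey levels
    small = blocks.mean(axis=(1, 3))
    # Paste the shrunken image into the middle of an empty canvas
    canvas = np.zeros((canvas_size, canvas_size))
    top = (canvas_size - small.shape[0]) // 2
    left = (canvas_size - small.shape[1]) // 2
    canvas[top:top + small.shape[0], left:left + small.shape[1]] = small
    # Rescale to 0 (blank pixel) through 255 (dark pixel)
    return np.round(canvas * 255).astype(np.uint8)

# Hypothetical 40x40 black and white "stroke": a diagonal line three pixels thick
binary = np.zeros((40, 40), dtype=np.uint8)
for i in range(40):
    binary[i, max(0, i - 1):min(40, i + 2)] = 1

grey = downsample_and_center(binary)
print(grey.shape)
print(np.unique(grey))
```

The input holds only two values, but the output also holds values between 0 and 255 wherever a block straddles the edge of the stroke. Those in-between values are the grey levels.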

One of the appealing qualities of MNIST is that each sample is intuitive. Here is the 30,000th sample in the MNIST dataset. The sample is 28x28 pixels, and you can see each pixel.

Here are the actual values of that same sample. While the original MNIST data used a slightly different scale, modern implementations of MNIST scale values from 0 (blank pixel) to 255 (dark pixel):

Notice how a higher value represents the darker parts of the original image.
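If you want to inspect pixel values like this yourself, a small helper can print any greyscale array as a grid of numbers. This is only a sketch: `pixel_grid` is a hypothetical helper name, and the 5x5 `patch` is made-up data standing in for a real 28x28 sample.

```python
import numpy as np

def pixel_grid(image):
    """Return a 2D array of 0-255 values as right-aligned rows of text."""
    return "\n".join(
        " ".join(f"{int(value):3d}" for value in row) for row in image
    )

# Hypothetical 5x5 patch (not real MNIST data): a vertical stroke with soft edges
patch = np.array([
    [0,  40, 255,  40, 0],
    [0,  60, 255,  60, 0],
    [0,  80, 255,  80, 0],
    [0,  60, 255,  60, 0],
    [0,  40, 255,  40, 0],
], dtype=np.uint8)

print(pixel_grid(patch))
```

The column of 255s marks the darkest part of the stroke, and the smaller values on either side mark its lighter edges. The same helper should work on a full 28x28 sample once the data is loaded.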

MNIST is available in most machine learning libraries. In the code below, we display the 30,000th MNIST sample using Python:

# Install libraries (the leading "!" runs a shell command from a notebook cell)
!pip install keras
!pip install tensorflow
!pip install matplotlib

# Load libraries
from keras.datasets import mnist
from matplotlib import pyplot

# Download the training data (the labels and the test split are discarded)
(train_X, _), _ = mnist.load_data()

# Plot the sample at index 30000
pyplot.imshow(train_X[30000], cmap='gray')
pyplot.show()

4 Comments
Sebastian Raschka
Writes Ahead of AI
Jan 4

Nice to find you on Substack, Chris! And that's a nice first article!

A little fun fact: The assumed error-free MNIST dataset also has dozens (if not hundreds) of label errors :P https://arxiv.org/abs/1911.00068

© 2023 Chris Albon