MNIST
The little dataset that could
The MNIST (Modified National Institute Of Standards And Technology) dataset is a small but well-known data source for benchmarking machine learning models.
In 1996, the US National Institute Of Standards And Technology (NIST) published several datasets, called Special Databases, containing samples of black and white handwritten letters (e.g., "a", "b", etc.). To construct these datasets, NIST asked people to fill out this form:
In 1998, Yann LeCun, Corinna Cortes, and Christopher J.C. Burges released MNIST as a new version of the original NIST data. Specifically, LeCun, Cortes, and Burges made two improvements:
NIST had initially designated Special Database 3 as the training set and Special Database 1 as the test set. However, Special Database 3 contained handwriting samples from US Census Bureau employees, while Special Database 1 contained samples from US high school students. This made Special Database 3 "much cleaner and easier to recognize" than Special Database 1, so a model trained on one and tested on the other was evaluated on a different kind of handwriting than it had learned from. MNIST fixed this by merging Special Database 1 and Special Database 3, then drawing new training and test sets from the combined pool. The resulting training set contains 30,000 samples from Special Database 1 and 30,000 from Special Database 3 (60,000 in total). The test set contains 5,000 samples from each (10,000 in total).
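To make the idea of a balanced split concrete, here is a minimal sketch that draws the same number of samples from two sources into non-overlapping training and test sets. It is an illustration only, not the procedure LeCun, Cortes, and Burges actually used: the function name balanced_split, the toy data, and the 1000x scaled-down counts are all assumptions made for the example.

```python
import random


def balanced_split(source_a, source_b, n_train_each, n_test_each, seed=0):
    """Draw equal numbers of samples from two sources into train and test sets."""
    rng = random.Random(seed)
    train, test = [], []
    for source in (source_a, source_b):
        # Shuffle a copy so the caller's list is left untouched.
        pool = list(source)
        if len(pool) < n_train_each + n_test_each:
            raise ValueError("source too small for the requested split")
        rng.shuffle(pool)
        # The two slices never overlap, so no sample lands in both sets.
        train.extend(pool[:n_train_each])
        test.extend(pool[n_train_each:n_train_each + n_test_each])
    # Mix the two sources together inside each set.
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# Hypothetical stand-ins for the two NIST sources, scaled down 1000x.
sd1 = [("SD-1", i) for i in range(40)]
sd3 = [("SD-3", i) for i in range(40)]

train, test = balanced_split(sd1, sd3, n_train_each=30, n_test_each=5)
print(len(train), len(test))
```

With the real counts (30,000 and 5,000 per source), the same logic would give the 60,000 training and 10,000 test samples described above.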
Each handwriting sample was resized and centered in a 28x28 pixel image. The anti-aliasing used during resizing introduced intermediate grey levels, which converted the images from black and white to greyscale.
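The centering step can be sketched in a few lines of plain Python. The sketch assumes, following the original MNIST description, that a digit already resized to fit a 20x20 box is placed so that its center of mass lands in the middle of the 28x28 image. The helper names center_of_mass and center_in_canvas are made up for this example, and the resizing and anti-aliasing step is left out.

```python
def center_of_mass(image):
    """Return the (row, col) intensity-weighted center of a 2D list of pixel values."""
    total = sum(sum(row) for row in image)
    if total == 0:
        raise ValueError("image is blank")
    row_c = sum(r * sum(row) for r, row in enumerate(image)) / total
    col_c = sum(c * v for row in image for c, v in enumerate(row)) / total
    return row_c, col_c


def center_in_canvas(glyph, size=28):
    """Paste glyph into a blank size x size canvas, center of mass near the middle."""
    row_c, col_c = center_of_mass(glyph)
    middle = (size - 1) / 2
    # Offset of the glyph's top-left corner, rounded to whole pixels.
    top = round(middle - row_c)
    left = round(middle - col_c)
    canvas = [[0] * size for _ in range(size)]
    for r, row in enumerate(glyph):
        for c, value in enumerate(row):
            rr, cc = top + r, left + c
            # Anything that would fall outside the canvas is clipped.
            if 0 <= rr < size and 0 <= cc < size:
                canvas[rr][cc] = value
    return canvas


# A 20x20 glyph holding one off-center vertical stroke.
glyph = [[0] * 20 for _ in range(20)]
for r in range(2, 18):
    glyph[r][3] = 255

canvas = center_in_canvas(glyph)
print(center_of_mass(canvas))
```

Because the offset is rounded to whole pixels, the center of mass ends up within half a pixel of the canvas center rather than exactly on it.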
One of the appealing qualities of MNIST is that each sample is intuitive. Here is the 30,000th sample in the MNIST dataset. The sample is 28x28 pixels, and you can see each pixel.
Here are the actual values of that same sample. While the original MNIST data used a slightly different scale, modern implementations of MNIST scale values from 0 (blank pixel) to 255 (dark pixel):
Notice how a higher value represents the darker parts of the original image.
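To get a feel for these values without plotting anything, here is a small sketch that works on a made-up 5x5 grid rather than a real MNIST sample. It shows two common things to do with 0 to 255 pixel values: rescale them to the 0.0 to 1.0 range, and print them as text with denser characters standing in for higher values. The helper names and the character ramp are assumptions made for the example.

```python
def to_unit_range(image):
    """Scale 0-255 pixel values to floats between 0.0 and 1.0."""
    return [[value / 255 for value in row] for row in image]


def to_ascii(image, ramp=" .:-=+*#%@"):
    """Render pixel values as text, using denser characters for higher values."""
    last = len(ramp) - 1
    return "\n".join(
        "".join(ramp[value * last // 255] for value in row) for row in image
    )


# A hypothetical 5x5 "digit" using the same 0-255 scale as MNIST.
tiny = [
    [0, 0, 255, 0, 0],
    [0, 128, 255, 0, 0],
    [0, 0, 255, 0, 0],
    [0, 0, 255, 0, 0],
    [0, 255, 255, 255, 0],
]

print(to_ascii(tiny))
print(to_unit_range(tiny)[1])
```

The same two functions would work unchanged on a 28x28 sample converted to nested lists.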
MNIST is available in most machine learning libraries. In the code below, we display the 30,000th MNIST sample using Python:
# Install libraries
!pip install keras
!pip install tensorflow
!pip install matplotlib
# Load libraries
from keras.datasets import mnist
from matplotlib import pyplot
# Download the training data
(train_X, _), _ = mnist.load_data()
# Plot the sample at index 30000 (Python indexing starts at 0)
# Note: the 'gray' colormap draws 0 as black and 255 as white
pyplot.imshow(train_X[30000], cmap='gray')
pyplot.show()
A little fun fact: The assumed error-free MNIST dataset also has dozens (if not hundreds) of label errors :P https://arxiv.org/abs/1911.00068