
Recognizing Handwritten Digits With Scikit-Learn: Punam Seal

This document discusses recognizing handwritten digits using scikit-learn. It begins by introducing the problem of recognizing handwritten text and some applications like OCR and postal codes. It then loads the Digits dataset from scikit-learn, which contains images of handwritten digits. The goal is to use scikit-learn models to predict the digits in these images accurately at least 95% of the time. It prepares the data, splits it into training and validation sets, and visualizes some of the images. Then various scikit-learn models will be tested on this problem to see which performs best.


Recognizing Handwritten Digits with Scikit-Learn
Punam Seal Sep 1 · 12 min read

Recognizing handwritten text is a problem that can be traced back to
the first automatic machines that needed to recognize individual
characters in handwritten documents. Think about, for example, the ZIP
codes on letters at the post office and the automation needed to
recognize these five digits. Perfect recognition of these codes is
necessary in order to sort mail automatically and efficiently.

Included among the other applications that may come to mind is OCR
(Optical Character Recognition) software. OCR software must read
handwritten text, or pages of printed books, for general electronic
documents in which each character is well defined. But the problem of
handwriting recognition goes farther back in time, more precisely to the
early 20th Century (1920s), when Emanuel Goldberg (1881–1970)
began his studies regarding this issue and suggested that a statistical
approach would be an optimal choice.

To address this issue in Python, the scikit-learn library provides a good
example to better understand this technique, the issues involved, and the
possibility of making predictions.

OVERVIEW:

In this blog, I analyze the recognition of handwritten digits with
scikit-learn. The scikit-learn library (http://scikit-learn.org/) enables you
to approach this type of data analysis in a way that is slightly different
from the previous project. The data to be analyzed is often made up of
numerical values or strings, but it can also involve images and sounds.

The problem involves reading and interpreting an image of a handwritten
digit and predicting the numeric value it represents. So even in this case
you will have an estimator with the task of learning through a fit()
function, and once it has reached a degree of predictive capability (a
sufficiently valid model), it will produce a prediction with the predict()
function. Then we will discuss the training set and validation set, created
this time from a series of images.
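As a rough illustration of the fit() and predict() pattern described above, here is a minimal sketch. It assumes the SVC parameters that appear later in this post, and the cut-off of 100 held-out samples is an arbitrary choice made only for this example.

```python
from sklearn import datasets, svm

digits = datasets.load_digits()
X, y = digits.data, digits.target

# Hold out the last 100 samples (arbitrary cut-off, assumed for this sketch).
X_train, y_train = X[:-100], y[:-100]
X_valid, y_valid = X[-100:], y[-100:]

estimator = svm.SVC(gamma=0.001, C=100.0)  # parameters reused from later in the post
estimator.fit(X_train, y_train)            # learning step
predicted = estimator.predict(X_valid)     # prediction step

print(predicted[:10])  # first ten predicted digits
print(y_valid[:10])    # first ten true digits
```

Every scikit-learn classifier used in this post follows the same two calls, so swapping in another estimator only changes the constructor line.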

The Digits Dataset: The scikit-learn library provides numerous datasets
that are useful for testing many problems of data analysis and prediction
of the results. In this case there is a dataset of images called Digits.
This dataset consists of 1,797 images that are 8x8 pixels in size. Each
image is a handwritten digit in grayscale, as shown in Figure 1.

Figure 1. One of the 1,797 handwritten digit images that make up the Digits dataset

GOAL:

Our goal is to read and interpret an image of a handwritten digit and
predict the numeric value it represents. You can choose a
smaller training set and a different range for validation and get 100%
accurate predictions, but this may not always be the case. Perform
data analysis to test the hypothesis: accept it if the model predicts the digit
accurately at least 95% of the time, otherwise reject it. Run at least 3 cases,
each with a different range of training and validation sets.
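The accept-or-reject rule above can be sketched as a small loop. The three test sizes below are hypothetical values chosen for illustration, and the SVC parameters are the ones used later in this post; the original notebook may have used different ranges.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
THRESHOLD = 0.95  # acceptance level stated in the goal

results = {}
for test_size in (0.1, 0.2, 0.3):  # hypothetical split sizes, one per case
    x_tr, x_va, y_tr, y_va = train_test_split(
        digits.data, digits.target, test_size=test_size, random_state=0)
    clf = svm.SVC(gamma=0.001, C=100.0).fit(x_tr, y_tr)
    accuracy = clf.score(x_va, y_va)
    results[test_size] = accuracy
    verdict = "accept" if accuracy >= THRESHOLD else "reject"
    print(f"test_size={test_size}: accuracy={accuracy:.3f} -> {verdict}")
```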

TOOLS USED:

Language used- Python

IDE- Jupyter Notebook

DATA ANALYSIS:

First, I import the required libraries and then describe the data
using attributes and functions such as DESCR, dir(), size, and shape.

Import the required libraries

1. I will be using Python libraries such as NumPy and
Matplotlib.

import numpy as np
import matplotlib.pyplot as plt

Load and read the dataset


2. I imported the datasets module from sklearn, loaded the dataset with its
load_digits() function, and stored the result in the variable digits. The
returned object is a scikit-learn Bunch (a dictionary-like container), not a
DataFrame.

from sklearn import datasets


digits = datasets.load_digits()
digits

3. The DESCR attribute is a string holding the full description of the dataset,
printed here with the print() function.

print(digits.DESCR)

Out:
.. _digits_dataset:

Optical recognition of handwritten digits dataset
--------------------------------------------------

**Data Set Characteristics:**

:Number of Instances: 1797


:Number of Attributes: 64
:Attribute Information: 8x8 image of integer pixels in the range
0..16.
:Missing Attribute Values: None
:Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
:Date: July; 1998

This is a copy of the test set of the UCI ML hand-written digits datasets
https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

The data set contains images of hand-written digits: 10 classes where
each class refers to a digit.

Preprocessing programs made available by NIST were used to extract
normalized bitmaps of handwritten digits from a preprinted form. From
a total of 43 people, 30 contributed to the training set and
different 13 to the test set. 32x32 bitmaps are divided into
nonoverlapping blocks of 4x4 and the number of on pixels are counted
in each block. This generates an input matrix of 8x8 where each
element is an integer in the range 0..16. This reduces dimensionality
and gives invariance to small distortions.

For info on NIST preprocessing routines, see M. D. Garris, J. L.
Blue, G.T. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A.
Janet, and C.L. Wilson, NIST Form-Based Handprint Recognition System,
NISTIR 5469,1994.

.. topic:: References

- C. Kaynak (1995) Methods of Combining Multiple Classifiers and
Their Applications to Handwritten Digit Recognition, MSc Thesis,
Institute of Graduate Studies in Science and Engineering, Bogazici
University.
- E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.
- Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin.
Linear dimensionalityreduction using relevance weighted LDA. School
of Electrical and Electronic Engineering Nanyang Technological
University.
2005.
- Claudio Gentile. A New Approximate Maximal Margin Classification
Algorithm. NIPS. 2000.
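The description above says that each 32x32 bitmap is divided into non-overlapping 4x4 blocks and the "on" pixels in each block are counted, which gives an 8x8 matrix with values from 0 to 16. The sketch below reproduces that counting step on a synthetic random bitmap; it is not the NIST preprocessing code, only an illustration of the idea.

```python
import numpy as np

rng = np.random.default_rng(0)
bitmap = rng.integers(0, 2, size=(32, 32))  # synthetic 0/1 bitmap, not real NIST data

# View the bitmap as an 8x8 grid of 4x4 blocks, then count "on" pixels per block.
blocks = bitmap.reshape(8, 4, 8, 4)
counts = blocks.sum(axis=(1, 3))

print(counts.shape)                # (8, 8)
print(counts.min(), counts.max())  # both lie between 0 and 16
```

Because every block holds 16 pixels, no count can exceed 16, which matches the 0..16 range stated in the dataset description.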

4. dir() is a built-in function that returns a list of the attributes
and methods of any object (functions, modules, strings, lists,
dictionaries, etc.).

dir(digits)
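Besides dir(), the object returned by load_digits() can be inspected like a dictionary. The short sketch below shows this; the exact set of keys may vary between scikit-learn versions, so it is printed rather than assumed.

```python
from sklearn import datasets

digits = datasets.load_digits()

print(type(digits).__name__)   # name of the container class
print(sorted(digits.keys()))   # dictionary-style keys

# Attribute access and key access refer to the same array.
same_object = digits["data"] is digits.data
print(same_object)
```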

5. Listing the data attribute of the digits dataset.

digits.data

6. The shape attribute returns the dimensions of an array. In our
dataset, the data array has 1797 rows and 64 columns.

digits.data.shape
Out:(1797, 64)

7. Listing the target attribute of the digits dataset.

digits.target
Out: array([0, 1, 2, ..., 8, 9, 8])

8. The size attribute gives the total number of elements in an array.
In our dataset, the target array has 1797 elements.

digits.target.size
Out: 1797

Visualizing data
9. The images of the handwritten digits are contained in the digits.images
array. Each element of this array is an image represented by an
8x8 matrix of numerical values that correspond to a grayscale from
white, with a value of 0, to black, with a value of 16.

digits.images[0]

This dataset contains 1,797 elements, so you can consider the first
1,791 as a training set and use the last six as a validation set.
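A split by position like the one just described can be written with plain slicing. This is only a sketch of that idea; the variable names are made up for the example, and the rest of this post uses train_test_split instead.

```python
from sklearn import datasets

digits = datasets.load_digits()

# First 1,791 samples for training, last six for validation.
train_images = digits.images[:1791]
train_labels = digits.target[:1791]
valid_images = digits.images[1791:]
valid_labels = digits.target[1791:]

print(train_images.shape, valid_images.shape)
print(valid_labels)
```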

10. You can see these six handwritten digits in detail by using the
imshow() function, which displays data as an image, i.e. on a 2D
regular raster. The argument cmap=plt.cm.gray_r displays the image in
reversed grayscale, and interpolation='nearest' displays it without trying to
interpolate between pixels when the display resolution is not the same as the
image resolution. The title() function, used in later plots, displays a title
on the graph.

plt.figure(figsize=(10, 7))
# Show the last six images (indices 1791 to 1796) on a 3x2 grid.
for position, image_index in enumerate(range(1791, 1797), start=1):
    plt.subplot(3, 2, position)
    plt.imshow(digits.images[image_index], cmap=plt.cm.gray_r,
               interpolation='nearest')
Preparing data
11. Next, I prepare the data for training by building a NumPy array,
data, reshaped so that its first dimension equals the
number of images, i.e. the number of samples, n_samples, which I obtain with
the len() function. Each 8x8 image is flattened into a row of 64 values, so
the dimension of data will be 1797 x 64.

n_samples = len(digits.images)
n_samples
Out: 1797

12. As you can see, the array is reshaped with NumPy's
reshape() method, which gives an array a new shape without changing
its data.

data = digits.images.reshape((n_samples, -1))


data
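As a quick sanity check on the reshape step, the sketch below confirms that flattening digits.images gives the same values as the ready-made digits.data array. The -1 argument asks NumPy to infer the second dimension.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()
n_samples = len(digits.images)

# -1 lets NumPy infer the second dimension: 8 * 8 = 64 values per image.
data = digits.images.reshape((n_samples, -1))

print(data.shape)
print(np.array_equal(data, digits.data))
```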
Splitting Data into Train and Test Sets

The train_test_split function splits a single
dataset for two different purposes: training and testing.

The training subset is for building and fitting your model.

The testing subset is for using the model on unknown data to
evaluate the performance of the model.

Typically, you split your original dataset into input (x) and output (y)
arrays, then call the function passing both arrays and have them split
consistently into train and test subsets.

13. Here, we split data and digits.target, setting test_size to 0.01
and random_state to 0 so that the split is reproducible. With 1,797
samples, this leaves 18 samples in the test set.

from sklearn.model_selection import train_test_split


x_train, x_test, y_train, y_test = train_test_split(data,
digits.target, test_size=0.01, random_state=0)
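To see what a test_size of 0.01 means in practice, the following sketch repeats the split and prints the resulting shapes. It rebuilds data from digits.images so that it runs on its own.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
data = digits.images.reshape((len(digits.images), -1))

x_train, x_test, y_train, y_test = train_test_split(
    data, digits.target, test_size=0.01, random_state=0)

# 1% of 1,797 samples is rounded up to a whole number of test samples.
print(x_train.shape, x_test.shape)
```

A test set this small makes the accuracy estimate very coarse, since a single wrong prediction changes the score by several percentage points.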

Classifying the Model

This is the final stage, where I fit each of several
classifiers, make predictions, and print the
Classification Report, the Confusion Matrix, and the Accuracy Score.

I used three classifiers from sklearn:

Support Vector Machine

Gaussian Naive Bayes

K Nearest Neighbours (KNN)

Let's go through each classifier in turn.

1. Support Vector Machine


Support vector machines (SVMs) are a set of supervised learning
methods used for classification, regression and outliers detection.

14. The estimator is the class sklearn.svm.SVC, which implements
support vector classification. The estimator's constructor takes the
model's parameters as arguments. Here I import the svm and
metrics modules and define the SVM classifier as svc_classifier.

from sklearn import svm, metrics


svc_classifier = svm.SVC(gamma=0.001, C = 100.)
svc_classifier
Out: SVC(C=100.0, gamma=0.001)

15. The svc_classifier estimator instance is first fitted to
the data; i.e., it must learn from the training data. This is done by passing our
training set to the fit method.

svc_classifier.fit(x_train, y_train)
Out: SVC(C=100.0, gamma=0.001)

16. By using the predict() function, you can test the estimator by making it
interpret the digits of the test set. The result is stored as svc_y_pred.

svc_y_pred = svc_classifier.predict(x_test)
svc_y_pred
Out: array([2, 8, 2, 6, 6, 7, 1, 9, 8, 5, 2, 8, 6, 6, 6, 6, 1, 0])
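One simple way to judge the predicted array is to compare it element by element with the true labels. The sketch below does that and checks that the hand-computed accuracy agrees with the classifier's score() method; it repeats the split and fit so that it runs on its own.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.01, random_state=0)

svc_classifier = svm.SVC(gamma=0.001, C=100.0).fit(x_train, y_train)
svc_y_pred = svc_classifier.predict(x_test)

matches = svc_y_pred == y_test    # True where the prediction is correct
manual_accuracy = matches.mean()  # fraction of correct predictions
print("correct:", matches.sum(), "of", matches.size)
print("misclassified indices:", np.flatnonzero(~matches))
```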

17. Now, I display the first ten images, the digits 0 to 9, together with
their labels, using the following functions:

figure() creates a new figure with a
specified size of (12,7),

zip() combines two lists for easier handling inside
the plotting loop,

enumerate() adds a counter to an iterable and returns it
as an enumerate object, and

subplot() adds a subplot to the current figure at the
specified grid position.

plt.figure(figsize=(12,7))
images_and_labels = list(zip(digits.images, digits.target))

# Plot the first ten images with their true labels on a 2x5 grid.
for index, (image, label) in enumerate(images_and_labels[:10]):
    plt.subplot(2, 5, index + 1)
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    plt.title('Training: %i' % label)

18. Plotting the images of the predicted digits from the array using the
following code.

images_and_predictions = list(zip(x_test, svc_y_pred))
plt.figure(figsize=(18,5))
# The 2x9 grid holds at most 18 plots, one per test sample.
for index, (image, prediction) in enumerate(images_and_predictions[:18]):
    plt.subplot(2, 9, index + 1)
    image = image.reshape(8, 8)  # back from 64 values to an 8x8 image
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    plt.title('Prediction: %i' % prediction)

19. The following code does two things:

It displays the classification report, using the
classification_report() function, and the confusion matrix.

It displays the accuracy score of the model, obtained using
the score() method.

# Note: plot_confusion_matrix was removed in scikit-learn 1.2;
# ConfusionMatrixDisplay.from_estimator (scikit-learn >= 1.0) replaces it.
print("\nClassification report for Support Vector Machine Classifier %s:\n%s\n"
      % (svc_classifier, metrics.classification_report(y_test, svc_y_pred)))
disp = metrics.ConfusionMatrixDisplay.from_estimator(svc_classifier, x_test, y_test)
disp.figure_.suptitle("Confusion Matrix of Support Vector Machine Classifier")
print("\nConfusion matrix of Support Vector Machine Classifier:\n%s"
      % disp.confusion_matrix)
print("\nAccuracy of the Support Vector Machine Classifier Algorithm: ",
      svc_classifier.score(x_test, y_test))
plt.show()
Classification report for Support Vector Machine Classifier

Confusion matrix of Support Vector Machine Classifier
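To make a confusion matrix easier to read, here is a small sketch on toy labels. The labels are invented for this example and are not taken from the Digits results.

```python
import numpy as np
from sklearn import metrics

y_true = [0, 1, 2, 2, 1]  # toy true labels, invented for this example
y_pred = [0, 2, 2, 2, 1]  # toy predictions with one mistake (a 1 read as a 2)

cm = metrics.confusion_matrix(y_true, y_pred)
print(cm)

# Rows are true classes and columns are predicted classes,
# so the diagonal holds the correctly classified samples.
accuracy_from_cm = np.trace(cm) / cm.sum()
print(accuracy_from_cm)
```

The single off-diagonal entry in row 1, column 2 is the sample whose true class is 1 but which was predicted as 2.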

2. Gaussian Naive Bayes


The Gaussian Naive Bayes classifier assumes that the data for each label is
drawn from a simple Gaussian distribution. Scikit-learn provides
sklearn.naive_bayes.GaussianNB to implement the Gaussian Naive
Bayes algorithm for classification.
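The Gaussian assumption means the fitted model keeps, for each class, a mean (and a variance) for every pixel. The sketch below checks one of those means by hand for the digit 3, using the theta_ attribute of a fitted GaussianNB; fitting on the full dataset here is a simplification for the example.

```python
import numpy as np
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB

digits = datasets.load_digits()
X, y = digits.data, digits.target

gnb = GaussianNB().fit(X, y)

# Per-pixel mean over all images labelled 3, computed by hand.
manual_mean = X[y == 3].mean(axis=0)
print(np.allclose(gnb.theta_[3], manual_mean))
print(gnb.theta_.shape)  # one row of 64 pixel means per class
```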
20. I import GaussianNB and define the GNB classifier
as GNB_classifier. The fit method fits the Gaussian Naive Bayes
classifier to the x and y training sets.

from sklearn.naive_bayes import GaussianNB


GNB_classifier = GaussianNB()
GNB_classifier.fit(x_train, y_train)
Out: GaussianNB()

21. By using the predict() function, you can test the estimator by making it
interpret the digits of the test set. The result is stored as GNB_y_pred.

GNB_y_pred = GNB_classifier.predict(x_test)
GNB_y_pred
Out: array([2, 8, 2, 6, 6, 7, 1, 9, 8, 5, 2, 8, 6, 6, 6, 6, 1, 0])

22. Plotting the training images as follows:

plt.figure(figsize=(12,7))
images_and_labels = list(zip(digits.images, digits.target))

# Plot the first ten images with their true labels on a 2x5 grid.
for index, (image, label) in enumerate(images_and_labels[:10]):
    plt.subplot(2, 5, index + 1)
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    plt.title('Training: %i' % label)

23. Plotting the predicted images as follows:

images_and_predictions = list(zip(x_test, GNB_y_pred))
plt.figure(figsize=(18,5))
# The 2x9 grid holds at most 18 plots, one per test sample.
for index, (image, prediction) in enumerate(images_and_predictions[:18]):
    plt.subplot(2, 9, index + 1)
    image = image.reshape(8, 8)  # back from 64 values to an 8x8 image
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    plt.title('Prediction: %i' % prediction)

24. This code will display the classification report, confusion matrix and
accuracy of the Gaussian Naive Bayes Classifier as follows:

# Note: plot_confusion_matrix was removed in scikit-learn 1.2;
# ConfusionMatrixDisplay.from_estimator (scikit-learn >= 1.0) replaces it.
print("\nClassification report for Gaussian Naive Bayes Classifier %s:\n%s\n"
      % (GNB_classifier, metrics.classification_report(y_test, GNB_y_pred)))
disp = metrics.ConfusionMatrixDisplay.from_estimator(GNB_classifier, x_test, y_test)
disp.figure_.suptitle("Confusion Matrix of Gaussian Naive Bayes Classifier")
print("\nConfusion matrix of Gaussian Naive Bayes Classifier:\n%s"
      % disp.confusion_matrix)
print("\nAccuracy of the Gaussian Naive Bayes Classifier Algorithm: ",
      GNB_classifier.score(x_test, y_test))
plt.show()
Classification report for Gaussian Naive Bayes Classifier

Confusion matrix of Gaussian Naive Bayes Classifier

3. K Nearest Neighbours (KNN)


K-NN (K-Nearest Neighbour), one of the simplest machine learning
algorithms, is non-parametric and lazy in nature. Non-parametric means
that there is no assumption about the underlying data distribution, i.e. the
model structure is determined from the dataset. Lazy or instance-based
learning means that the algorithm does not build an explicit model during
training; it stores the training data points and uses the whole training set
in the prediction phase.
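The lazy behaviour described above can be sketched in a few lines of NumPy: nothing is learned up front, and each prediction measures distances to the stored training points and takes a majority vote. The helper name knn_predict and the toy data are made up for this example; this is not how scikit-learn implements the classifier internally.

```python
from collections import Counter

import numpy as np


def knn_predict(train_x, train_y, query, k=3):
    """Classify one query point by majority vote among its k nearest neighbours."""
    distances = np.linalg.norm(train_x - query, axis=1)  # Euclidean distances
    nearest = np.argsort(distances)[:k]                  # indices of the k closest points
    votes = Counter(train_y[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data invented for this example: two well-separated clusters.
train_x = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
train_y = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(train_x, train_y, np.array([0.5, 0.5])))
print(knn_predict(train_x, train_y, np.array([5.5, 5.5])))
```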

25. sklearn.neighbors is the module that provides nearest neighbor
learning: NearestNeighbors implements the unsupervised variant, while
KNeighborsClassifier, used here, implements supervised classification. I import
KNeighborsClassifier and define the KNN classifier as
KNN_classifier. The fit method fits the K Nearest Neighbours classifier
to the x and y training sets.

from sklearn.neighbors import KNeighborsClassifier


KNN_classifier = KNeighborsClassifier(n_neighbors=5,
metric='euclidean')
KNN_classifier.fit(x_train, y_train)
Out: KNeighborsClassifier(metric='euclidean')

26. By using the predict() function, you can test the estimator by making it
interpret the digits of the test set. The result is stored as KNN_y_pred.

KNN_y_pred = KNN_classifier.predict(x_test)
KNN_y_pred
Out: array([2, 8, 2, 6, 6, 7, 1, 9, 8, 5, 2, 8, 6, 6, 6, 6, 1, 0])

27. Plotting the training images as follows:

plt.figure(figsize=(12,7))
images_and_labels = list(zip(digits.images, digits.target))

# Plot the first ten images with their true labels on a 2x5 grid.
for index, (image, label) in enumerate(images_and_labels[:10]):
    plt.subplot(2, 5, index + 1)
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    plt.title('Training: %i' % label)

28. Plotting the predicted images as follows:

images_and_predictions = list(zip(x_test, KNN_y_pred))
plt.figure(figsize=(18,5))
# The 2x9 grid holds at most 18 plots, one per test sample.
for index, (image, prediction) in enumerate(images_and_predictions[:18]):
    plt.subplot(2, 9, index + 1)
    image = image.reshape(8, 8)  # back from 64 values to an 8x8 image
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    plt.title('Prediction: %i' % prediction)

29. This code will display the classification report, confusion matrix and
accuracy of the K Nearest Neighbours Classifier as follows:

# Note: plot_confusion_matrix was removed in scikit-learn 1.2;
# ConfusionMatrixDisplay.from_estimator (scikit-learn >= 1.0) replaces it.
print("\nClassification report for K Nearest Neighbours Classifier %s:\n%s\n"
      % (KNN_classifier, metrics.classification_report(y_test, KNN_y_pred)))
disp = metrics.ConfusionMatrixDisplay.from_estimator(KNN_classifier, x_test, y_test)
disp.figure_.suptitle("Confusion Matrix of K Nearest Neighbours Classifier")
print("\nConfusion matrix of K Nearest Neighbours Classifier:\n%s"
      % disp.confusion_matrix)
print("\nAccuracy of the K Nearest Neighbours Classifier Algorithm: ",
      KNN_classifier.score(x_test, y_test))
plt.show()

Classification report for K Nearest Neighbours Classifier

Confusion matrix of K Nearest Neighbours Classifier

OBSERVATION:

print("Total overall accuracies of the Classifier Algorithms are---")
print("\nAccuracy of the Support Vector Machine Classifier Algorithm: ",
      svc_classifier.score(x_test, y_test))
print("Accuracy of the Gaussian Naive Bayes Classifier Algorithm: ",
      GNB_classifier.score(x_test, y_test))
print("Accuracy of the K Nearest Neighbours Classifier Algorithm: ",
      KNN_classifier.score(x_test, y_test))
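The same comparison can also be written as one loop over the three classifiers. This is a compact sketch of the steps in this post, using the same parameters and the same split; the dictionary names are made up for the example.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.01, random_state=0)

classifiers = {
    "Support Vector Machine": svm.SVC(gamma=0.001, C=100.0),
    "Gaussian Naive Bayes": GaussianNB(),
    "K Nearest Neighbours": KNeighborsClassifier(n_neighbors=5, metric="euclidean"),
}

# Fit each classifier on the training set and score it on the test set.
accuracies = {}
for name, clf in classifiers.items():
    accuracies[name] = clf.fit(x_train, y_train).score(x_test, y_test)
    print(f"{name}: {accuracies[name]:.3f}")
```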

GITHUB LINK:

Suven-Consultants-and-Technology-Tasks/main.ipynb at master
·…
This repository contains Online Coding Internship related to Data
Analytics using Python Domain. …
github.com

CONCLUSION:

From this analysis, I conclude that the models read and interpreted
images of handwritten digits and predicted the numeric value of each. I
trained three different classifiers, validated their predictions on the
test images, and got 100% accurate predictions.

I am thankful to the mentors at https://internship.suvenconsultants.com for
providing awesome problem statements and giving many of us a Coding
Internship Experience. Thank you, www.suvenconsultants.com.

About the author: Punam Seal is a final year B.Tech ECE engineering student.
