Where the class variable is y and the dependent feature vector of size n is X, where
X = (x_1, x_2, …, x_n)
To make this concrete, an example of a feature vector and its corresponding class
variable is:
X = (Rainy, Hot, High, False)
y = No
Here, P(y|X) indicates the probability of not playing, given a set of weather
conditions as follows:
Rainy outlook
Hot temperature
High humidity
No wind
Naïve Assumption
Applying the naïve assumption to Bayes' Theorem means assuming that the features are
independent of one another. The evidence can now be split into its independent parts.
If any two events, A and B, are independent:
P(A,B) = P(A)P(B)
The result is:
P(y | x_1, …, x_n) = [P(y) P(x_1 | y) P(x_2 | y) … P(x_n | y)] / [P(x_1) P(x_2) … P(x_n)]
For a given input, the denominator stays constant, so that term can be removed:
P(y | x_1, …, x_n) ∝ P(y) P(x_1 | y) P(x_2 | y) … P(x_n | y)
The next step is creating a classifier model. We determine the probability of a given
input set for every possible value of the class variable y and choose the output with the
maximum probability. This is mathematically expressed as:
y = argmax over y of P(y) P(x_1 | y) P(x_2 | y) … P(x_n | y)
P(today), the probability of the observed feature values, appears in both of the
probabilities, so we can ignore it and look at the proportional probabilities:
P(Yes | today) ∝ P(Yes) P(Outlook | Yes) P(Temperature | Yes) P(Humidity | Yes) P(Windy | Yes)
and
P(No | today) ∝ P(No) P(Outlook | No) P(Temperature | No) P(Humidity | No) P(Windy | No)
Because these two numbers do not sum to 1, we can make the sum equal to 1 to convert
them into probabilities – this is called normalization:
P(Yes | today) = P(Yes | today) / (P(Yes | today) + P(No | today))
and
P(No | today) = P(No | today) / (P(Yes | today) + P(No | today))
Because the prediction is simply whichever class ends up with the larger probability,
the classifier outputs 'Yes' if P(Yes | today) > P(No | today) and 'No' otherwise.
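Before moving to Scikit-learn, here is a tiny sketch of the scoring and normalization
steps; the conditional probabilities below are hypothetical placeholders, not values
taken from a real weather table:
# naive Bayes scoring with made-up probabilities (hypothetical placeholders)
p_yes = 0.64 * 0.6 * 0.3 * 0.4 * 0.5   # P(Yes) * product of P(feature | Yes)
p_no = 0.36 * 0.4 * 0.7 * 0.8 * 0.5    # P(No) * product of P(feature | No)
# normalization: make the two scores sum to 1
total = p_yes + p_no
print("P(Yes | today) =", p_yes / total)
print("P(No | today) =", p_no / total)
print("Prediction:", "Yes" if p_yes > p_no else "No")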
Let’s see what a Python implementation of this classifier looks like using
Scikit-learn:
# load the iris dataset
from sklearn.datasets import load_iris
iris = load_iris()
# store the feature matrix (X) and response vector (y)
X = iris.data
y = iris.target
# splitting X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4,
random_state=1)
# training the model on training set
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)
# making predictions on the testing set
y_pred = gnb.predict(X_test)
# comparing actual response values (y_test) with predicted response values (y_pred)
from sklearn import metrics
print("Gaussian Naive Bayes model accuracy(in %):",
metrics.accuracy_score(y_test, y_pred)*100)
And the output is:
Gaussian Naive Bayes model accuracy(in %): 95.0
K-Nearest Neighbors
K-Nearest Neighbors, otherwise known as K-NN, is a critical yet simple
classification algorithm. It is a supervised learning algorithm used for data
mining, pattern detection, and intrusion detection.
In real-life scenarios, it is widely applicable because it is non-parametric,
meaning it makes no underlying assumptions about the data distribution.
Some prior data, the training data, is provided, and the algorithm uses its attributes
to classify new coordinates into specific groups.
Here is an example showing a table with some data points and two features:
With another set of data points, called the testing data, we can analyze the data
and allocate each point to a group. Unclassified points in the table are marked
'White.'
Intuition
If those points were plotted on a graph, we might see some groups or clusters.
Unclassified points can be assigned to a group just by seeing which group the
unclassified point’s nearest neighbors are in. That means a point near a cluster
classified ‘Red’ is more likely to be classified as ‘Red.’
Using intuition, it’s clear that the first point (2.5, 7) should have a classification
of ‘Green’ while the second one (5.5, 4.5) should be ‘Red.’
Algorithm
Here, m indicates how many training samples we have, and p is an unknown
point. Here’s what we want to do:
1. Store the training samples in arr[], an array of data points, where each
element represents a tuple (x, y).
2. For each i from 0 to m-1, calculate d(arr[i], p), the Euclidean distance from the
training point to p.
3. Let S be the set of the K smallest distances obtained; each distance
corresponds to a data point we have already classified.
4. Return the majority label among S.
We keep K as an odd number, making it easier to find a clear majority when there
are only two possible groups. As K increases, the boundaries between the
classifications become smoother and more clearly defined. And as the number of
data points in the training set increases, the accuracy of the classifier
improves.
Here’s an example program, written in C++ and Python, where 0 and 1 are the
classifiers::
// C++ program to find groups of unknown
// Points using K nearest neighbor algorithm.
#include <bits/stdc++.h>
using namespace std;

struct Point
{
    int val;         // Group of point
    double x, y;     // Co-ordinate of point
    double distance; // Distance from test point
};

// Used to sort an array of points by increasing
// order of distance
bool comparison(Point a, Point b)
{
    return (a.distance < b.distance);
}

// This function finds classification of point p using
// k nearest neighbor algorithm. It assumes only two
// groups and returns 0 if p belongs to group 0, else
// 1 (belongs to group 1).
int classifyAPoint(Point arr[], int n, int k, Point p)
{
    // Fill distances of all points from p
    for (int i = 0; i < n; i++)
        arr[i].distance =
            sqrt((arr[i].x - p.x) * (arr[i].x - p.x) +
                 (arr[i].y - p.y) * (arr[i].y - p.y));

    // Sort the Points by distance from p
    sort(arr, arr + n, comparison);

    // Now consider the first k elements and only
    // two groups
    int freq1 = 0; // Frequency of group 0
    int freq2 = 0; // Frequency of group 1
    for (int i = 0; i < k; i++)
    {
        if (arr[i].val == 0)
            freq1++;
        else if (arr[i].val == 1)
            freq2++;
    }

    return (freq1 > freq2 ? 0 : 1);
}

// Driver code
int main()
{
    int n = 17; // Number of data points
    Point arr[n];

    arr[0].x = 1;    arr[0].y = 12;   arr[0].val = 0;
    arr[1].x = 2;    arr[1].y = 5;    arr[1].val = 0;
    arr[2].x = 5;    arr[2].y = 3;    arr[2].val = 1;
    arr[3].x = 3;    arr[3].y = 2;    arr[3].val = 1;
    arr[4].x = 3;    arr[4].y = 6;    arr[4].val = 0;
    arr[5].x = 1.5;  arr[5].y = 9;    arr[5].val = 1;
    arr[6].x = 7;    arr[6].y = 2;    arr[6].val = 1;
    arr[7].x = 6;    arr[7].y = 1;    arr[7].val = 1;
    arr[8].x = 3.8;  arr[8].y = 3;    arr[8].val = 1;
    arr[9].x = 3;    arr[9].y = 10;   arr[9].val = 0;
    arr[10].x = 5.6; arr[10].y = 4;   arr[10].val = 1;
    arr[11].x = 4;   arr[11].y = 2;   arr[11].val = 1;
    arr[12].x = 3.5; arr[12].y = 8;   arr[12].val = 0;
    arr[13].x = 2;   arr[13].y = 11;  arr[13].val = 0;
    arr[14].x = 2;   arr[14].y = 5;   arr[14].val = 1;
    arr[15].x = 2;   arr[15].y = 9;   arr[15].val = 0;
    arr[16].x = 1;   arr[16].y = 7;   arr[16].val = 0;

    /* Testing Point */
    Point p;
    p.x = 2.5;
    p.y = 7;

    // Parameter to decide group of the testing point
    int k = 3;

    printf("The value classified to unknown point"
           " is %d.\n", classifyAPoint(arr, n, k, p));

    return 0;
}
And the output is:
The value classified to the unknown point is 0.
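And here is a rough Python sketch of the same logic, using the same seventeen training
points and the same test point; the function name classify_a_point is just illustrative:
# Python sketch of the same K-nearest neighbor classifier
import math

def classify_a_point(points, p, k=3):
    # points is a list of (x, y, group) tuples; p is the (x, y) test point
    distances = []
    for x, y, group in points:
        d = math.sqrt((x - p[0]) ** 2 + (y - p[1]) ** 2)
        distances.append((d, group))
    # keep the k nearest neighbours and count each group's frequency
    distances.sort()
    freq0 = sum(1 for _, group in distances[:k] if group == 0)
    freq1 = k - freq0
    return 0 if freq0 > freq1 else 1

points = [(1, 12, 0), (2, 5, 0), (5, 3, 1), (3, 2, 1), (3, 6, 0),
          (1.5, 9, 1), (7, 2, 1), (6, 1, 1), (3.8, 3, 1), (3, 10, 0),
          (5.6, 4, 1), (4, 2, 1), (3.5, 8, 0), (2, 11, 0), (2, 5, 1),
          (2, 9, 0), (1, 7, 0)]

print("The value classified to unknown point is",
      classify_a_point(points, (2.5, 7), k=3))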
Support Vector Machine Learning Algorithm
Support vector machines, or SVMs, are supervised learning algorithms used to
analyze data for regression and classification analysis.
The SVM is classed as a discriminative classifier, formally defined by a
separating hyperplane. In simple terms, given labeled training data (the supervised
learning part), the output is an optimal hyperplane that takes new examples
and categorizes them.
SVMs represent examples as points in space. Each is mapped so that the separate
categories have a wide, clear gap separating them. As well as being used for
linear classification, an SVM can also be used for efficient non-linear
classification, with the inputs implicitly mapped into high-dimensional feature
spaces.
So, what does an SVM do?
When you provide an SVM algorithm with training examples, each one labeled
as belonging to one of two categories, it will build a model that assigns
newly provided examples to one of those categories. This makes it a
'non-probabilistic binary linear classifier.'
To illustrate this, we'll build an SVM model with ML tools like Scikit-learn
to classify the UCI cancer dataset. For this, you need to have NumPy, Pandas,
Matplotlib, and Scikit-learn installed.
First, the dataset must be created:
# importing scikit-learn's make_blobs
from sklearn.datasets import make_blobs
# creating dataset X containing n_samples
# Y containing two classes
X, Y = make_blobs(n_samples=500, centers=2,
                  random_state=0, cluster_std=0.40)
import matplotlib.pyplot as plt
# plotting scatters
plt.scatter(X[:, 0], X[:, 1], c=Y, s=50, cmap='spring');
plt.show()
Support vector machines don't just draw lines between classes. They also
consider a region around the line of a specified width. Here's an example:
# creating line space between -1 and 3.5
import numpy as np
xfit = np.linspace(-1, 3.5)
# plotting scatter
plt.scatter(X[:, 0], X[:, 1], c=Y, s=50, cmap='spring')
# plot a line between the different sets of data
for m, b, d in [(1, 0.65, 0.33), (0.5, 1.6, 0.55), (-0.2, 2.9, 0.2)]:
    yfit = m * xfit + b
    plt.plot(xfit, yfit, '-k')
    plt.fill_between(xfit, yfit - d, yfit + d, edgecolor='none',
                     color='#AAAAAA', alpha=0.4)
plt.xlim(-1, 3.5)
plt.show()
Importing Datasets
The intuition behind SVMs is to optimize a linear discriminant model that represents
the perpendicular distance between the datasets. Let's use our training data to train
the model. Before we do that, the cancer dataset needs to be imported as a CSV file,
and we will train on two of the features:
# importing required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# reading the csv file and extracting the class column into y
x = pd.read_csv("C:\...\cancer.csv")
a = np.array(x)
y = a[:,30]  # classes having 0 and 1
# extracting two features
x = np.column_stack((x.malignant, x.benign))
# 569 samples and 2 features
x.shape
print(x)
print(y)
The output would be:
[[ 122.8 1001. ]
[ 132.9 1326. ]
[ 130. 1203. ]
...,
[ 108.3 858.1 ]
[ 140.1 1265. ]
[ 47.92 181. ]]
array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 1., 1., 1., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 1., 1., 1.,
1., 0., 0., 1., 0., 0., 1., 1., 1., 1., 0., 1., ....,
1.])
Fitting Support Vector Machines
Next, we need to fit the classifier to the points.
# import support vector classifier
# "Support Vector Classifier"
from sklearn.svm import SVC
clf = SVC(kernel='linear')
# fitting x samples and y classes
clf.fit(x, y)
Once the model is fitted, it can be used to make predictions on new values:
clf.predict([[120, 990]])
array([ 0.])
clf.predict([[85, 550]])
array([ 1.])
With the data loaded and pre-processed, Matplotlib can be used to visualize the
optimal hyperplane the fitted classifier has learned.
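Here is a minimal sketch of that visualization, reusing the clf, x, and y objects
created above and evaluating the classifier's decision_function over a grid; the
plotting details are illustrative choices rather than anything prescribed by the dataset:
# plotting the fitted hyperplane and its margins (an illustrative sketch)
import numpy as np
import matplotlib.pyplot as plt

plt.scatter(x[:, 0], x[:, 1], c=y, s=30, cmap='spring')
ax = plt.gca()
xlim = ax.get_xlim()
ylim = ax.get_ylim()

# evaluate the decision function on a grid covering the plot
xx = np.linspace(xlim[0], xlim[1], 50)
yy = np.linspace(ylim[0], ylim[1], 50)
YY, XX = np.meshgrid(yy, xx)
grid = np.vstack([XX.ravel(), YY.ravel()]).T
Z = clf.decision_function(grid).reshape(XX.shape)

# decision boundary (level 0) and the margins (levels -1 and 1)
ax.contour(XX, YY, Z, colors='k', levels=[-1, 0, 1],
           alpha=0.5, linestyles=['--', '-', '--'])
plt.show()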
Linear Regression Machine Learning Algorithm
No doubt you have heard of linear regression, as it is one of the most popular
machine learning algorithms. It is a statistical method used to model
relationships between one dependent variable and a specified set of independent
variables. For this section, the dependent variable is referred to as response and
the independent variables as features.
Simple Linear Regression
This is the most basic form of linear regression and is used to predict responses
from single features. Here, the assumption is that there is a linear relationship
between two variables. As such, we want a linear function that can predict the
response (y) value as accurately as it can as a function of the independent (x)
variable or feature.
Let’s look at a dataset with a response value for each feature:
x 0 1 2 3 4 5 6 7 8 9
y 1 3 2 5 7 8 8 9 10 12
We define the following for generality:
x is defined as a feature vector, i.e. x = [x_1, x_2, …., x_n]
y is defined as a response vector, i.e. y = [y_1, y_2, …., y_n]
These are defined for n observations, i.e. n=10.
The model’s task is to find the best-fitting line to predict responses for new
feature values, i.e., features not already in the dataset. This line is known as
the regression line, and its equation is represented as:
h(x_i) = b_0 + b_1 * x_i
Here:
h(x_i) represents the ith observation’s predicted response
b_0 and b_1 are both coefficients, representing the y-intercept and
the regression line slope, respectively.
Creating the model requires that we estimate or learn the coefficients’ values;
once they have been estimated, the model can predict responses.
We will be making use of the principle of least squares, so consider the
following:
y_i = b_0 + b_1 * x_i + e_i
In this, e_i is the ith observation’s residual error, and we want to minimize the
total residual error.
The cost function, or squared error, is defined as:
J(b_0, b_1) = (1 / 2n) * sum of e_i^2 for i = 1 to n
We want to find the b_0 and b_1 values for which J(b_0, b_1) is a minimum.
Without dragging you through all the mathematical details, this is the result:
b_1 = SS_xy / SS_xx and b_0 = y_mean - b_1 * x_mean
where SS_xy = sum of (x_i - x_mean) * (y_i - y_mean), SS_xx = sum of (x_i - x_mean)^2,
and x_mean and y_mean are the means of the x and y vectors.
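To see these formulas in action, here is a minimal sketch (using NumPy and the small
dataset from the table above) that estimates b_0 and b_1 directly; it is plain
least-squares arithmetic, not code taken from the original text:
import numpy as np

# data from the table above
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

# means of x and y
x_mean, y_mean = np.mean(x), np.mean(y)

# SS_xy and SS_xx as defined above
ss_xy = np.sum((x - x_mean) * (y - y_mean))
ss_xx = np.sum((x - x_mean) ** 2)

# estimated coefficients
b_1 = ss_xy / ss_xx
b_0 = y_mean - b_1 * x_mean
print("b_0 =", b_0, "b_1 =", b_1)
For this data, the intercept comes out at roughly 1.24 and the slope at roughly 1.17.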
The same idea extends to multiple linear regression, where the predicted response
depends on p features:
h(x_i) = b_0 + b_1 * x_i1 + b_2 * x_i2 + … + b_p * x_ip
Where h(x_i) is the ith observation’s predicted response, and the regression
coefficients are b_0, b_1, …, b_p.
We can also write, taking x_i0 = 1 for every observation:
h(x_i) = sum of b_j * x_ij for j = 0 to p
And the linear model can now be expressed as the following in terms of
matrices:
y = Xb + e
where
y = [y_1, y_2, …, y_n]' is the response vector and X is the n x (p + 1) matrix
whose ith row is [1, x_i1, x_i2, …, x_ip]
and
b = [b_0, b_1, …, b_p]' is the coefficient vector, with e = [e_1, e_2, …, e_n]'
the vector of residual errors.
The estimate of b can now be determined using the least squares method, which
finds the b for which the total residual error is minimized. The result
is:
b' = (X'X)^-1 X'y
Where ' represents the matrix transpose, and the inverse is represented by -1.
Now that the least-squares estimate b' is known, the multiple linear regression
model's predictions can be written as:
y' = Xb'
where y' is the vector of predicted responses.
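As a quick illustration of the matrix form, here is a minimal NumPy sketch with a small
made-up design matrix (the numbers are purely illustrative) that computes b' with the
normal equation and, as a numerically safer check, with np.linalg.lstsq:
import numpy as np

# small illustrative dataset: 5 observations, 2 features (values are made up)
features = np.array([[1.0, 2.0],
                     [2.0, 0.5],
                     [3.0, 1.5],
                     [4.0, 3.0],
                     [5.0, 2.5]])
y = np.array([3.0, 4.0, 6.5, 9.0, 10.0])

# build the design matrix X by prepending a column of ones (for b_0)
X = np.column_stack([np.ones(len(features)), features])

# normal equation: b' = (X'X)^-1 X'y
b_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print("b' via the normal equation:", b_hat)

# the same estimate from a least-squares solver
b_hat_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print("b' via lstsq:", b_hat_lstsq)

# predicted responses y' = Xb'
print("y' =", X @ b_hat)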