Machine Learning Algorithms with Scikit-Learn
Hichem Felouat
hichemfel@gmail.com
https://www.linkedin.com/in/hichemfelouat/
Table of Contents
1) Dataset Loading
2) Preprocessing data
3) Feature selection
4) Dimensionality Reduction
5) Training and Test Sets
6) Supervised learning
7) Unsupervised learning
8) Save and Load Machine Learning Models
1. Dataset Loading: Pandas
import pandas as pd

df = pd.DataFrame(
    {"a": [4, 5, 6],
     "b": [7, 8, 9],
     "c": [10, 11, 12]},
    index=[1, 2, 3])

#    a  b   c
# 1  4  7  10
# 2  5  8  11
# 3  6  9  12
# The .tail() method returns the last rows of the DataFrame; the n argument restricts
# how many rows are displayed
data.tail(n=2)
1. Dataset Loading: Pandas
Training Set & Test Set
columns = [' ', ... , ' ']   # list the n-1 feature columns to keep
my_data = data[columns]
df = pd.read_excel('file.xlsx')
df.to_excel('myDataFrame.xlsx')
from sklearn import datasets

dat = datasets.load_breast_cancer()
print("Examples = ", dat.data.shape, " Labels = ", dat.target.shape)
dat = datasets.fetch_20newsgroups(subset='train')
from pprint import pprint
pprint(list(dat.target_names))
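The mice object inspected below is an OpenML dataset; a minimal sketch of how it could be fetched (the dataset name and arguments are assumptions, not shown on the original slide):

from sklearn.datasets import fetch_openml

# Mice Protein Expression dataset from OpenML (https://www.openml.org/d/40966);
# a specific dataset version can be pinned with the version parameter.
mice = fetch_openml(name='miceprotein', as_frame=False)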
>>> mice.data.shape
(1080, 77)
>>> mice.target.shape
(1080,)
>>> np.unique(mice.target)
array(['c-CS-m', 'c-CS-s', 'c-SC-m', 'c-SC-s', 't-CS-m', 't-CS-s', 't-SC-m', 't-SC-s'],
dtype=object)
>>> mice.url
'https://www.openml.org/d/40966'
>>> mice.details['version']
'1'
# a: a small NumPy integer array created earlier in the session; write it to disk and read it back
In [5]: a.tofile('test2.dat')
In [6]: c = np.fromfile('test2.dat', dtype=int)
In [7]: c == a
Out[7]: array([ True, True, True, True], dtype=bool)
from sklearn import datasets
import matplotlib.pyplot as plt

x, y = datasets.make_regression(
    n_samples=30,
    n_features=1,
    noise=0.8)
plt.scatter(x, y)
plt.show()
1. Dataset Loading: Generated Datasets - Clustering
from sklearn.datasets import make_blobs   # the samples_generator module is deprecated
from matplotlib import pyplot as plt
import pandas as pd

X, y = make_blobs(n_samples=200, centers=4, n_features=2)

# Put the samples in a DataFrame and group them by cluster label for plotting
df = pd.DataFrame(dict(x1=X[:, 0], x2=X[:, 1], label=y))
groups = df.groupby('label')

fig, ax = plt.subplots()
colors = ["blue", "red", "green", "purple"]
for idx, classification in groups:
    classification.plot(ax=ax, kind='scatter', x='x1', y='x2',
                        label=idx, color=colors[idx])
plt.show()
2. Preprocessing Data: missing values
# Detect missing values
df.isnull()
# Treat placeholder strings such as '-' and 'na' as missing values
df = df.replace(['-', 'na'], np.nan)
2. Preprocessing Data: missing values
import numpy as np
from sklearn.impute import SimpleImputer
X = [[np.nan, 2], [6, np.nan], [7, 6]]
# strategy: 'mean', 'median', 'most_frequent', or 'constant' (with fill_value=...)
imp = SimpleImputer(missing_values = np.nan, strategy='mean')
data = imp.fit_transform(X)
print(data)
from sklearn.datasets import make_classification

# flip_y: fraction of labels randomly flipped (label noise);
# class_sep: smaller values make the classes harder to separate
X, y = make_classification(n_samples=10000,
                           n_features=4,
                           flip_y=0.1, class_sep=0.1)
Standardization rescales each feature as z = (x - u) / s, where x is the sample value, u is the mean of the training samples, and s is their standard deviation.

from sklearn.preprocessing import StandardScaler

X_train_std = StandardScaler().fit_transform(X_train)
print(X_train_std)
# MaxAbsScaler scales the training data so that it lies within the range [-1, 1]
from sklearn import preprocessing
import numpy as np

X_train = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])
max_abs_scaler = preprocessing.MaxAbsScaler()
X_train_maxabs = max_abs_scaler.fit_transform(X_train)
scaler = preprocessing.RobustScaler()
X_train_rob_scal = scaler.fit_transform(X_train)
print(X_train_rob_scal)
# Box-Cox requires strictly positive data; X_lognormal is assumed to be an array of
# log-normally distributed samples, e.g. np.random.RandomState(0).lognormal(size=(3, 3))
pt = preprocessing.PowerTransformer(method='box-cox', standardize=False)
print(pt.fit_transform(X_lognormal))
3. Feature Selection: 3. Normalization
Normalization is the process of scaling individual samples to have
unit norm. This process can be useful if you plan to use a quadratic
form such as the dot-product or any other kernel to quantify the
similarity of any pair of samples.
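A minimal sketch of sample-wise normalization with preprocessing.normalize (the X matrix below is an assumed toy example):

from sklearn import preprocessing
import numpy as np

X = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])   # toy data (assumed)
X_normalized = preprocessing.normalize(X, norm='l2')          # each row rescaled to unit L2 norm
print(X_normalized)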
3. Feature Selection: 5. Discretization
# kbd is assumed to be a KBinsDiscretizer, e.g.
# kbd = preprocessing.KBinsDiscretizer(n_bins=3, encode='ordinal')
X_kbd = kbd.fit_transform(X)
print(X_kbd)
3. Feature Selection: 5.1 Feature Binarization
from sklearn import preprocessing
import numpy as np
X = [[ 1., -1., 2.],[ 2., 0., 0.],[ 0., 1., -1.]]
binarizer = preprocessing.Binarizer()
X_bin = binarizer.fit_transform(X)
print(X_bin)
# It is possible to adjust the threshold of the binarizer:
binarizer_1 = preprocessing.Binarizer(threshold=1.1)
X_bin_1 = binarizer_1.fit_transform(X)
print(X_bin_1)
3. Feature Selection: 6. Generating Polynomial Features
Often it’s useful to add complexity to the model by
considering nonlinear features of the input data. A simple
and common method to use is polynomial features, which
can get features’ high-order and interaction terms. It is
implemented in PolynomialFeatures.
For 2 features (X1, X2), the degree-2 polynomial features are (1, X1, X2, X1^2, X1*X2, X2^2):

from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2)   # degree assumed
X_poly = poly.fit_transform(X)
print(X_poly)
3. Feature Selection: 7. Custom Transformers
Often, you will want to convert an existing Python function into a transformer to
assist in data cleaning or processing. You can implement a transformer from an
arbitrary function with FunctionTransformer. For example, to build a transformer
that applies a log transformation in a pipeline, do:
from sklearn import preprocessing
import numpy as np

transformer = preprocessing.FunctionTransformer(np.log1p, validate=True)
X = np.array([[0, 1], [2, 3]])
X_tr = transformer.fit_transform(X)
print(X_tr)
3. Feature Selection: Text Feature
scikit-learn provides utilities for the most common ways to extract numerical features from text content, namely: tokenizing strings, counting token occurrences, and normalizing / weighting the counts (the bag-of-words representation).
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# N-gram analyzer: extract unigrams and bigrams (vectorizer settings assumed)
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
analyze = bigram_vectorizer.build_analyzer()
analyze('Bi-grams are cool!') == (
    ['bi', 'grams', 'are', 'cool', 'bi grams', 'grams are', 'are cool'])

texts = ["the cat sat", "the dog sat on the mat"]   # toy corpus (assumed)
vec = CountVectorizer(binary=False)   # term counts
vec = TfidfVectorizer()               # or TF-IDF weights
vec.fit(texts)
print([w for w in sorted(vec.vocabulary_.keys())])
X = vec.transform(texts).toarray()
import pandas as pd
pd.DataFrame(vec.transform(texts).toarray(),
columns=sorted(vec.vocabulary_.keys()))
3. Feature Selection: Image Feature
from sklearn.feature_extraction import image
from sklearn.datasets import fetch_olivetti_faces
import matplotlib.pyplot as plt

data = fetch_olivetti_faces()
plt.imshow(data.images[0])
plt.show()

# image.extract_patches_2d extracts small patches from an image (patch size and count assumed)
patches = image.extract_patches_2d(data.images[0], (8, 8), max_patches=16)
print(patches.shape)
import cv2

def histogram(image, mask=None):
    # Compute a normalized hue-channel histogram in HSV color space
    image = cv2.cvtColor(image, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([image], [0], mask, [256], [0, 256])
    cv2.normalize(hist, hist)
    return hist.flatten()
3. Feature Selection: Image Feature
import mahotas
def haralick_moments(image):
#image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
image = image.astype(int)
haralick = mahotas.features.haralick(image).mean(axis=0)
return haralick
class ZernikeMoments:
    def __init__(self, radius):
        # store the size of the radius that will be
        # used when computing moments
        self.radius = radius

    def describe(self, image):
        # compute the Zernike moments of the image (method name assumed; not shown on the original slide)
        return mahotas.features.zernike_moments(image, self.radius)
4. Dimensionality Reduction: Principal Component Analysis (PCA)
from sklearn import datasets
from sklearn.decomposition import PCA

dat = datasets.load_breast_cancer()
X, Y = dat.data, dat.target
print("Examples = ", X.shape, " Labels = ", Y.shape)

pca = PCA(n_components=5)
X_pca = pca.fit_transform(X)
print("Examples = ", X_pca.shape, " Labels = ", Y.shape)
4. Dimensionality Reduction: Kernel Principal Component Analysis (KPCA)
Non-linear dimensionality reduction through the use of kernels.

from sklearn import datasets
from sklearn.decomposition import KernelPCA

dat = datasets.load_breast_cancer()
X, Y = dat.data, dat.target
print("Examples = ", X.shape, " Labels = ", Y.shape)

# kernel: "linear" | "poly" | "rbf" | "sigmoid" | "cosine" | "precomputed"
kpca = KernelPCA(n_components=7, kernel='rbf')
X_kpca = kpca.fit_transform(X)
print("Examples = ", X_kpca.shape, " Labels = ", Y.shape)
4. Dimensionality Reduction: PCA vs KPCA
(figure: comparison of the PCA and KPCA projections)

4. Dimensionality Reduction: Linear Discriminant Analysis (LDA)
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# LDA keeps at most (n_classes - 1) components, so with n_components=2 a dataset with at
# least 3 classes (e.g., datasets.load_iris()) is assumed for X, Y here.
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit(X, Y).transform(X)
print("Examples = ", X_lda.shape, " Labels = ", Y.shape)
# Plot learning curves: train_errors and val_errors are assumed to be lists of MSE values
# collected while fitting the model on training sets of increasing size.
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(12, 8))
ax.plot(np.sqrt(train_errors), "r-+", linewidth=2, label="train")
ax.plot(np.sqrt(val_errors), "b-", linewidth=3, label="val")
ax.legend(loc='upper right', bbox_to_anchor=(0.5, 1.1), ncol=1, fancybox=True, shadow=True)
ax.set_xlabel('Training set size')
ax.set_ylabel('RMSE')
plt.show()
1. Regression: Ridge regression
Ridge regression addresses some of the problems of Ordinary Least
Squares by imposing a penalty on the size of the coefficients.
from sklearn import datasets, linear_model
# Regularization strength (alpha) must be a positive float. Regularization improves the conditioning
# of the problem and reduces the variance of the estimates; larger values specify stronger regularization.
regressor = linear_model.Ridge(alpha=.5)
regressor.fit(X_train, Y_train)
predicted = regressor.predict(X_test)
# Lasso
regressor = linear_model.Lasso(alpha=0.1)
1. Regression: Kernel Ridge regression
from sklearn.kernel_ridge import KernelRidge

# kernel = ['linear', 'polynomial', 'rbf', ...]
regressor = KernelRidge(kernel='rbf', alpha=1.0)
regressor.fit(X_train, Y_train)
predicted = regressor.predict(X_test)

1. Regression: Random Forest Regression
from sklearn.ensemble import RandomForestRegressor

regressor = RandomForestRegressor(max_depth=5, random_state=0)
regressor.fit(X, Y)
predicted = regressor.predict(X)
2. Classification: Support Vector Machines (SVM)
1) Generate hyperplanes that separate the classes in the best way. The figure shows three candidate hyperplanes (black, blue, and orange): the blue and orange ones have higher classification error, while the black one separates the two classes correctly.
Gamma (kernel coefficient for 'rbf', 'poly', and 'sigmoid'): a low value of gamma considers only nearby points when calculating the separation line, while a high value of gamma considers all the data points in the calculation of the separation line.
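A minimal sketch of how these hyperparameters appear in scikit-learn's SVC (the kernel, C and gamma values below are arbitrary examples, not recommendations):

from sklearn.svm import SVC

# kernel, C and gamma are the main hyperparameters to tune (example values assumed)
svc_cls = SVC(kernel='rbf', C=1.0, gamma=0.01)
svc_cls.fit(X_train, y_train)
predicted = svc_cls.predict(X_test)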
2. Classification: Support Vector Machines - Kernel + Tuning Hyperparameters

# Digits example (svc_cls, digits, X_train, y_train, X_test and n_samples are assumed to be
# set up as in scikit-learn's digits classification example)
svc_cls.fit(X_train, y_train)

_, axes = plt.subplots(2, 4)
images_and_labels = list(zip(digits.images, digits.target))
for ax, (image, label) in zip(axes[0, :], images_and_labels[:4]):
    ax.set_axis_off()
    ax.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    ax.set_title('Training: %i' % label)

# Now predict the value of the digit on the second half:
predicted = svc_cls.predict(X_test)
images_and_predictions = list(zip(digits.images[n_samples // 2:], predicted))
for ax, (image, prediction) in zip(axes[1, :], images_and_predictions[:4]):
    ax.set_axis_off()
    ax.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    ax.set_title('Prediction: %i' % prediction)
2. Classification: Logistic Regression
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

dat = datasets.load_breast_cancer()
print("Examples = ", dat.data.shape, " Labels = ", dat.target.shape)
X_train, X_test, Y_train, Y_test = train_test_split(dat.data, dat.target, test_size=0.20, random_state=100)

logreg = LogisticRegression()   # multi_class default = 'auto'
logreg.fit(X_train, Y_train)
predicted = logreg.predict(X_test)
Disadvantages:
Logistic regression is not able to handle a large number of categorical features/variables, and it is vulnerable to overfitting. It also cannot solve non-linear problems directly, which is why non-linear features must be transformed first. Logistic regression will not perform well when the independent variables are uncorrelated with the target variable, or when they are very similar or highly correlated with each other.
2. Classification: Stochastic Gradient Descent
Gradient Descent : at a theoretical level, gradient descent is an algorithm that minimizes
functions. Given a function defined by a set of parameters, gradient descent starts with an initial
set of parameter values and iteratively moves toward a set of parameter values that minimize
the function. This iterative minimization is achieved using calculus, taking steps in the negative
direction of the function gradient.
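As an illustration only, here is a tiny NumPy sketch of batch gradient descent for linear regression (the data, learning rate, and iteration count are arbitrary assumptions):

import numpy as np

# Toy 1-D linear regression data (assumed)
X = np.c_[np.ones(100), np.random.rand(100, 1)]          # bias column + feature
y = 4 + 3 * X[:, 1:] + 0.1 * np.random.randn(100, 1)     # y = 4 + 3x + noise

eta, n_iterations = 0.1, 1000                            # learning rate and steps (assumed)
theta = np.random.randn(2, 1)                            # random initialization
for _ in range(n_iterations):
    gradients = 2 / len(X) * X.T.dot(X.dot(theta) - y)   # gradient of the MSE cost
    theta = theta - eta * gradients                      # step in the negative gradient direction
print(theta)                                             # should approach [[4], [3]]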
SGD has been successfully applied to large-scale and sparse machine learning
problems often encountered in text classification and natural language
processing. Given that the data is sparse, the classifiers in this module easily
scale to problems with more than 10^5 training examples and more than 10^5
features.
# clf is assumed to be a fitted SGDClassifier, e.g.
# clf = SGDClassifier(loss="hinge", penalty="l2"); clf.fit(X_train, y_train)
predicted = clf.predict(X_test)
2. Classification: K-Nearest Neighbors (KNN)
Disadvantages:
1) The KNN algorithm doesn't work well with high-dimensional data, because with a large number of dimensions it becomes difficult for the algorithm to calculate the distance in each dimension.
2) The KNN algorithm has a high prediction cost for large datasets, because the cost of calculating the distance between a new point and every existing point grows with the size of the dataset.
3) Finally, the KNN algorithm doesn't work well with categorical features, since it is difficult to define a distance over dimensions with categorical values.
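A minimal KNeighborsClassifier sketch (n_neighbors is an arbitrary assumption; reuses the train/test split from the earlier classification examples):

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)   # number of neighbors assumed
knn.fit(X_train, Y_train)
predicted = knn.predict(X_test)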
2. Classification: Naive Bayes Classification
When the features are independent, we can extend the Bayes Rule to what
is called Naive Bayes.
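Written out, the naive independence assumption gives the standard factorized form (general formulation, not specific to these slides):

P(y \mid x_1, \dots, x_n) \;\propto\; P(y)\prod_{i=1}^{n} P(x_i \mid y),
\qquad
\hat{y} = \arg\max_{y}\; P(y)\prod_{i=1}^{n} P(x_i \mid y)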
Laplace Correction:
In a model with many features, the entire product of probabilities becomes zero as soon as a single feature value has a zero count. To avoid this, we add a small value (usually 1) to the count of every feature value in the numerator, so that the overall probability never becomes zero.
2. Classification: Naive Bayes Classification
Training data (first rows of the 14-example weather dataset):
     Weather    Temperature   Play
1    Sunny      Hot           No
2    Sunny      Hot           No
3    Overcast   Hot           Yes
4    Rainy      Mild          Yes
5    Rainy      Cool          Yes
6    Rainy      Cool          No

Likelihood table for Weather:
Weather     No   Yes   P(Weather|No)   P(Weather|Yes)
Sunny        3    2        3/6              2/8
Overcast     1    3        1/6              3/8
Rainy        2    3        2/6              3/8
Total        6    8        6/14             8/14   (class priors P(No) and P(Yes))

Likelihood table for Temperature:
Temperature  No   Yes   P(Temperature|No)   P(Temperature|Yes)
Hot           3    1        3/6                  1/8
Mild          2    4        2/6                  4/8
Cool          1    3        1/6                  3/8
In Gaussian Naive Bayes, the continuous values associated with each feature are assumed to be distributed according to a Gaussian distribution, also called the Normal distribution.
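A minimal GaussianNB sketch (reuses the train/test split from the earlier classification examples):

from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(X_train, Y_train)
predicted = gnb.predict(X_test)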
2. Classification: Decision Trees
Advantages:
Decision trees are supervised learning algorithms that can be used for both classification and regression tasks.
The main idea of decision trees is to find the descriptive features which contain the most information regarding the target feature, and then split the dataset along the values of these features such that the target feature values of the resulting sub-datasets are as pure as possible. The descriptive feature which splits the target feature most purely is said to be the most informative one. This process of finding the most informative feature is repeated until we reach a stopping criterion, at which point we end up in so-called leaf nodes. The leaf nodes contain the predictions we will make for new query instances presented to our trained model. This is possible because the model has learned the underlying structure of the training data and hence can, given some assumptions, make predictions about the target feature value (class) of unseen query instances.
1) Select a test for the root node. Create a branch for each possible outcome of the test.
2) Split the instances into subsets, one for each branch extending from the node.
3) Repeat recursively for each branch, using only the instances that reach the branch.
4) Stop the recursion for a branch if all its instances have the same class.
Attribute selection measure (ASM): a heuristic for selecting the splitting criterion that partitions the data in the best possible manner.
2. Classification: Decision Trees
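Before exporting the tree below, a classifier has to be trained; a minimal sketch (the dataset and max_depth are assumptions):

from sklearn import datasets, tree

dat = datasets.load_breast_cancer()                   # dataset assumed
tree_clf = tree.DecisionTreeClassifier(max_depth=3)   # depth assumed
tree_clf.fit(dat.data, dat.target)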
# PNG export: here graph is assumed to be a pydotplus graph, e.g.
# graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_png('tree1.png')

# PDF export with graphviz
import graphviz
dot_data = tree.export_graphviz(tree_clf, out_file=None)
graph = graphviz.Source(dot_data)
graph.render("iris")

dot_data = tree.export_graphviz(tree_clf, out_file=None, feature_names=dat.feature_names, class_names=dat.target_names,
                                filled=True, rounded=True, special_characters=True)
graph = graphviz.Source(dot_data)
graph.view('tree2.pdf')
2. Classification: Decision Trees - Example
Gini impurity: $G_i = 1 - \sum_{k=1}^{n} p_{i,k}^{2}$
Entropy: $H_i = -\sum_{k=1,\; p_{i,k} \neq 0}^{n} p_{i,k}\,\log_2(p_{i,k})$
where $p_{i,k}$ is the ratio of class-k instances among the training instances in node i.
Scikit-Learn uses the CART algorithm, which produces only binary trees:
nonleaf nodes always have two children (i.e., questions only have yes/no
answers). However, other algorithms such as ID3 can produce Decision Trees
with nodes that have more than two children.
2. Classification: Random Forests
Advantages:
1) Random forests are considered a highly accurate and robust method because of the number of decision trees participating in the process.
2) They are less prone to overfitting, mainly because they average the predictions of many trees, which reduces the variance.
3) The algorithm can be used for both classification and regression problems.
Disadvantages:
1) Random forests are slow in generating predictions because they contain multiple decision trees. Whenever a prediction is made, all the trees in the forest have to make a prediction for the same input and then vote on it. This whole process is time-consuming.
2) The model is more difficult to interpret than a single decision tree, where you can easily explain a decision by following the path in the tree.
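A minimal RandomForestClassifier sketch (hyperparameter values assumed; reuses the train/test split from the earlier classification examples):

from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)   # values assumed
rf_clf.fit(X_train, Y_train)
predicted = rf_clf.predict(X_test)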
3. Ensemble Methods
The goal of ensemble methods is to combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability / robustness over a single estimator.
In averaging methods, the driving principle is to build several estimators independently and then to average their predictions. On average, the combined estimator is usually better than any single base estimator because its variance is reduced.
Examples: Bagging methods, Forests of randomized trees, …
By contrast, in boosting methods, base estimators are built sequentially and one tries to
reduce the bias of the combined estimator. The motivation is to combine several weak models
to produce a powerful ensemble.
Examples: AdaBoost, Gradient Tree Boosting, …
The idea behind the VotingClassifier is to combine conceptually different machine learning classifiers and use a majority vote (hard voting) or the average predicted probabilities (soft voting) to predict the class labels. Such a classifier can be useful for a set of equally well-performing models in order to balance out their individual weaknesses.
Soft voting returns the class label as argmax of the sum of predicted probabilities. Specific
weights can be assigned to each classifier via the weights parameter. When weights are
provided, the predicted class probabilities for each classifier are collected, multiplied by the
classifier weight, and averaged. The final class label is then derived from the class label with the
highest average probability.
Example:
w1=1, w2=1, w3=1.
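A minimal soft-voting sketch with three classifiers and equal weights (the choice of base estimators is an arbitrary assumption for illustration):

from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

voting_clf = VotingClassifier(
    estimators=[('lr', LogisticRegression(max_iter=10000)),
                ('dt', DecisionTreeClassifier()),
                ('gnb', GaussianNB())],
    voting='soft', weights=[1, 1, 1])   # w1 = w2 = w3 = 1
voting_clf.fit(X_train, y_train)
predicted = voting_clf.predict(X_test)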
In this approach, we use the same training algorithm for every predictor but train each one on a different random subset of the training set. When sampling is performed with replacement, the method is called bagging (short for bootstrap aggregating). When sampling is performed without replacement, it is called pasting.
This is an example of bagging, but if you want to use pasting instead, just set
bootstrap=False
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500, max_samples=100,
    bootstrap=True, n_jobs=-1)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)
3. Ensemble Methods: AdaBoost
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=100,
    algorithm="SAMME.R", learning_rate=0.5)
ada_clf.fit(X_train, y_train)
3. Ensemble Methods: Gradient Tree Boosting
# Training a gradient boosting regressor
import numpy as np
from sklearn import metrics
from sklearn.ensemble import GradientBoostingRegressor

clf = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                max_depth=1, random_state=0, loss='ls')   # loss='ls' is named 'squared_error' in recent scikit-learn versions
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print('Mean Absolute Error     : ', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error      : ', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error : ', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
3. Ensemble Methods: XGBoost
XGBoost is a specific implementation of the Gradient Boosting method which
uses more accurate approximations to find the best tree model. It employs a
number of nifty tricks that make it exceptionally successful, particularly with
structured data. XGBoost is often an important component of the winning
entries in ML competitions.
# xg_reg is assumed to be an XGBoost regressor, e.g.
# xg_reg = xgboost.XGBRegressor(objective='reg:squarederror', n_estimators=100)
xg_reg.fit(X_train, y_train)
predicted = xg_reg.predict(X_test)
param_grid = [
{'C': [1, 10, 100, 1000], 'kernel': ['linear']},
{'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]
This grid specifies that two parameter grids should be explored: one with a linear kernel and C values in [1, 10, 100, 1000], and a second one with an RBF kernel and the cross-product of C values in [1, 10, 100, 1000] and gamma values in [0.001, 0.0001].
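A minimal sketch of running this grid with GridSearchCV over an SVC (the cv value is an assumed example):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

grid_search = GridSearchCV(SVC(), param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
print(grid_search.best_score_)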
Parallelism:
GridSearchCV and RandomizedSearchCV evaluate each parameter setting independently.
Computations can be run in parallel if your OS supports it, by using the keyword n_jobs=-1.
Pipeline can be used to chain multiple estimators into one. This is useful as there is often
a fixed sequence of steps in processing the data, for example feature selection, normalization
and classification. Pipeline serves multiple purposes here:
Safety:
Pipelines help avoid leaking statistics from your test data into the trained model in cross-
validation, by ensuring that the same samples are used to train the transformers and predictors.
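A minimal Pipeline sketch chaining scaling and classification (the steps and estimator are arbitrary examples):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = Pipeline([('scaler', StandardScaler()),   # preprocessing step
                 ('clf', SVC())])                # final estimator
pipe.fit(X_train, y_train)
predicted = pipe.predict(X_test)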
7. Clustering - K-Means
from sklearn.cluster import KMeans
import numpy as np

# Provide the initial centroids explicitly and run a single initialization
good_init = np.array([[-3, 3], [-3, 2], [-3, 1], [-1, 2], [0, 2]])
kmeans = KMeans(n_clusters=5, init=good_init, n_init=1)
7. Clustering - Density-based spatial clustering of applications with noise (DBSCAN)
1) For each instance, the algorithm counts how many instances are located within a small distance ε (epsilon) from it. This region is called the instance's ε-neighborhood.
2) If an instance has at least min_samples instances in its ε-neighborhood (including itself), it is considered a core instance.
3) All instances in the neighborhood of a core instance belong to the same cluster. This neighborhood may include other core instances; therefore, a long sequence of neighboring core instances forms a single cluster.
4) Any instance that is not a core instance and does not have one in its neighborhood is considered an anomaly.
7. Clustering - Density-based spatial clustering of applications with noise (DBSCAN)
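A minimal DBSCAN sketch (the toy data, eps, and min_samples values are assumed examples):

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=1000, noise=0.05)   # toy data (assumed)
dbscan = DBSCAN(eps=0.2, min_samples=5)
dbscan.fit(X)
print(dbscan.labels_[:10])                      # cluster labels; -1 marks anomalies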
7. Clustering - Agglomerative Hierarchical Clustering
1) At the start, treat each data point as one cluster. The number of clusters at the start is therefore K, where K is the number of data points.
2) Form a cluster by joining the two closest data points, resulting in K-1 clusters.
3) Form more clusters by joining the two closest clusters, resulting in K-2 clusters.
4) Repeat the above steps until one big cluster is formed.
5) Once a single cluster is formed, dendrograms are used to divide it into multiple clusters, depending on the problem.
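A minimal AgglomerativeClustering sketch (the number of clusters and linkage are assumed example values; X is the data matrix from the clustering examples above):

from sklearn.cluster import AgglomerativeClustering

agg = AgglomerativeClustering(n_clusters=4, linkage='ward')
labels = agg.fit_predict(X)
print(labels[:10])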
# Gaussian mixture model: gm is assumed to be a fitted sklearn.mixture.GaussianMixture, e.g.
# gm = GaussianMixture(n_components=3, n_init=10).fit(X), and plot_gm a custom plotting helper (not shown)
plot_gm(gm, X)