Python Machine Learning
from Learning Python for Data Analysis and Visualization by Jose Portilla
https://www.udemy.com/learning-python-for-data-analysis-and-visualization/
Notes by Michael Brothers
Table of Contents
What is Machine Learning?.................................................................................................................................................... 3
Types of Machine Learning – Supervised & Unsupervised ................................................................................................... 3
Supervised Learning ........................................................................................................................................................... 3
Supervised Learning: Regression .................................................................................................................................... 3
Supervised Learning: Classification................................................................................................................................. 3
Unsupervised Learning ....................................................................................................................................................... 3
Supervised Learning – LINEAR REGRESSION ......................................................................................................................... 4
Getting & Setting Up the Data ........................................................................................................................................... 4
Quick visualization of the data: ......................................................................................................................................... 4
Root Mean Square Error .................................................................................................................................................... 6
Using SciKit Learn to perform multivariate regressions ................................................................................................... 6
Building Training and Validation Sets using train_test_split ............................................................................. 7
Predicting Prices ................................................................................................................................................................. 7
Residual Plots ..................................................................................................................................................................... 8
Supervised Learning – LOGISTIC REGRESSION ...................................................................................................................... 9
Getting & Setting Up the Data ........................................................................................................................................... 9
Binary Classification using the Logistic Function ............................................................................................................... 9
Dataset Analysis ................................................................................................................................................................. 9
Data Preparation .............................................................................................................................................................. 10
Multicollinearity Consideration ....................................................................................................................................... 11
Testing and Training Data Sets ........................................................................................................................................ 11
For more info on Logistic Regression:.............................................................................................................................. 12
Supervised Learning – MULTI-CLASS CLASSIFICATION........................................................................................................ 12
The Iris Flower Data Set ................................................................................................................................................... 12
Getting & Setting Up the Data ......................................................................................................................................... 13
Data Visualization............................................................................................................................................................. 13
Plotting individual histograms: ........................................................................................................................................ 14
Multi-Class Classification with Sci Kit Learn .................................................................................................................... 14
K-Nearest Neighbors ........................................................................................................................................................ 14
SUPPORT VECTOR MACHINES.............................................................................................................................................. 16
Supervised Learning using NAÏVE BAYES CLASSIFIERS ........................................................................................................ 19
Bayes' Theorem ................................................................................................................................................................ 19
Naïve Bayes Equation....................................................................................................................................................... 19
Constructing a classifier from the probability model ..................................................................................................... 19
Gaussian Naïve Bayes....................................................................................................................................................... 19
For more info on Naïve Bayes: ......................................................................................................................................... 20
DECISION TREES and RANDOM FORESTS ............................................................................................................................ 20
Visualization Function ...................................................................................................................................................... 21
Random Forests ................................................................................................................................................................ 22
Random Forest Regression .............................................................................................................................................. 23
More resources for Random Forests: .............................................................................................................................. 24
Unsupervised Learning – NATURAL LANGUAGE PROCESSING ........................................................................................... 25
Exploratory Data Analysis (EDA) ...................................................................................................................................... 25
Feature Engineering ......................................................................................................................................................... 25
Text Pre-processing .......................................................................................................................................................... 26
Vectorization .................................................................................................................................................................... 26
Term Frequency – Inverse Document Frequency (TF-IDF) .............................................................................................. 27
Training a Model .............................................................................................................................................................. 27
APPENDIX I – SciKit Learn Boston Dataset: ......................................................................................................................... 28
APPENDIX II: FOR FURTHER RESEARCH ............................................................................................................................... 29
PYTHON MACHINE LEARNING WITH SCIKIT LEARN
Unsupervised Learning
Here data has no labels, and we are interested in finding similarities between the objects in question.
In a sense, unsupervised learning is a means of discovering labels from the data itself.
Supervised Learning – LINEAR REGRESSION
Ultimately we want to minimize the difference between our hypothetical model (parameterized by theta) and the actual data,
in an exercise called Gradient Descent (iteratively stepping the parameter values in the direction that reduces the error).
Note that complex gradient descents may be subject to local minima.
Batch Gradient Descent – stepwise calculations performed over the entire training set (i = 0 to m), repeated until convergence
Stochastic Gradient Descent – for j = 1 to m, adjust the parameters based on one training example at a time. In a sense, the
calculations meander their way toward the minimum without necessarily hitting it exactly, but they get there much faster
for large data sets.
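To make the update rule concrete, here is a minimal NumPy sketch of batch gradient descent for a univariate linear model. This is purely illustrative; the function name, learning rate, and iteration count are my own choices, not from the lecture.
import numpy as np

def batch_gradient_descent(x, y, alpha=0.01, iterations=1000):
    '''Fit y = theta0 + theta1*x by repeatedly stepping down the MSE gradient.'''
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        pred = theta0 + theta1 * x
        # Gradient of the mean squared error with respect to each parameter
        grad0 = (1.0 / m) * np.sum(pred - y)
        grad1 = (1.0 / m) * np.sum((pred - y) * x)
        # Step both parameters simultaneously, scaled by the learning rate alpha
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1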
Plot the column at index 5 (labeled RM):
plt.scatter(boston.data[:,5],boston.target)
plt.ylabel('Price in $1000s')
plt.xlabel('Number of rooms')
The lecture then builds a DataFrame using features specific to the SciKit boston dataset:
boston_df = DataFrame(boston.data)
boston_df.columns = boston.feature_names to label the columns
boston_df['Price'] = boston.target adds a column not yet present
boston_df.head()
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT Price
0 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
1 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6
2 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
3 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
4 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2
He explains the math behind the Least Squares Method, then applies numpy to the univariate problem at hand:
X = np.vstack(boston_df.RM) Use vstack to make X two-dimensional
X = np.array([[value,1] for value in X]) pairs each x-value with a constant 1 (the intercept column for lstsq)
this feels messy
Y = boston_df.Price Set up Y as the target price of the houses.
m, b = np.linalg.lstsq(X, Y)[0] returns m & b values for the least-squares-fit line
plt.plot(boston_df.RM,boston_df.Price,'o') plot with best fit line (entered in one cell)
x = boston_df.RM
plt.plot(x, m*x + b,'r',label='Best Fit Line')
plt.legend(loc='lower right') unlike Seaborn, pyplot requires a separate legend line
Root Mean Square Error
Since we used numpy already, we can obtain the error the same way:
result = np.linalg.lstsq(X,Y)
error_total = result[1]
rmse = np.sqrt(error_total/len(X)) this is the root mean square error
print "The root mean square error was %.2f " %rmse
The root mean square error was 6.60
Since the root mean square error (RMSE) corresponds approximately to the standard deviation we can now say
that the price of a house won't vary more than 2 times the RMSE 95% of the time.
Thus we can reasonably expect a house price to be within $13,200 of our line fit.
The sklearn.linear_model.LinearRegression class is an estimator. Estimators predict a value based on the observed data.
In scikit-learn, all estimators implement the fit() and predict() methods. The former method is used to learn the
parameters of a model, and the latter method is used to predict the value of a response variable for an explanatory
variable using the learned parameters. It is easy to experiment with different models using scikit-learn because all
estimators implement the fit and predict methods.
We'll start the multivariate regression analysis by separating our boston DataFrame into the data columns and the
target column:
X_multi = boston_df.drop('Price',1) these are our Data Columns
(to drop a column you need to pass axis=1)
Y_target = boston_df.Price this is our Target Column
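A minimal sketch of the estimator setup that produces the lreg coefficients shown below (standard scikit-learn calls; the exact code in the notebook may differ):
from sklearn.linear_model import LinearRegression

lreg = LinearRegression()            # create the estimator object
lreg.fit(X_multi, Y_target)          # learn coefficients for all 13 data columns

print "The estimated intercept coefficient is %.2f" % lreg.intercept_
print "The number of coefficients used was %d" % len(lreg.coef_)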
To see each of these coefficients mapped to their original columns:
coeff_df = DataFrame(boston_df.columns) Set a DataFrame from the Features
coeff_df.columns = ['Features']
Set a new column lining up the coefficients from the linear regression
coeff_df["Coefficient Estimate"] = pd.Series(lreg.coef_)
coeff_df
Features    Coefficient Estimate
0   CRIM       -0.107171
1   ZN          0.046395
2   INDUS       0.02086
3   CHAS        2.688561
4   NOX       -17.795759
5   RM          3.804752
6   AGE         0.000751
7   DIS        -1.475759
8   RAD         0.305655
9   TAX        -0.012329
10  PTRATIO    -0.953464
11  B           0.009393
12  LSTAT      -0.525467
13  Price            NaN
For more info on interpreting coefficients:
http://www.theanalysisfactor.com/interpreting-regression-coefficients/
SciKit Learn's built-in methods of best feature selection:
http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_regression.html
Jose claims that the most strongly correlated feature was # of rooms (RM), with a coefficient estimate of 3.8. I see NOX as the
highest, with a coefficient of -17.79. Related question: how much does a coefficient affect the target value if its variable
doesn't change much? i.e., a low coefficient on # of rooms may have a greater effect when rooms can easily double from 2
to 4, whereas a high coefficient on NOX may not matter much if the variation over our sample set is only 1 or 2
ppm. And what about orders of magnitude? A small change to a big number may outweigh a big change to a small one.
What about non-linear relationships? The number of rooms may have diminishing marginal utility.
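The train/test split that produces X_train, X_test, Y_train and Y_test below isn't reproduced on this page; a minimal sketch using the cross_validation module these notes use elsewhere (newer scikit-learn versions moved train_test_split to sklearn.model_selection):
from sklearn.cross_validation import train_test_split

# Split the features and target into random training and testing subsets
# (the same call works with X_multi for the multivariate case)
X_train, X_test, Y_train, Y_test = train_test_split(X, boston_df.Price)

print X_train.shape, X_test.shape, Y_train.shape, Y_test.shape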
Predicting Prices
lreg = LinearRegression()
Once again do a linear regression, except only on the training sets this time
lreg.fit(X_train,Y_train)
Run predictions on both the training and testing sets, then obtain the mean square error
(these values change with each new train_test_split run):
pred_train = lreg.predict(X_train)
pred_test = lreg.predict(X_test)
print "Fit a model X_train, and calculate MSE with Y_train: %.2f" % np.mean((Y_train - pred_train) ** 2)
print "Fit a model X_train, and calculate MSE with X_test and Y_test: %.2f" % np.mean((Y_test - pred_test) ** 2)
Fit a model X_train, and calculate MSE with Y_train: 42.95
Fit a model X_train, and calculate MSE with X_test and Y_test: 46.34
It looks like our mean square error between our training and testing was pretty close.
But how do we actually visualize this?
Residual Plots
In regression analysis, the difference between the observed value of the dependent variable (y) and the predicted value
(ŷ) is called the residual (e). Each data point has one residual, so that:
e = y − ŷ   (Residual = Observed value − Predicted value)
You can think of these residuals in the same way as the D value we discussed earlier, in this case however, there were
multiple data points considered.
A residual plot is a graph that shows the residuals on the vertical axis and the independent variable on the horizontal
axis. If the points in a residual plot are randomly dispersed around the horizontal axis, a linear regression model is
appropriate for the data; otherwise, a non-linear model is more appropriate.
Residual plots are a good way to visualize the errors in your data. If you have done a good job then your data should be
randomly scattered around line zero. If there is some structure or pattern, that means your model is not capturing
something. There could be an interaction between 2 variables that you're not considering, or maybe you are measuring
time-dependent data. If this is the case, go back to your model and check your data set closely.
So now let's go ahead and create the residual plot. For more info on the residual plots check out this great link.
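A minimal sketch of what that residual plot can look like with matplotlib (here the predicted values sit on the horizontal axis; the colors and axis limits are arbitrary choices):
# Scatter the training residuals in blue and the testing residuals in red
train = plt.scatter(pred_train, (Y_train - pred_train), c='b', alpha=0.5)
test = plt.scatter(pred_test, (Y_test - pred_test), c='r', alpha=0.5)

# A horizontal line at zero, where a perfect prediction would land
plt.hlines(y=0, xmin=-10, xmax=50)
plt.legend((train, test), ('Training', 'Test'), loc='lower left')
plt.title('Residual Plots')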
Great! Looks like there aren't any major patterns to be concerned about, (though it may be interesting to check out the
line occurring towards the upper right), but overall the majority of the residuals seem to be randomly allocated above
and below the horizontal.
NOTE: the line upper right relates to the outlier 50 values from the dataset (same disbursement of 11 values).
For more info: http://scikit-learn.org/stable/modules/linear_model.html#linear-model
Supervised Learning – LOGISTIC REGRESSION
Dataset Import
import statsmodels.api as sm
The Logistic Function takes any value from negative to positive infinity and always has an output between 0 and 1.
Refer to the jupyter notebook for code behind the plot above.
Essentially we're applying a linear regression equation to the logistic function. The goal is to return a probability of
"success" or "failure" from our linear regression equation. Since the logistic function outputs a value between 0 and 1,
we now have a binary classification between outputs from 0 to 0.5 (failure), and 0.5 to 1 (success).
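For reference, a minimal sketch of the logistic (sigmoid) function behind that plot (the function name and plotting range are my own):
import numpy as np
import matplotlib.pyplot as plt

def logistic(t):
    '''Maps any real number to a value strictly between 0 and 1'''
    return 1.0 / (1 + np.exp(-t))

t = np.linspace(-6, 6, 500)
plt.plot(t, logistic(t))
plt.title('Logistic Function')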
Dataset Analysis
The dataset is packaged within Statsmodels. It is a data set from a 1974 survey of women by Redbook magazine.
Married women were asked if they have had extramarital affairs. The published work on the data set can be found in:
Fair, Ray. 1978. “A Theory of Extramarital Affairs,” Journal of Political Economy, February, 45-61.
Given certain variables for each woman, can we classify them as either having participated in an affair,
or not participated in an affair?
Take a quick look at the Had_Affair column and mean values of all other attributes:
df.groupby('Had_Affair').mean()
Most of the values are fairly close to one another. There are no obvious correlations between a given parameter and
the likelihood of participating in an affair.
Data Preparation
Most of the columns in our dataset contain numeric data (age, level of education, degree of religiousness, etc.) while
Occupation does not. Occupation and Husband's Occupation contain Categorical Variables. We need to apply the pandas
get_dummies method to split each occupational category into its own column:
occ_dummies = pd.get_dummies(df['occupation'])
hus_occ_dummies = pd.get_dummies(df['occupation_husb'])
occ_dummies.columns = ['occ1','occ2','occ3','occ4','occ5','occ6']
hus_occ_dummies.columns = ['hocc1','hocc2','hocc3','hocc4','hocc5','hocc6']
This is just to rename the columns to something more recognizable
Drop the original columns (and the target) and load the new dataframes onto our dataset
X = df.drop(['occupation','occupation_husb','Had_Affair'],axis=1)
X = pd.concat([X, occ_dummies, hus_occ_dummies],axis=1)
Note: in the lecture, Jose first combined occ_dummies & hus_occ_dummies into a "dummies" dataframe and joined
that into X using concat. I chose to do it in one step.
Multicollinearity Consideration
Our six dummy occupation categories are highly correlated: among the six, exactly one will contain a "1" value, so you can
always determine the value of one column from the values of the other five. This leads to an exaggerated level of
accuracy in the regression calculation. To compensate, we drop one of the dummy columns, sacrificing a little information
in favor of more realistic regression calculations. While the choice of column is fairly arbitrary, it does affect the final result.
For more info see: https://en.wikipedia.org/wiki/Multicollinearity
In order to use the Y with SciKit Learn, we need to set it as a 1-D array. This means we need to "flatten" the array.
Numpy has a built in method for this called ravel:
Y = np.ravel(Y)
NOTE: Y was a Series to begin with, so np.array(Y) does the same thing!
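The model fit and the train/test evaluation are a minimal sketch here; the usual scikit-learn workflow would look something like the following (which dummy columns to drop, and the variable names, are my own choices):
# Break the dummy-variable collinearity by dropping one column from each group
X = X.drop('occ1', axis=1)
X = X.drop('hocc1', axis=1)

from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import train_test_split
from sklearn import metrics

log_model = LogisticRegression()

# Y is the flattened Had_Affair target from above
X_train, X_test, Y_train, Y_test = train_test_split(X, Y)
log_model.fit(X_train, Y_train)

# Accuracy on the held-out test set
class_predict = log_model.predict(X_test)
print metrics.accuracy_score(Y_test, class_predict)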
For more info on Logistic Regression:
So what could we do to try to further improve our Logistic Regression model?
We could try some regularization techniques or using a non-linear model.
A great post on how to do logistic regression analysis using Statsmodels from yhat!
The SciKit learn Documentation includes several examples at the bottom of the page.
DataRobot has a great overview of Logistic Regression
Fantastic resource from aimotion.blogspot on Logistic Regression and the mathematics of how it relates to the cost
function and gradient!
Supervised Learning – MULTI-CLASS CLASSIFICATION
In this section we will learn how to use multi-class classification with SciKit Learn to separate data into multiple classes.
We will first use SciKit Learn to implement a strategy known as one vs. all (sometimes called one vs. rest) to perform
multi-class classification. This method works by performing a separate binary logistic regression for each possible class;
the class predicted with the highest confidence is assigned to that data point.
For a great visual explanation of this, here is Andrew Ng's quick explanation of how one-vs-rest works:
from IPython.display import YouTubeVideo
YouTubeVideo("Zj403m-fjqg")
After we use the one-vs-all logistic regression method, we'll use the k nearest neighbors method to classify the data.
Getting & Setting Up the Data
from sklearn import linear_model
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
Y = iris.target
Put into a pandas DataFrame:
iris_data = DataFrame(X, columns=['Sepal Length','Sepal Width',
'Petal Length','Petal Width'])
iris_target = DataFrame(Y,columns=['Species'])
If we look at the iris_target data, we'll notice that the Species are still defined as either 0, 1, or 2.
Let's go ahead and use apply() to split the column, apply a naming function, and then combine it back together:
def flower(num):
    ''' Takes in a numerical class, returns a flower name'''
    if num == 0:
        return 'Setosa'
    elif num == 1:
        return 'Versicolour'
    else:
        return 'Virginica'
iris_target['Species'] = iris_target['Species'].apply(flower)
Data Visualization
We can get a quick birds eye view with seaborn's pairplot:
iris = pd.concat([iris_data, iris_target],axis=1) first create a combined DataFrame
sns.pairplot(iris, hue='Species', size=2)
Plotting individual histograms:
Use factorplot to view individual parameters (but you need to sort them first):
xorder = np.apply_along_axis(sorted, 0, iris['Petal Length'].unique())
sns.factorplot('Petal Length', data=iris, order=xorder, size = 10, hue='Species',
kind='count'); (see the full plot in the file Python Data Visualizations)
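The training step referenced in the next paragraph is sketched minimally here (the test_size and random_state parameters are assumptions; the notebook's exact values may differ):
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import train_test_split

logreg = LogisticRegression()

# Hold out 40% of the data for testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.4, random_state=3)

# Train the one-vs-rest logistic regression model
logreg.fit(X_train, Y_train)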
Now that we've trained our model with a training set, let's test our accuracy with the testing set.
We'll make a prediction using our model and then check its accuracy.
from sklearn import metrics Import testing metrics from SciKit Learn
Y_pred = logreg.predict(X_test) Prediction from X_test
print metrics.accuracy_score(Y_test,Y_pred) Check accuracy
0.933333333333
It looks like our model had a 93% accuracy (this could change from run to run due to the random splitting).
Should we trust this level of accuracy? I encourage you to figure out ways to intuitively understand this result.
Try looking at the PairPlot again and check to see how separate the data features initially were. Also try changing the
test_size parameter and check how that affects the outcome. In conclusion, given how clean the data is and how
separated some of the features are, we should expect pretty high accuracy.
K-Nearest Neighbors
Now we'll use k-nearest neighbors to implement Multi-Class Classification. The premise of this algorithm is simple:
given an object in a feature space, assign it to the class most common among its nearest neighbors in the
training set. "Nearness" is measured by a distance metric, usually Euclidean distance.
The k-nearest neighbor (kNN) algorithm is well explained in the following two videos:
How kNN algorithm works by Thales Sehn Körting
MIT OpenCourseWare 10. Introduction to Learning, Nearest Neighbors lecture by Patrick Winston
The brute-force approach computes the distance between your object and every point in the training set, and makes the
assignment based on the closest ones. The value of k determines how many neighbors are considered. For binary
classifications, always choose an odd number to avoid ties.
Using SciKit Learn with k=6:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 6) create an instance and set k=6
knn.fit(X_train,Y_train) fit our data to the instance
Y_pred = knn.predict(X_test) run a prediction
print metrics.accuracy_score(Y_test,Y_pred) check accuracy
0.95
You can cycle through 20 possible k-values to find the most accurate:
k_range = range(1, 21)
accuracy = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, Y_train)
    Y_pred = knn.predict(X_test)
    accuracy.append(metrics.accuracy_score(Y_test, Y_pred))
plt.plot(k_range, accuracy)
plt.xlabel('K value for kNN')
plt.ylabel('Testing Accuracy');
Interesting! Try changing the way Sci Kit Learn splits the training and testing data sets and re-running this analysis.
What changed?
A better classification method is to first divide your feature space based on the existing data (by tracing all
possible perpendicular bisectors to draw "decision boundaries"), then assign a class to the object by the space it
occupies. In some cases Euclidean distances don't apply, so we use vector angles (a mechanism for comparing
parameter ratios).
SUPPORT VECTOR MACHINES
Support Vector Machines (SVM) are a method that uses points in a transformed problem space that best separate
classes into two groups. Classification for multiple classes is then supported by a one-vs-all method (just like we
previously did for Logistic Regression for Multi Class Classification).
Formal Explanation:
In machine learning, support vector machines (SVMs) are supervised learning models with associated learning
algorithms that analyze data and recognize patterns, used for classification and regression analysis. Given a set of
training examples, each marked for belonging to one of two categories, an SVM training algorithm builds a model that
assigns new examples into one category or the other, making it a non-probabilistic binary linear classifier. An SVM model
is a representation of the examples as points in space, mapped so that the examples of the separate categories are
divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to
belong to a category based on which side of the gap they fall on.
Key to the success of Support Vector Machines is computation of a hyperplane dividing the classes. If the classes are not
linearly separable in the original feature space, a kernel trick may be employed to cast the data into a higher dimension
where a separating hyperplane can be found. Nice video here.
Import numpy and matplotlib (but not Seaborn), and the Iris dataset as seen above.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
Y = iris.target
Next, import the SVC (Support Vector Classification) from the SVM library of Sci Kit Learn and set up our model:
from sklearn.svm import SVC
model = SVC()
from sklearn.cross_validation import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y)
model.fit(X_train,Y_train)
Now that we've gone through a basic implementation of SVM, let's go ahead and quickly explore the various kernel types
we can use for classification. We can do this by plotting out the boundaries created by each kernel type! We'll start with
some imports and by setting up the data.
The four methods we will explore are two linear models, a Gaussian Radial Basis Function, and an SVC with a polynomial
(3rd Degree) kernel.
The linear models LinearSVC() and SVC(kernel='linear') yield slightly different decision boundaries. This can be a
consequence of the following differences:
• LinearSVC minimizes the squared hinge loss while SVC minimizes the regular hinge loss.
• LinearSVC uses the One-vs-All multiclass reduction while SVC uses the One-vs-One multiclass reduction.
We'll use all the data and not bother with a split between training and testing. We'll also only use two features.
X = iris.data[:,:2]
Y = iris.target
C = 1.0 SVM regularization parameter
SVC Linear
lin_svc = svm.LinearSVC(C=C).fit(X,Y)
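Only the LinearSVC fit survives in these notes; the other three models referenced in the plotting loop below can be set up the same way. The parameters follow the scikit-learn documentation example the notes cite (gamma=0.7 and degree=3 are that example's choices):
from sklearn import svm            # if not already imported

svc = svm.SVC(kernel='linear', C=C).fit(X, Y)
rbf_svc = svm.SVC(kernel='rbf', gamma=0.7, C=C).fit(X, Y)      # Gaussian Radial Basis Function
poly_svc = svm.SVC(kernel='poly', degree=3, C=C).fit(X, Y)     # 3rd-degree polynomial kernel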
Now that we have fitted the four models, we will go ahead and begin the process of setting up the visual plots.
Note: This example is taken from the Sci Kit Learn Documentation.
First we define a mesh to plot in. We define the max and min of the plot for the y and x axes by the smallest and largest
features in the data set. We can use numpy's built-in meshgrid method to construct our plot:
h = 0.02 Set the step size
x_min = X[:, 0].min() - 1 Set the X-axis min and max
x_max = X[:, 0].max() + 1
y_min = X[:, 1].min() - 1 Set the Y-axis min and max
y_max = X[:, 1].max() + 1
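A sketch of the meshgrid call that the loop below relies on (the step size h and axis limits come from the lines above):
# Build the coordinate grid that each classifier will predict over
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))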
Finally we will go through each model, set its position as a subplot, then scatter the data points and draw a contour of
the decision boundaries.
# Use enumerate for a count
for i, clf in enumerate((svc, lin_svc, rbf_svc, poly_svc)):
    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max] x [y_min, y_max].
    plt.figure(figsize=(15,15))
    # Set the subplot position (Size = 2 by 2, position defined by i count)
    plt.subplot(2, 2, i + 1)
    # Subplot spacing
    plt.subplots_adjust(wspace=0.4, hspace=0.4)
    # Define Z as the prediction, note the use of ravel to format the arrays
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    # Reshape the prediction back onto the mesh and color in the decision regions
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=0.5)
    # Scatter the original two-feature data points on top of the regions
    plt.scatter(X[:, 0], X[:, 1], c=Y)
plt.show()
Supervised Learning using NAÏVE BAYES CLASSIFIERS
This section requires understanding of mathematical notation.
Capital pi is used to show the product of sequences:
∏_{i=1}^{4} i = 1 ∙ 2 ∙ 3 ∙ 4 = 24
arg max f(x) the argument of the maximum (arg max or argmax) is the set of inputs that correspond to the maximum
outputs of a function. The set may be empty, have one element, or have multiple elements.
Bayes' Theorem
Assesses the likelihood of A given B when you know the overall likelihood of A, of B, and the likelihood of B given A.
P(A|B) = P(B|A) P(A) / P(B)
When you don't know the overall likelihood of B, use
P(A|B) = P(B|A) P(A) / [ P(B|A) P(A) + P(B|not A) P(not A) ]
Naïve Bayes Equation
P(y | x1, …, xn) = P(y) ∏_{i=1}^{n} P(xi | y) / P(x1, …, xn)
In other words, the probability of y given a specific set of conditions x is equal to the overall probability of y times the
product of all the individual, independent conditions x given y divided by the overall probability of the conditions.
Constructing a classifier from the probability model (and ultimately, a decision rule)
The goal here is to use the Naïve Bayes probability model to drive a decision model that selects the most probable class
for y based on given attributes x. Picking the hypothesis that is most probable is known as the maximum a posteriori or
MAP decision rule. Selecting the appropriate classifier depends on our assumption of the distributions of attributes x.
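A minimal sketch of Gaussian Naive Bayes in scikit-learn, run here on the iris data from earlier (the dataset choice and the split parameters are assumptions; GaussianNB is the class the Gaussian Naive Bayes heading refers to):
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.cross_validation import train_test_split
from sklearn import metrics

iris = load_iris()
X, Y = iris.data, iris.target

model = GaussianNB()
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=1)

model.fit(X_train, Y_train)
predicted = model.predict(X_test)
print metrics.accuracy_score(Y_test, predicted)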
For more info on Naïve Bayes:
1.) SciKit Learn Documentation
2.) Naive Bayes with NLTK
3.) Wikipedia on Naive Bayes
4.) Andrew Ng's Class Notes
5.) Andrew Ng's Video Lecture on Naive Bayes
6.) UC Berkeley Lecture by Pieter Abbeel
DECISION TREES and RANDOM FORESTS
Random Forests are a classic example of an "ensemble learner", made up of many weak learners (decision trees).
Visualization Function
def visualize_tree(classifier, X, y, boundaries=True, xlim=None, ylim=None):
    '''
    Visualizes a Decision Tree.
    INPUTS: Classifier Model, X, y, optional x/y limits.
    OUTPUTS: Meshgrid visualization for boundaries of the Decision Tree
    '''
    # Fit the X and y data to the tree
    classifier.fit(X, y)
    # Define the Z by the predictions (this will color in the mesh grid)
    Z = classifier.predict(np.c_[xx.ravel(), yy.ravel()])
    # Set Limits
    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)
(The full function also builds the xx/yy meshgrid over those limits and defines a recursive plot_boundaries helper that draws the split lines; the fragment below is the helper's feature-1 branch:)
        elif tree.feature[i] == 1:
            plt.plot(xlim, [tree.threshold[i], tree.threshold[i]], '-k')
            plot_boundaries(tree.children_left[i], xlim,
                            [ylim[0], tree.threshold[i]])
            plot_boundaries(tree.children_right[i], xlim,
                            [tree.threshold[i], ylim[1]])
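The sample data that X and y refer to below is sketched here with scikit-learn's blob generator; the exact parameters are an assumption on my part:
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_blobs

# Four 2-D clusters with enough spread that their boundaries overlap a little
X, y = make_blobs(n_samples=500, centers=4, random_state=8, cluster_std=2.4)

plt.figure(figsize=(6, 6))
plt.scatter(X[:, 0], X[:, 1], c=y, s=50)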
# Set model variable
clf = DecisionTreeClassifier(max_depth=2,random_state=0) ON LEFT
clf = DecisionTreeClassifier(max_depth=4,random_state=0) ON RIGHT
# Show Boundaries
visualize_tree(clf,X,y)
Notice how changing the depth of the decision tree causes the boundaries to change substantially! If we pay close attention
to the second model we can begin to see evidence of over-fitting. This basically means that if we were to try to predict a
new point, the result would be influenced more by the noise than the signal.
So how do we address this issue? The answer is by creating an ensemble of decision trees.
Random Forests
Ensemble Methods essentially average the results of many individual estimators which over-fit the data. The resulting
estimates are much more robust and accurate than the individual estimates which make them up! One of the most
common ensemble methods is the Random Forest, in which the ensemble is made up of many decision trees which are
in some way perturbed. Let's see how we can use Sci-Kit Learn to create a random forest (it's actually very simple!)
Note that n_estimators stands for the number of trees to use. You would intuitively expect that using more decision trees
would be better, but after a certain number of trees (somewhere between 100-400 depending on your data) the
benefit in accuracy of adding more estimators significantly decreases and just becomes a load on your CPU.
from sklearn.ensemble import RandomForestClassifier
# n_estimators sets the number of trees
clf = RandomForestClassifier(n_estimators=100,random_state=0)
# Get rid of boundaries to avoid error
visualize_tree(clf,X,y,boundaries=False)
You can see that the random forest has been able to pick up features that the Decision Tree was not able to (although
we must be careful of over-fitting with Random Forests too!)
While a visual is nice, a better way to evaluate our model would be with a train/test split, if we had real data!
Now let's use a Random Forest Regressor to create a fitted regression; obviously a standard linear regression approach
wouldn't work here. And if we didn't know anything about the true nature of the model, polynomial or sinusoidal
regression would be tedious.
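The underlying data is sketched here as a plausible reconstruction of the noisy, multi-period sine data the regressor is fit to; the two frequencies and the noise level are assumptions:
from sklearn.ensemble import RandomForestRegressor

def sin_model(x, sigma=0.2):
    '''Sum of a fast and a slow sine oscillation, plus optional Gaussian noise'''
    noise = sigma * np.random.randn(len(x))
    return np.sin(5 * x) + np.sin(0.5 * x) + noise

# 100 random x points between 0 and 10, with noisy y values
x = 10 * np.random.rand(100)
y = sin_model(x)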
# X points
xfit = np.linspace(0, 10, 1000)
# Model
rfr = RandomForestRegressor(100)
# Fit Model (Format array for y with [:,None])
rfr.fit(x[:, None], y)
# Set predicted points
yfit = rfr.predict(xfit[:, None])
# Set real points (the model function)
ytrue = sin_model(xfit, 0)
# Plot
plt.figure(figsize=(16,8))
plt.errorbar(x, y, 0.1, fmt='o')
plt.plot(xfit, yfit, '-r');
plt.plot(xfit, ytrue, '-k', alpha=0.5);
As you can see, the non-parametric random forest model is flexible enough to fit the multi-period data,
without us even specifying a multi-period model!
This is a tradeoff between simplicity and thinking about what your data actually is.
Unsupervised Learning – NATURAL LANGUAGE PROCESSING
import nltk
Then, download the SMS spam collection dataset from the UCI datasets:
messages = [line.rstrip() for line in
open('smsspamcollection/SMSSpamCollection')]
print len(messages)
5574
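The groupby call below assumes the corpus has been read into a pandas DataFrame with label and message columns; a minimal sketch (the tab separator and column names match how the UCI file is laid out):
import pandas as pd

messages = pd.read_csv('smsspamcollection/SMSSpamCollection', sep='\t',
                       names=['label', 'message'])
messages.head()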
use messages.groupby('label').describe() to see that 4825 messages are "ham", 747 are "spam"
Feature Engineering
Drilling down on specific features of our dataset. First we'll tackle length:
messages['length'] = messages['message'].apply(len) adds a "length" column
import matplotlib.pyplot as plt
%matplotlib inline
messages['length'].plot(bins=50, kind='hist');
Shows that most messages fall between 0-200 characters, but some may be as large as 1000.
messages.length.describe()
Tells us the longest message is 910 characters
messages[messages['length'] == 910]['message'].iloc[0]
Shows us a particularly verbose love note.
Let's plot lengths for each label separately:
messages.hist(column='length', by='label', bins=50,figsize=(10,4));
Shows us that "ham" peaks before 150 characters, while "spam" has a peak between 150-200.
Text Pre-processing
In order to classify the corpus we need to convert text content into some sort of numerical feature vector.
A simple method is the bag-of-words approach, where each unique word in a text will be represented by one number.
To massage the data we'll first strip punctuation using the string module, then the stopwords using one of NLTK's
libraries:
import string
from nltk.corpus import stopwords

def text_process(mess):
    """
    Takes in a string of text, then performs the following:
    1. Remove all punctuation (Note: spaces are NOT removed)
    2. Remove all stopwords
    3. Returns a list of the cleaned text
    """
    nopunc = [char for char in mess if char not in string.punctuation]
    nopunc = ''.join(nopunc)
    return [word for word in nopunc.split() if word.lower() not in
            stopwords.words('english')]
Continuing Normalization
Removing punctuation and stop words are steps toward normalizing the data. We can continue normalizing with tools
such as stemming (reducing words to their roots to improve counts, for example "traveling" == "travel") or distinguishing
by part of speech. For more info: http://www.nltk.org/book/
Vectorization
We have converted the messages into lists of token words (lemmas). Now we need to convert the lists into numerical
vectors. We'll do that in three steps using the bag-of-words model:
1. Count how many times a word occurs in each message (known as term frequency)
2. Weigh the counts, so that frequent tokens get lower weight (inverse document frequency)
3. Normalize the vectors to unit length, to abstract from the original text length (L2 norm)
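Step 1 can be done with scikit-learn's CountVectorizer, using the text_process function above as its analyzer. A minimal sketch; the variable names are my own:
from sklearn.feature_extraction.text import CountVectorizer

# Step 1: term frequencies, tokenizing each message with text_process
bow_transformer = CountVectorizer(analyzer=text_process).fit(messages['message'])
messages_bow = bow_transformer.transform(messages['message'])
print 'Shape of Sparse Matrix: ', messages_bow.shape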
Term Frequency – Inverse Document Frequency (TF-IDF)
The tf-idf weight is a weight often used in information retrieval and text mining. This weight is a statistical measure used
to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to
the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Variations
of the tf-idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's
relevance given a user query.
One of the simplest ranking functions is computed by summing the tf-idf for each query term; many more sophisticated
ranking functions are variants of this simple model.
Typically, the tf-idf weight is composed by two terms: the first computes the normalized Term Frequency (TF), aka. the
number of times a word appears in a document, divided by the total number of words in that document; the second
term is the Inverse Document Frequency (IDF), computed as the logarithm of the number of the documents in the
corpus divided by the number of documents where the specific term appears.
Example:
Consider a document containing 100 words wherein the word cat appears 3 times.
The term frequency (tf) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the word cat
appears in one thousand of these. Then, the inverse document frequency (idf) is calculated as
log(10,000,000 / 1,000) = 4. Thus, the Tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12.
Training a Model
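A minimal sketch of how model training can proceed from the bag-of-words counts, using TfidfTransformer for steps 2-3 and a Multinomial Naive Bayes classifier (the classifier choice is an assumption, though it is the usual one for bag-of-words spam detection):
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Steps 2 & 3: TF-IDF weighting (TfidfTransformer L2-normalizes by default)
tfidf_transformer = TfidfTransformer().fit(messages_bow)
messages_tfidf = tfidf_transformer.transform(messages_bow)

# Train a spam/ham classifier on the TF-IDF vectors
spam_detect_model = MultinomialNB().fit(messages_tfidf, messages['label'])

# Quick sanity check on a single message
print 'predicted:', spam_detect_model.predict(messages_tfidf[3])[0]
print 'expected:', messages['label'][3]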
APPENDIX I – SciKit Learn Boston Dataset:
print boston.DESCR
Boston House Prices dataset
Notes
------
Data Set Characteristics:
Investopedia defines hedonic pricing as "A model identifying price factors according to the premise that price is
determined both by internal characteristics of the good being sold and external factors affecting it."
APPENDIX II: FOR FURTHER RESEARCH
Value of coefficients:
1. What determines the strength of a coefficient? absolute value? p-value?
2. How much does the coefficient affect the target value if the associated variable doesn't change much?
i.e., a low coefficient on # of rooms may have a greater effect when rooms can easily double from 2 to 4, whereas
a high coefficient on NOX may not matter much if the variation over our sample set is only 1 or 2 ppm.
3. What about orders of magnitude? A small change to a big number may outweigh a big change to a small one.
4. What about non-linear functions? There should be a greater effect between 3 and 4 room houses than between
7 and 8 rooms.
Multicollinearity:
Need to play around with this some more (perhaps with a SciKit Learn dataset instead of statsmodels). How best to
choose which category to drop? In the affair example, we arbitrarily dropped the only negative-coefficient category
(student). What happens if we choose a different one / multiples / none?