
Machine Learning


Laboratory

Python Programming – Machine Learning

1. Data Preprocessing

# Data Preprocessing

# Importing the libraries


import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset


dataset = pd.read_csv('C:\\Urbino_MachineLearning\\0.DataPreprocessing\\Data.csv')
X = dataset.iloc[:, :-1].values #select all but last column of data frame
y = dataset.iloc[:, 3].values

# Taking care of missing data


from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])

# Encoding categorical data


# Encoding the Independent Variable
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()
# Encoding the Dependent Variable
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
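
'''
Note: in newer scikit-learn releases (0.20+) the categorical_features argument of
OneHotEncoder has been removed. A minimal sketch of an equivalent encoding with
ColumnTransformer, assuming the same X with the country values in column 0 (it works
whether that column still holds the raw strings or the label-encoded integers):
'''
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# one-hot encode column 0, pass the remaining columns through unchanged
ct = ColumnTransformer([('country', OneHotEncoder(), [0])], remainder='passthrough')
X_encoded = ct.fit_transform(X)  # may be returned as a sparse matrix depending on the version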

'''
# How to mark invalid or corrupt values as missing in your dataset.
# How to remove rows with missing data from your dataset.
# How to impute missing values with mean values in your dataset.

# The Pima Indians Diabetes Dataset involves predicting the onset of diabetes within 5 years
# in Pima Indians given medical details.
# It is a binary (2-class) classification problem.
# The number of observations for each class is not balanced.
# There are 768 observations with 8 input variables and 1 output variable.
# The variable names are as follows:
# 0. Number of times pregnant.
# 1. Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
# 2. Diastolic blood pressure (mm Hg).
# 3. Triceps skinfold thickness (mm).
# 4. 2-Hour serum insulin (mu U/ml).
# 5. Body mass index (weight in kg/(height in m)^2).
# 6. Diabetes pedigree function.
# 7. Age (years).
# 8. Class variable (0 or 1).

# This dataset is known to have missing values.

# Specifically, there are missing observations for some columns
# that are marked as a zero value.
# We can corroborate this by the definition of those columns
# and the domain knowledge that a zero value is invalid for those measures,
# e.g. a zero for body mass index or blood pressure is invalid.

# 2. Mark Missing Values

# In this section, we will look at how we can identify and mark values as missing.
# We can load the dataset as a Pandas DataFrame
# and print summary statistics on each attribute.
'''

from pandas import read_csv


dataset = read_csv('C:\\Urbino_MachineLearning\\0.DataPreprocessing\\pima-indians-diabetes.data.csv', header=None)
print(dataset.describe())

'''
This is useful.
We can see that there are columns that have a minimum value of zero (0).
On some columns, a value of zero does not make sense and indicates an invalid or missing value.
Specifically, the following columns have an invalid zero minimum value:
1: Plasma glucose concentration
2: Diastolic blood pressure
3: Triceps skinfold thickness
4: 2-Hour serum insulin
5: Body mass index
Let's confirm this by looking at the raw data; the example prints the first 20 rows of data.
'''
from pandas import read_csv
import numpy
dataset = read_csv('C:\\Urbino_MachineLearning\\0.DataPreprocessing\\pima-indians-diabetes.data.csv', header=None)
# print the first 20 rows of data
print(dataset.head(20))
'''
We can get a count of the number of missing values in each of these columns.
We can do this by marking all of the values in the subset of the DataFrame we are interested
in that have zero values as True. We can then count the number of True values in each column.
'''

from pandas import read_csv


dataset = read_csv('C:\\Urbino_MachineLearning\\0.DataPreprocessing\\pima-indians-diabetes.data.csv', header=None)
print((dataset[[1,2,3,4,5]] == 0).sum())

'''
We can see that columns 1, 2 and 5 have just a few zero values,
whereas columns 3 and 4 show a lot more, nearly half of the rows.
This highlights that different "missing value" strategies may be needed for different columns,
e.g. to ensure that there are still a sufficient number of records left to train a predictive model.
In Python, specifically Pandas, NumPy and Scikit-Learn, we mark missing values as NaN.
Values with a NaN value are ignored by operations like sum, count, etc.
We can mark values as NaN easily with the Pandas DataFrame
by using the replace() function on a subset of the columns we are interested in.
After we have marked the missing values,
we can use the isnull() function to mark all of the NaN values in the dataset as True
and get a count of the missing values for each column.
'''

from pandas import read_csv


import numpy
dataset = read_csv('C:\\Urbino_MachineLearning\\0.DataPreprocessing\\pima-indians-diabetes.data.csv', header=None)
# mark zero values as missing or NaN
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, numpy.NaN)
# count the number of NaN values in each column
print(dataset.isnull().sum())
'''
Running the example prints the number of missing values in each column.
We can see that columns 1 to 5 have the same number of missing values as the zero values identified above.
This is a sign that we have marked the identified missing values correctly.
Below is the same example, except we print the first 20 rows of data.
'''

from pandas import read_csv


import numpy
dataset = read_csv('C:\\Urbino_MachineLearning\\0.DataPreprocessing\\pima-indians-diabetes.data.csv', header=None)
# mark zero values as missing or NaN
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, numpy.NaN)
# print the first 20 rows of data
print(dataset.head(20))

'''
Running the example, we can clearly see NaN values in columns 2, 3, 4 and 5.
There are only 5 missing values in column 1,
so it is not surprising we did not see an example in the first 20 rows.
It is clear from the raw data that marking the missing values had the intended effect.
Before we look at handling missing values,
let's first demonstrate that having missing values in a dataset can cause problems.
'''

'''
3. Missing Values Cause Problems
Having missing values in a dataset can cause errors with some machine learning algorithms.
In this section, we will try to evaluate the Linear Discriminant Analysis (LDA) algorithm
on the dataset with missing values.
This is an algorithm that does not work when there are missing values in the dataset.
The example below marks the missing values in the dataset,
as we did in the previous section, then attempts to evaluate LDA using 3-fold cross validation
and print the mean accuracy.
'''

from pandas import read_csv


import numpy
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
dataset = read_csv('C:\\Urbino_MachineLearning\\0.DataPreprocessing\\pima-indians-diabetes.data.csv', header=None)
# mark zero values as missing or NaN
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, numpy.NaN)
# split dataset into inputs and outputs
values = dataset.values
X = values[:,0:8]
y = values[:,8]
# evaluate an LDA model on the dataset using k-fold cross validation
model = LinearDiscriminantAnalysis()
kfold = KFold(n_splits=3, random_state=7)
result = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')
print(result.mean())

'''
Running the example results in an error, as follows:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
This is as we expect.
We are prevented from evaluating an LDA algorithm (and other algorithms)
on the dataset with missing values.
'''

'''
Now, we can look at methods to handle the missing values.
'''

'''
4. Remove Rows With Missing Values
The simplest strategy for handling missing data is to remove records that contain a missing value.
We can do this by creating a new Pandas DataFrame with the rows containing missing values removed.
Pandas provides the dropna() function that can be used to drop either columns or rows with missing data.
We can use dropna() to remove all rows with missing data, as follows:
'''

from pandas import read_csv


import numpy
dataset = read_csv('C:\\Urbino_MachineLearning\\0.DataPreprocessing\\pima-indians-diabetes.data.csv', header=None)
# mark zero values as missing or NaN
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, numpy.NaN)
# drop rows with missing values
dataset.dropna(inplace=True)
# summarize the number of rows and columns in the dataset
print(dataset.shape)

'''
Running this example, we can see that the number of rows has been aggressively cut from 768
in the original dataset to 392, with all rows containing a NaN removed.
(392, 9)
We now have a dataset that we could use to evaluate an algorithm sensitive to missing values, like LDA.
'''

from pandas import read_csv


import numpy
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
dataset = read_csv('C:\\Urbino_MachineLearning\\0.DataPreprocessing\\pima-indians-diabetes.data.csv', header=None)
# mark zero values as missing or NaN
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, numpy.NaN)
# drop rows with missing values
dataset.dropna(inplace=True)
# split dataset into inputs and outputs
values = dataset.values
X = values[:,0:8]
y = values[:,8]
# evaluate an LDA model on the dataset using k-fold cross validation
model = LinearDiscriminantAnalysis()
kfold = KFold(n_splits=3, random_state=7)
result = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')
print(result.mean())

'''
The example runs successfully and prints the accuracy of the model.
0.78582892934
Removing rows with missing values can be too limiting on some predictive modeling problems;
an alternative is to impute missing values.
'''
'''
5. Impute Missing Values
Imputing refers to using a model to replace missing values.
There are many options we could consider when replacing a missing value, for example:
A constant value that has meaning within the domain, such as 0, distinct from all other values.
A value from another randomly selected record.
A mean, median or mode value for the column.
A value estimated by another predictive model.
Any imputing performed on the training dataset will have to be performed on new data
in the future when predictions are needed from the finalized model.
This needs to be taken into consideration when choosing how to impute the missing values.
For example, if you choose to impute with mean column values, these mean column values
will need to be stored to file for later use on new data that has missing values
(see the short sketch after the fillna() example below).
Pandas provides the fillna() function for replacing missing values with a specific value.
For example, we can use fillna() to replace missing values with the mean value for each
column, as follows:
'''

from pandas import read_csv


import numpy
dataset = read_csv('C:\\Urbino_MachineLearning\\0.DataPreprocessing\\pima-indians-diabetes.data.csv', header=None)
# mark zero values as missing or NaN
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, numpy.NaN)
# fill missing values with mean column values
dataset.fillna(dataset.mean(), inplace=True)
# count the number of NaN values in each column
print(dataset.isnull().sum())

'''
Running the example provides a count of the number of missing values in each column,
showing zero missing values.
'''
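
'''
As noted above, the column means used for imputation have to be kept so that they can be
re-applied to new data at prediction time. A minimal sketch, assuming the dataset loaded
above and a hypothetical new_data frame with the same column layout:
'''
train_means = dataset[[1,2,3,4,5]].mean()  # per-column means, computed once on the training data
# later, for new data with the same layout (hypothetical variable):
# new_data[[1,2,3,4,5]] = new_data[[1,2,3,4,5]].replace(0, numpy.NaN)
# new_data[[1,2,3,4,5]] = new_data[[1,2,3,4,5]].fillna(train_means)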

'''
The scikit-learn library provides the Imputer() pre-processing class that can be used to
replace missing values.
It is a flexible class that allows you to specify the value to replace
(it can be something other than NaN) and the technique used to replace it
(such as mean, median, or mode).
The Imputer class operates directly on the NumPy array instead of the DataFrame.
The example below uses the Imputer class to replace missing values with the mean of each
column, then prints the number of NaN values in the transformed matrix.
'''

from pandas import read_csv


from sklearn.preprocessing import Imputer
import numpy
dataset = read_csv('C:\\Urbino_MachineLearning\\0.DataPreprocessing\\pima-indians-diabetes.data.csv', header=None)
# mark zero values as missing or NaN
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, numpy.NaN)
# fill missing values with mean column values
values = dataset.values
imputer = Imputer()
transformed_values = imputer.fit_transform(values)
# count the number of NaN values in each column
print(numpy.isnan(transformed_values).sum())

'''
Running the example shows that all NaN values were imputed successfully.
In either case, we can train algorithms sensitive to NaN values on the transformed dataset,
such as LDA.
The example below shows the LDA algorithm trained on the Imputer-transformed dataset.
'''

from pandas import read_csv


import numpy
from sklearn.preprocessing import Imputer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
dataset = read_csv('C:\\Urbino_MachineLearning\\0.DataPreprocessing\\pima-indians-diabetes.data.csv', header=None)
# mark zero values as missing or NaN
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, numpy.NaN)
# split dataset into inputs and outputs
values = dataset.values
X = values[:,0:8]
y = values[:,8]
# fill missing values with mean column values
imputer = Imputer()
transformed_X = imputer.fit_transform(X)
# evaluate an LDA model on the dataset using k-fold cross validation
model = LinearDiscriminantAnalysis()
kfold = KFold(n_splits=3, random_state=7)
result = cross_val_score(model, transformed_X, y, cv=kfold, scoring='accuracy')
print(result.mean())

'''
Running the example prints the accuracy of LDA on the transformed dataset.
0.766927083333
'''
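
'''
Note: the Imputer class used above was removed in scikit-learn 0.22 and replaced by
SimpleImputer. A minimal sketch of the same mean imputation with the newer API,
assuming X and y from the code above:
'''
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=numpy.nan, strategy='mean')
transformed_X = imputer.fit_transform(X)  # same shape as X, with NaNs replaced by column means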

'''
Algorithms that Support Missing Values
Not all algorithms fail when there is missing data.
There are algorithms that can be made robust to missing data,
such as k-Nearest Neighbors, which can ignore a column from a distance measure when a value is missing.
There are also algorithms that can use the missing value as a unique and different value
when building the predictive model, such as classification and regression trees.
Sadly, the scikit-learn implementations of decision trees and k-Nearest Neighbors
are not robust to missing values, although support is being considered.
Nevertheless, this remains an option if you use another algorithm implementation
(such as xgboost) or develop your own implementation.
'''
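
'''
For example, if the xgboost package is installed, its gradient boosted trees handle NaN
inputs natively, so the NaN-marked data can be evaluated directly. A minimal sketch,
assuming the X and y (still containing NaNs) defined in the code above:
'''
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score

model = XGBClassifier()  # tree booster; missing values are routed during split finding
print(cross_val_score(model, X, y, cv=3, scoring='accuracy').mean())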

# Rescale Data
# When your data is composed of attributes with varying scales, many machine learning
# algorithms can benefit from rescaling the attributes so that they all have the same scale.
# Often this is referred to as normalization, and attributes are typically rescaled into
# the range between 0 and 1.
# This is useful for optimization algorithms used in the core of machine learning
# algorithms, like gradient descent. It is also useful for algorithms that weight inputs,
# like regression and neural networks, and algorithms that use distance measures,
# like K-Nearest Neighbors.
# You can rescale your data with scikit-learn using the MinMaxScaler class.


# Rescale data (between 0 and 1)

import pandas
import scipy
import numpy
from sklearn.preprocessing import MinMaxScaler
from pandas import read_csv

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-
indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age',
'class']
#dataframe = pandas.read_csv(url, names=names)
dataframe = read_csv('C:\\Urbino_MachineLearning\\0.DataPreprocessing\\pima-
indians-diabetes.data.csv', header=None, names = names)
array = dataframe.values

# separate array into input and output components


X = array[:,0:8]
Y = array[:,8]
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)

# summarize transformed data


numpy.set_printoptions(precision=3)
print(rescaledX[0:5,:])
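
'''
For reference, MinMaxScaler with feature_range=(0, 1) is equivalent to the column-wise
formula x_scaled = (x - min) / (max - min). A quick sketch of the same rescaling with
plain NumPy on the same X:
'''
manualX = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(manualX[0:5,:])  # should match rescaledX above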

# Standardize Data
# Standardization is a useful technique to transform attributes with a Gaussian
# distribution and differing means and standard deviations to a standard Gaussian
# distribution with a mean of 0 and a standard deviation of 1.
# It is most suitable for techniques that assume a Gaussian distribution in the input
# variables and work better with rescaled data, such as linear regression, logistic
# regression and linear discriminant analysis.
# You can standardize data using scikit-learn with the StandardScaler class.

# Standardize data (0 mean, 1 stdev)


from sklearn.preprocessing import StandardScaler
from pandas import read_csv
import pandas
import numpy
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

#dataframe = pandas.read_csv(url, names=names)


dataframe = read_csv('C:\\Urbino_MachineLearning\\0.DataPreprocessing\\pima-indians-diabetes.data.csv', header=None, names = names)
array = dataframe.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)
# summarize transformed data
numpy.set_printoptions(precision=3)
print(rescaledX[0:5,:])
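
'''
A quick check of what StandardScaler did: each standardized column should now have
(approximately) zero mean and unit standard deviation.
'''
print(rescaledX.mean(axis=0))
print(rescaledX.std(axis=0))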

# Normalize Data
# Normalizing in scikit-learn refers to rescaling each observation (row) to have a length
# of 1 (called a unit norm in linear algebra).
# This preprocessing can be useful for sparse datasets (lots of zeros) with attributes of
# varying scales when using algorithms that weight input values, such as neural networks,
# and algorithms that use distance measures, such as K-Nearest Neighbors.
# You can normalize data in Python with scikit-learn using the Normalizer class.

# Normalize data (length of 1)


from sklearn.preprocessing import Normalizer
from pandas import read_csv
import pandas
import numpy
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv('C:\\Urbino_MachineLearning\\0.DataPreprocessing\\pima-indians-diabetes.data.csv', header=None, names = names)
array = dataframe.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
scaler = Normalizer().fit(X)
normalizedX = scaler.transform(X)
# summarize transformed data
numpy.set_printoptions(precision=3)
print(normalizedX[0:5,:])
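
'''
Unlike the two scalers above, Normalizer works row by row: each observation is divided by
its own L2 norm, so every row of normalizedX has length 1. A quick check:
'''
print(numpy.linalg.norm(normalizedX, axis=1)[0:5])  # should print values of 1.0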

# Binarize Data (Make Binary)


# You can transform your data using a binary threshold.
# All values above the threshold are marked 1 and all equal to or below are marked 0.
# This is called binarizing, or thresholding, your data.
# It can be useful when you have probabilities that you want to turn into crisp values.
# It is also useful for feature engineering, when you want to add new features that
# indicate something meaningful.
# You can create new binary attributes in Python using scikit-learn with the
# Binarizer class.

# binarization
from sklearn.preprocessing import Binarizer
import pandas
import numpy
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-
indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age',
'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
binarizer = Binarizer(threshold=0.0).fit(X)
binaryX = binarizer.transform(X)
# summarize transformed data
numpy.set_printoptions(precision=3)
print(binaryX[0:5,:])
2. Data Visualization

'''
!!!!!!!!!!!!!!!!!!!! LINE Charts
'''

# libraries
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# create data
values = np.cumsum(np.random.randn(1000,1)) # cumulative sum

# use the plot function
plt.plot(values)

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
df = pd.DataFrame({'x': range(1,11), 'y': np.random.randn(10) })

plt.plot( 'x', 'y', data=df, color='skyblue')
plt.show()

plt.plot( 'x', 'y', data=df, color='skyblue', alpha=0.3)
plt.show()

plt.plot( 'x', 'y', data=df, linestyle='dashed')
plt.show()

'''
The following 4 styles are available:
'''

plt.plot( [1,1.1,1,1.1,1], linestyle='-' , linewidth=4)
plt.text(1.5, 1.3, "linestyle = '-' ", horizontalalignment='left', size='medium', color='C0', weight='semibold')
plt.plot( [2,2.1,2,2.1,2], linestyle='--' , linewidth=4 )
plt.text(1.5, 2.3, "linestyle = '--' ", horizontalalignment='left', size='medium', color='C1', weight='semibold')
plt.plot( [3,3.1,3,3.1,3], linestyle='-.' , linewidth=4 )
plt.text(1.5, 3.3, "linestyle = '-.' ", horizontalalignment='left', size='medium', color='C2', weight='semibold')
plt.plot( [4,4.1,4,4.1,4], linestyle=':' , linewidth=4 )
plt.text(1.5, 4.3, "linestyle = ':' ", horizontalalignment='left', size='medium', color='C3', weight='semibold')
plt.axis('off')
plt.show()

plt.plot( 'x', 'y', data=df, linewidth=22)
plt.show()
'''
multiple line chart
'''
# libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Data
df = pd.DataFrame({'x': range(1,11), 'y1': np.random.randn(10), 'y2': np.random.randn(10)+range(1,11), 'y3': np.random.randn(10)+range(11,21) })

# multiple line plot
plt.plot( 'x', 'y1', data=df, marker='o', markerfacecolor='blue', markersize=12, color='skyblue', linewidth=4)
plt.plot( 'x', 'y2', data=df, marker='', color='olive', linewidth=2)
plt.plot( 'x', 'y3', data=df, marker='', color='olive', linewidth=2, linestyle='dashed', label="toto")
plt.legend()

# libraries and data
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Make a data frame
df = pd.DataFrame({'x': range(1,11), 'y1': np.random.randn(10), 'y2': np.random.randn(10)+range(1,11), 'y3': np.random.randn(10)+range(11,21), 'y4': np.random.randn(10)+range(6,16), 'y5': np.random.randn(10)+range(4,14)+(0,0,0,0,0,0,0,-3,-8,-6), 'y6': np.random.randn(10)+range(2,12), 'y7': np.random.randn(10)+range(5,15), 'y8': np.random.randn(10)+range(4,14) })

#plt.style.use('fivethirtyeight')
plt.style.use('seaborn-darkgrid')
my_dpi = 96
plt.figure(figsize=(480/my_dpi, 480/my_dpi), dpi=my_dpi)

# multiple line plot
for column in df.drop('x', axis=1):
    plt.plot(df['x'], df[column], marker='', color='grey', linewidth=1, alpha=0.4)

# Now redo the interesting curve, but bigger and with a distinct color
plt.plot(df['x'], df['y5'], marker='', color='orange', linewidth=4, alpha=0.7)

# Change xlim
plt.xlim(0,12)

# Let's annotate the plot
num = 0
for i in df.values[9][1:]:
    num += 1
    name = list(df)[num]
    if name != 'y5':
        plt.text(10.2, i, name, horizontalalignment='left', size='small', color='grey')

# And add a special annotation for the group we are interested in
plt.text(10.2, df.y5.tail(1), 'Mr Orange', horizontalalignment='left', size='small', color='orange')

# Add titles
plt.title("Evolution of Mr Orange vs other students", loc='left', fontsize=12, fontweight=0, color='orange')
plt.xlabel("Time")
plt.ylabel("Score")
'''
Students over time
'''

# libraries and data
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Make a data frame
df = pd.DataFrame({'x': range(1,11), 'y1': np.random.randn(10), 'y2': np.random.randn(10)+range(1,11), 'y3': np.random.randn(10)+range(11,21), 'y4': np.random.randn(10)+range(6,16), 'y5': np.random.randn(10)+range(4,14)+(0,0,0,0,0,0,0,-3,-8,-6), 'y6': np.random.randn(10)+range(2,12), 'y7': np.random.randn(10)+range(5,15), 'y8': np.random.randn(10)+range(4,14), 'y9': np.random.randn(10)+range(4,14) })

# Initialize the figure
plt.style.use('seaborn-darkgrid')

# create a color palette
palette = plt.get_cmap('Set1')

# multiple line plot
num = 0
for column in df.drop('x', axis=1):
    num += 1

    # Find the right spot on the plot
    plt.subplot(3,3, num)

    # Plot the lineplot
    plt.plot(df['x'], df[column], marker='', color=palette(num), linewidth=1.9, alpha=0.9, label=column)

    # Same limits for everybody!
    plt.xlim(0,10)
    plt.ylim(-2,22)

    # Not ticks everywhere
    if num in range(7):
        plt.tick_params(labelbottom='off')
    if num not in [1,4,7]:
        plt.tick_params(labelleft='off')

    # Add title
    plt.title(column, loc='left', fontsize=12, fontweight=0, color=palette(num) )

# general title
plt.suptitle("How the 9 students improved\nthese past few days?", fontsize=13, fontweight=0, color='black', style='italic', y=1.02)

# Axis title
plt.text(0.5, 0.02, 'Time', ha='center', va='center')
plt.text(0.06, 0.5, 'Note', ha='center', va='center', rotation='vertical')
'''
!!!!!!!!!!!!!!!! SCATTER PLOTS
'''

# library & dataset
import seaborn as sns
df = sns.load_dataset('iris')

# use the function regplot to make a scatterplot with a regression fit
sns.regplot(x=df["sepal_length"], y=df["sepal_width"])
#sns.plt.show()

# Without regression fit:
sns.regplot(x=df["sepal_length"], y=df["sepal_width"], fit_reg=False)
#sns.plt.show()

# library & dataset
import seaborn as sns
df = sns.load_dataset('iris')

# Use the 'hue' argument to provide a factor variable
sns.lmplot( x="sepal_length", y="sepal_width", data=df, fit_reg=False, hue='species', legend=False)

# Move the legend to an empty part of the plot
plt.legend(loc='lower right')

#sns.plt.show()

# Map a marker per group
# library & dataset
import seaborn as sns
df = sns.load_dataset('iris')

# give a list to the marker argument
sns.lmplot( x="sepal_length", y="sepal_width", data=df, fit_reg=False, hue='species', legend=False, markers=["o", "x", "1"])

# Move the legend to an empty part of the plot
plt.legend(loc='lower right')

#sns.plt.show()

'''
Use another palette
Several palettes are available, for example: deep, muted, bright, pastel, dark, colorblind.
'''

# library & dataset
import seaborn as sns
df = sns.load_dataset('iris')

# Use the 'palette' argument
sns.lmplot( x="sepal_length", y="sepal_width", data=df, fit_reg=False, hue='species', legend=False, palette="Set2")

# Move the legend to an empty part of the plot
plt.legend(loc='lower right')

#sns.plt.show()

# library & dataset
import seaborn as sns
df = sns.load_dataset('iris')

# Provide a dictionary to the palette argument
sns.lmplot( x="sepal_length", y="sepal_width", data=df, fit_reg=False, hue='species', legend=False, palette=dict(setosa="#9b59b6", virginica="#3498db", versicolor="#95a5a6"))

# Move the legend to an empty part of the plot
plt.legend(loc='lower right')

#sns.plt.show()
'''
!!!!!!!!!!!!!!!!!! AVOID OVERLAPPING
'''

# libraries and data
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
plt.style.use('seaborn')

# Dataset:
df = pd.DataFrame({'x': np.random.normal(10, 1.2, 20000), 'y': np.random.normal(10, 1.2, 20000), 'group': np.repeat('A',20000) })
tmp1 = pd.DataFrame({'x': np.random.normal(14.5, 1.2, 20000), 'y': np.random.normal(14.5, 1.2, 20000), 'group': np.repeat('B',20000) })
tmp2 = pd.DataFrame({'x': np.random.normal(9.5, 1.5, 20000), 'y': np.random.normal(15.5, 1.5, 20000), 'group': np.repeat('C',20000) })
df = df.append(tmp1).append(tmp2)

# plot
plt.plot( 'x', 'y', data=df, linestyle='', marker='o')
plt.xlabel('Value of X')
plt.ylabel('Value of Y')
plt.title('Overplotting looks like that:', loc='left')

'''
Reduce dot size
'''
plt.plot( 'x', 'y', data=df, linestyle='', marker='o', markersize=0.7)
plt.xlabel('Value of X')
plt.ylabel('Value of Y')
plt.title('Overplotting? Try to reduce the dot size', loc='left')

'''
Transparency
'''
# Plot with transparency
plt.plot( 'x', 'y', data=df, linestyle='', marker='o', markersize=3, alpha=0.05, color="purple")

# Titles
plt.xlabel('Value of X')
plt.ylabel('Value of Y')
plt.title('Overplotting? Try to use transparency', loc='left')

'''
Sampling
'''
# Sample 1000 random lines
df_sample = df.sample(1000)

# Make the plot with this subset
plt.plot( 'x', 'y', data=df_sample, linestyle='', marker='o')

# titles
plt.xlabel('Value of X')
plt.ylabel('Value of Y')
plt.title('Overplotting? Sample your data', loc='left')

'''
Filtering
'''
# Filter the data
df_filtered = df[ df['group'] == 'A']
# Plot the whole dataset
plt.plot( 'x', 'y', data=df, linestyle='', marker='o', markersize=1.5, color="grey", alpha=0.3, label='other group')

# Add the group to study
plt.plot( 'x', 'y', data=df_filtered, linestyle='', marker='o', markersize=1.5, alpha=0.3, label='group A')

# Add titles and legend
plt.legend(markerscale=8)
plt.xlabel('Value of X')
plt.ylabel('Value of Y')
plt.title('Overplotting? Show a specific group', loc='left')

'''
Plot categorical data! There are a few main plot types for this:
factorplot
boxplot
violinplot
stripplot
swarmplot
barplot
countplot
'''

import seaborn as sns

tips = sns.load_dataset('tips')
tips.head()

'''
barplot and countplot
These very similar plots allow you to get aggregate data off a categorical feature in your data.
barplot is a general plot that allows you to aggregate the categorical data based off some function, by default the mean:
'''

sns.barplot(x='sex',y='total_bill',data=tips)

import numpy as np

'''
You can change the estimator object to your own function that converts a vector to a scalar:
'''

sns.barplot(x='sex',y='total_bill',data=tips,estimator=np.std)

'''
countplot
This is essentially the same as barplot, except the estimator is explicitly counting the
number of occurrences, which is why we only pass the x value:
'''

sns.countplot(x='sex',data=tips)

'''
boxplot and violinplot
boxplots and violinplots are used to show the distribution of categorical data.
A box plot (or box-and-whisker plot) shows the distribution of quantitative data
in a way that facilitates comparisons between variables
or across levels of a categorical variable.
The box shows the quartiles of the dataset
while the whiskers extend to show the rest of the distribution,
except for points that are determined to be "outliers" using a method that is a function
of the inter-quartile range.
'''

sns.boxplot(x="day", y="total_bill", data=tips,palette='rainbow')

# Can do entire dataframe with orient='h'
sns.boxplot(data=tips,palette='rainbow',orient='h')

sns.boxplot(x="day", y="total_bill", hue="smoker",data=tips, palette="coolwarm")

'''
violinplot
A violin plot plays a similar role as a box and whisker plot.
It shows the distribution of quantitative data across several levels of one (or more)
categorical variables such that those distributions can be compared.
Unlike a box plot, in which all of the plot components correspond to actual datapoints,
the violin plot features a kernel density estimation of the underlying distribution.
'''

sns.violinplot(x="day", y="total_bill", data=tips,palette='rainbow')
sns.violinplot(x="day", y="total_bill", data=tips,hue='sex',palette='Set1')
sns.violinplot(x="day", y="total_bill", data=tips,hue='sex',split=True,palette='Set1')

'''
stripplot and swarmplot
The stripplot will draw a scatterplot where one variable is categorical.
A strip plot can be drawn on its own,
but it is also a good complement to a box or violin plot in cases
where you want to show all observations
along with some representation of the underlying distribution.
The swarmplot is similar to stripplot(),
but the points are adjusted (only along the categorical axis)
so that they don't overlap.
This gives a better representation of the distribution of values,
although it does not scale as well to large numbers of observations
(both in terms of the ability to show all the points
and in terms of the computation needed to arrange them).
'''

sns.stripplot(x="day", y="total_bill", data=tips)
sns.stripplot(x="day", y="total_bill", data=tips,jitter=True)
sns.stripplot(x="day", y="total_bill", data=tips,jitter=True,hue='sex',palette='Set1')
sns.stripplot(x="day", y="total_bill", data=tips,jitter=True,hue='sex',palette='Set1',split=True)

sns.swarmplot(x="day", y="total_bill", data=tips)
sns.swarmplot(x="day", y="total_bill",hue='sex',data=tips, palette="Set1", split=True)

'''
Combining Categorical Plots
'''

sns.violinplot(x="tip", y="day", data=tips,palette='rainbow')
sns.swarmplot(x="tip", y="day", data=tips,color='black',size=3)

'''
factorplot
factorplot is the most general form of a categorical plot.
It can take in a kind parameter to adjust the plot type:
'''

sns.factorplot(x='sex',y='total_bill',data=tips,kind='bar')

'''
!!!!!!!!!! Regression plots
'''

import seaborn as sns
tips = sns.load_dataset('tips')
tips.head()
#lmplot()
sns.lmplot(x='total_bill',y='tip',data=tips)
sns.lmplot(x='total_bill',y='tip',data=tips,hue='sex')
sns.lmplot(x='total_bill',y='tip',data=tips,hue='sex',palette='coolwarm')

'''
Working with Markers
lmplot kwargs get passed through to regplot, which is a more general form of lmplot().
regplot has a scatter_kws parameter that gets passed to plt.scatter.
So you want to set the s parameter in that dictionary, which corresponds
(a bit confusingly) to the squared markersize.
In other words you end up passing a dictionary with the base matplotlib arguments,
in this case, s for size of a scatter plot.
In general, you probably won't remember this off the top of your head,
but instead reference the documentation.
'''

# http://matplotlib.org/api/markers_api.html
sns.lmplot(x='total_bill',y='tip',data=tips,hue='sex',palette='coolwarm',
           markers=['o','v'],scatter_kws={'s':100})

'''
Using a Grid
We can add more variable separation through columns and rows with the use of a grid.
Just indicate this with the col or row arguments:
'''

sns.lmplot(x='total_bill',y='tip',data=tips,col='sex')
sns.lmplot(x="total_bill", y="tip", row="sex", col="time",data=tips)
sns.lmplot(x='total_bill',y='tip',data=tips,col='day',hue='sex',palette='coolwarm')

'''
Aspect and Size
Seaborn figures can have their size and aspect ratio adjusted with the size and aspect parameters:
'''
sns.lmplot(x='total_bill',y='tip',data=tips,col='day',hue='sex',palette='coolwarm',
           aspect=0.6,size=8)

'''
!!!!!!!!!!! Matrix ScatterPlots
'''
'''
Matrix Plots
Matrix plots allow you to plot data as color-encoded matrices
and can also be used to indicate clusters within the data.
'''

import seaborn as sns
flights = sns.load_dataset('flights')
tips = sns.load_dataset('tips')
tips.head()
flights.head()

'''
Heatmap
In order for a heatmap to work properly, your data should already be in a matrix form;
the sns.heatmap function basically just colors it in for you. For example:
'''

tips.head()

# Matrix form for correlation data
tips.corr()
sns.heatmap(tips.corr())
sns.heatmap(tips.corr(),cmap='coolwarm',annot=True)
flights.pivot_table(values='passengers',index='month',columns='year')

pvflights = flights.pivot_table(values='passengers',index='month',columns='year')
sns.heatmap(pvflights)

sns.heatmap(pvflights,cmap='magma',linecolor='white',linewidths=1)

'''
clustermap
The clustermap uses hierarchical clustering to produce a clustered version of the heatmap.
For example:
'''

sns.clustermap(pvflights)

'''
Notice now how the years and months are no longer in order;
instead they are grouped by similarity in value (passenger count).
That means we can begin to infer things from this plot,
such as August and July being similar (makes sense, since they are both summer travel months).
'''

# More options to get the information a little clearer, like normalization
sns.clustermap(pvflights,cmap='coolwarm',standard_scale=1)

'''
!!!!!!!!!!!!!!!!!!!!!! PROJECT IRIS
'''
'''
Data Exploration and Analysis
'''

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(color_codes=True)

df = pd.read_csv('C:\\Urbino_MachineLearning\\1. DataVisualization\\iris.csv')
df.head()

col_name = ['sepal length', 'sepal width', 'petal length', 'petal width', 'class']
df.columns = col_name
df.head()

'''
Iris Data from Seaborn
'''

iris = sns.load_dataset('iris')
iris.head()

df.describe()
iris.describe()
print(iris.info())
print(iris.groupby('species').size())

'''
Visualisation
'''

sns.pairplot(iris, hue='species', size=3, aspect=1);
iris.hist(edgecolor='black', linewidth=1.2, figsize=(12,8));
plt.show();

plt.figure(figsize=(12,8));
plt.subplot(2,2,1)
sns.violinplot(x='species', y='sepal_length', data=iris)
plt.subplot(2,2,2)
sns.violinplot(x='species', y='sepal_width', data=iris)
plt.subplot(2,2,3)
sns.violinplot(x='species', y='petal_length', data=iris)
plt.subplot(2,2,4)
sns.violinplot(x='species', y='petal_width', data=iris);

iris.boxplot(by='species', figsize=(12,8));

pd.plotting.scatter_matrix(iris, figsize=(12,10))
plt.show()

sns.pairplot(iris, hue="species",diag_kind="kde");

3. Regression

a. Multiple Linear Regression

# Multiple Linear Regression

# Importing the libraries


import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset


dataset = pd.read_csv('C:\\Urbino_MachineLearning\\2. Regression\\MultipleRegression\\50_Startups.csv')
X = dataset.iloc[:, :-1].values #take all but profit
y = dataset.iloc[:, 4].values #profit

# Encoding categorical data


from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder = LabelEncoder()
X[:, 3] = labelencoder.fit_transform(X[:, 3]) #0, 1, 2, etc
onehotencoder = OneHotEncoder(categorical_features = [3])
X = onehotencoder.fit_transform(X).toarray()

# Avoiding the Dummy Variable Trap


X = X[:, 1:] # the independent variables are multicollinear: two or more variables are highly correlated

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Feature Scaling
"""from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
sc_y = StandardScaler()
y_train = sc_y.fit_transform(y_train)"""

# Fitting Multiple Linear Regression to the Training set


from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Predicting the Test set results


y_pred = regressor.predict(X_test)

import statsmodels.api as sm

X = np.append(arr = np.ones((50,1)).astype(int), values = X, axis = 1)


X_opt = X[:,[0,1,2,3,4,5]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()
X_opt = X[:,[0,1,3,4,5]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()
X_opt = X[:,[0,3,4,5]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()
X_opt = X[:,[0,3,5]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()
X_opt = X[:,[0,3]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()

import statsmodels.api as sm
def backwardElimination(x, sl):
    numVars = len(x[0])
    for i in range(0, numVars):
        regressor_OLS = sm.OLS(y, x).fit()
        maxVar = max(regressor_OLS.pvalues).astype(float)
        if maxVar > sl:
            for j in range(0, numVars - i):
                if (regressor_OLS.pvalues[j].astype(float) == maxVar):
                    x = np.delete(x, j, 1)
    regressor_OLS.summary()
    return x

SL = 0.05
X_opt = X[:, [0, 1, 2, 3, 4, 5]]
X_Modeled = backwardElimination(X_opt, SL)
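
'''
A quick way to inspect the outcome of the automatic backward elimination (a small sketch,
assuming y and X_Modeled from above): refit OLS on the reduced matrix and print its summary.
'''
regressor_OLS = sm.OLS(endog = y, exog = X_Modeled).fit()
print(regressor_OLS.summary())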

b. Multiple polynomial regression

# -*- coding: utf-8 -*-


"""
Spyder Editor

This is a temporary script file.


"""

import pandas as pd
from pandas import DataFrame
import matplotlib.pyplot as plt
from sklearn import linear_model
import statsmodels.api as sm
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
from scipy import stats
from sklearn.metrics import mean_squared_error
Stock_Market = {'Year': ['2017','2017','2017','2017','2017','2017','2017','2017','2017','2017','2017','2017','2016','2016','2016','2016','2016','2016','2016','2016','2016','2016','2016','2016'],
                'Month': ['12','11','10','9','8','7','6','5','4','3','2','1','12','11','10','9','8','7','6','5','4','3','2','1'],
                'Interest_Rate': ['2.75','2.5','2.5','2.5','2.5','2.5','2.5','2.25','2.25','2.25','2','2','2','1.75','1.75','1.75','1.75','1.75','1.75','1.75','1.75','1.75','1.75','1.75'],
                'Unemployment_Rate': ['5.3','5.3','5.3','5.3','5.4','5.6','5.5','5.5','5.5','5.6','5.7','5.9','6','5.9','5.8','6.1','6.2','6.1','6.1','6.1','5.9','6.2','6.2','6.1'],
                'Stock_Index_Price': ['1464','1394','1357','1293','1256','1254','1234','1195','1159','1167','1130','1075','1047','965','943','958','971','949','884','866','876','822','704','719']
                }

Stock_Market_Unknown = {'Year': ['2018','2018'],
                        'Month': ['2', '1'],
                        'Interest_Rate': ['2.75','2.5'],
                        'Unemployment_Rate': ['5.3','5.3'],
                        'Stock_Index_Price': ['0','0']
                        }

df = DataFrame(Stock_Market, columns=['Year','Month','Interest_Rate','Unemployment_Rate','Stock_Index_Price'])
df_predict = DataFrame(Stock_Market_Unknown, columns=['Year','Month','Interest_Rate','Unemployment_Rate','Stock_Index_Price'])

#check for linearity

plt.scatter(df['Interest_Rate'].astype(float),
df['Stock_Index_Price'].astype(float), color='red')
plt.title('Stock Index Price Vs Interest Rate', fontsize=14)
plt.xlabel('Interest Rate', fontsize=14)
plt.ylabel('Stock Index Price', fontsize=14)
plt.grid(True)
plt.show()

plt.scatter(df['Unemployment_Rate'].astype(float),
df['Stock_Index_Price'].astype(float), color='green')
plt.title('Stock Index Price Vs Unemployment Rate', fontsize=14)
plt.xlabel('Unemployment Rate', fontsize=14)
plt.ylabel('Stock Index Price', fontsize=14)
plt.grid(True)
plt.show()

X = df[['Interest_Rate','Unemployment_Rate']].astype(float)
Y = df['Stock_Index_Price'].astype(float)

x_toPredict = df_predict[['Interest_Rate','Unemployment_Rate']].astype(float)

#get coefficients for a quadratic


poly = PolynomialFeatures(degree=2)

X_ = poly.fit_transform(X)
test = poly.fit(X)
test.get_feature_names(X.columns)

predict_ = poly.fit_transform(x_toPredict)

#here we can remove polynomial orders we don't want


#for instance I'm removing the `x` component
X_ = np.delete(X_,(1),axis=1)
predict_ = np.delete(predict_,(1),axis=1)

#generate the regression object


clf = linear_model.LinearRegression()

#perform the actual regression


clf.fit(X_, Y)
clf.predict(predict_)
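
'''
A small sketch to pair each fitted coefficient with its polynomial feature name, using the
same get_feature_names call as above (the x0 column was deleted from X_, so its name is
dropped as well before zipping); assumes test, clf and X from the code above.
'''
feature_names = [name for i, name in enumerate(test.get_feature_names(X.columns)) if i != 1]
for name, coef in zip(feature_names, clf.coef_):
    print(name, coef)
print('intercept:', clf.intercept_)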

'''

# Instantiate
lg = LinearRegression()

# Fit
model = lg.fit(X_, y_train)

# prediction with sklearn

train_predictions = lg.predict(X_train)

params = np.append(lg.intercept_,lg.coef_)
MSE = mean_squared_error(y_train, train_predictions)
lg.score(X_test, y_test)
'''
'''
x_train = sm.add_constant(X_train)
model = sm.OLS(y_train, x_train)
results = model.fit()
results.summary()
'''

c. Robust linear regression

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression

df = pd.read_csv('C:\\Urbino_MachineLearning\\2. Regression\\RobustLinearRegression\\housing.data', delim_whitespace=True, header=None)

df.head()
col_name = ['CRIM', 'ZN' , 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']

df.columns = col_name
df.head()
df.describe()

sns.pairplot(df, size=1.5);
plt.show()

col_study = ['ZN', 'INDUS', 'NOX', 'RM']

sns.pairplot(df[col_study], size=2.5);
plt.show()

col_study = ['PTRATIO', 'B', 'LSTAT', 'MEDV']

sns.pairplot(df[col_study], size=2.5);
plt.show()

pd.options.display.float_format = '{:,.2f}'.format

df.corr()

plt.figure(figsize=(16,10))
sns.heatmap(df.corr(), annot=True)
plt.show()

plt.figure(figsize=(16,10))
sns.heatmap(df[['CRIM', 'ZN', 'INDUS', 'CHAS', 'MEDV']].corr(), annot=True)
plt.show()

df.head()
X = df['RM'].values.reshape(-1,1)
y = df['MEDV'].values
model = LinearRegression()
model.fit(X, y)

model.score(X,y)
model.coef_
model.intercept_

plt.figure(figsize=(12,10));
sns.regplot(X, y);
plt.xlabel('average number of rooms per dwelling')
plt.ylabel("Median value of owner-occupied homes in $1000's")
plt.show();
sns.jointplot(x='RM', y='MEDV', data=df, kind='reg', size=10);
plt.show();

X = df['LSTAT'].values.reshape(-1,1)
y = df['MEDV'].values
model.fit(X, y)
plt.figure(figsize=(12,10));
sns.regplot(X, y);
plt.xlabel('% lower status of the population')
plt.ylabel("Median value of owner-occupied homes in $1000's")
plt.show();

sns.jointplot(x='LSTAT', y='MEDV', data=df, kind='reg', size=10);


plt.show();

df.head()

X = df['RM'].values.reshape(-1,1)
y = df['MEDV'].values

from sklearn.linear_model import RANSACRegressor


ransac = RANSACRegressor()
ransac.fit(X, y)
inlier_mask = ransac.inlier_mask_
outlier_mask = np.logical_not(inlier_mask)
np.arange(3, 10, 1)
line_X = np.arange(3, 10, 1)
line_y_ransac = ransac.predict(line_X.reshape(-1, 1))
sns.set(style='darkgrid', context='notebook')
plt.figure(figsize=(12,10));
plt.scatter(X[inlier_mask], y[inlier_mask],
c='blue', marker='o', label='Inliers')
plt.scatter(X[outlier_mask], y[outlier_mask],
c='brown', marker='s', label='Outliers')
plt.plot(line_X, line_y_ransac, color='red')
plt.xlabel('average number of rooms per dwelling')
plt.ylabel("Median value of owner-occupied homes in $1000's")
plt.legend(loc='upper left')
plt.show()

ransac.estimator_.coef_
ransac.estimator_.intercept_

X = df['LSTAT'].values.reshape(-1,1)
y = df['MEDV'].values
ransac.fit(X, y)
inlier_mask = ransac.inlier_mask_
outlier_mask = np.logical_not(inlier_mask)
line_X = np.arange(0, 40, 1)
line_y_ransac = ransac.predict(line_X.reshape(-1, 1))

sns.set(style='darkgrid', context='notebook')
plt.figure(figsize=(12,10));
plt.scatter(X[inlier_mask], y[inlier_mask],
c='blue', marker='o', label='Inliers')
plt.scatter(X[outlier_mask], y[outlier_mask],
c='brown', marker='s', label='Outliers')
plt.plot(line_X, line_y_ransac, color='red')
plt.xlabel('% lower status of the population')
plt.ylabel("Median value of owner-occupied homes in $1000's")
plt.legend(loc='upper right')
plt.show()

d. Decision trees regression

# Decision Tree Regression

# Importing the libraries


import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset


dataset = pd.read_csv('C:\\Urbino_MachineLearning\\2. Regression\\DecisionTreeRegression\\Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values

# Splitting the dataset into the Training set and Test set
"""from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2,
random_state = 0)"""

# Feature Scaling
"""from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
sc_y = StandardScaler()
y_train = sc_y.fit_transform(y_train)"""

# Fitting Decision Tree Regression to the dataset


from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor(random_state = 0)
regressor.fit(X, y)

# Predicting a new result


y_pred = regressor.predict([[6.5]]) # predict() expects a 2D array of samples

# Visualising the Decision Tree Regression results (higher resolution)


X_grid = np.arange(min(X), max(X), 0.01)
X_grid = X_grid.reshape((len(X_grid), 1))
plt.scatter(X, y, color = 'red')
plt.plot(X_grid, regressor.predict(X_grid), color = 'blue')
plt.title('Truth or Bluff (Decision Tree Regression)')
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()

e. Random forest regression

# Random Forest Regression

# Importing the libraries


import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset


dataset = pd.read_csv('C:\\Urbino_MachineLearning\\2. Regression\\RandomForestRegression\\Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values

# Splitting the dataset into the Training set and Test set
"""from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2,
random_state = 0)"""

# Feature Scaling
"""from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
sc_y = StandardScaler()
y_train = sc_y.fit_transform(y_train)"""

# Fitting Random Forest Regression to the dataset


from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators = 10, random_state = 0)
regressor.fit(X, y)

# Predicting a new result


y_pred = regressor.predict([[6.5]]) # predict() expects a 2D array of samples

# Visualising the Random Forest Regression results (higher resolution)


X_grid = np.arange(min(X), max(X), 0.01)
X_grid = X_grid.reshape((len(X_grid), 1))
plt.scatter(X, y, color = 'red')
plt.plot(X_grid, regressor.predict(X_grid), color = 'blue')
plt.title('RF Regression')
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()

f. SVR Regression

# SVR

# Importing the libraries


import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset


dataset = pd.read_csv('C:\\Urbino_MachineLearning\\2. Regression\\SvrRegression\\Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values

# Splitting the dataset into the Training set and Test set
"""from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2,
random_state = 0)"""

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
sc_y = StandardScaler()
X = sc_X.fit_transform(X)
y = sc_y.fit_transform(y.reshape(-1, 1)).ravel() # StandardScaler expects a 2D array

# Fitting SVR to the dataset


from sklearn.svm import SVR
regressor = SVR(kernel = 'rbf')
regressor.fit(X, y)

# Predicting a new result


y_pred = regressor.predict(sc_X.transform(np.array([[6.5]]))) # scale the query point the same way as X
y_pred = sc_y.inverse_transform(y_pred.reshape(-1, 1)) # back to the original salary scale

# Visualising the SVR results


plt.scatter(X, y, color = 'red')
plt.plot(X, regressor.predict(X), color = 'blue')
plt.title('Truth or Bluff (SVR)')
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()

# Visualising the SVR results (for higher resolution and smoother curve)
X_grid = np.arange(min(X), max(X), 0.01) # choice of 0.01 instead of 0.1 step because the data is feature scaled
X_grid = X_grid.reshape((len(X_grid), 1))
plt.scatter(X, y, color = 'red')
plt.plot(X_grid, regressor.predict(X_grid), color = 'blue')
plt.title('Truth or Bluff (SVR)')
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()
g. More on non-linear regression

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

import pandas as pd
from sklearn.tree import DecisionTreeRegressor

from sklearn.datasets import load_boston


boston_data = load_boston()
df = pd.DataFrame(boston_data.data, columns=boston_data.feature_names)

df.head()

y = boston_data.target

X = df[['LSTAT']].values

tree = DecisionTreeRegressor(max_depth=5)

tree.fit(X, y)

sort_idx = X.flatten().argsort()

plt.figure(figsize=(10,8))
plt.scatter(X[sort_idx], y[sort_idx])
plt.plot(X[sort_idx], tree.predict(X[sort_idx]), color='k')

plt.xlabel('LSTAT')
plt.ylabel('MEDV');

tree = DecisionTreeRegressor(max_depth=2)
tree.fit(X, y)
sort_idx = X.flatten().argsort()
plt.figure(figsize=(10,8))
plt.scatter(X[sort_idx], y[sort_idx])
plt.plot(X[sort_idx], tree.predict(X[sort_idx]), color='k')

plt.xlabel('LSTAT')
plt.ylabel('MEDV');

from sklearn.model_selection import train_test_split


from sklearn.metrics import mean_squared_error, r2_score
X = df.values
#y = df['MEDV'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor(n_estimators=500, criterion='mse',
                               random_state=42, n_jobs=-1)

forest.fit(X_train, y_train)

y_train_pred = forest.predict(X_train)

y_test_pred = forest.predict(X_test)

print("MSE train: {0:.4f}, test: {1:.4f}".\


format(mean_squared_error(y_train, y_train_pred),
mean_squared_error(y_test, y_test_pred)))

print("R^2 train: {0:.4f}, test: {1:.4f}".\


format(r2_score(y_train, y_train_pred),
r2_score(y_test, y_test_pred)))

from sklearn.ensemble import AdaBoostRegressor

ada = AdaBoostRegressor(DecisionTreeRegressor(max_depth=4),
n_estimators=500, random_state=42)

ada.fit(X_train, y_train)

y_train_pred = ada.predict(X_train)

y_test_pred = ada.predict(X_test)

print("MSE train: {0:.4f}, test: {1:.4f}".\


format(mean_squared_error(y_train, y_train_pred),
mean_squared_error(y_test, y_test_pred)))

print("R^2 train: {0:.4f}, test: {1:.4f}".\


format(r2_score(y_train, y_train_pred),
r2_score(y_test, y_test_pred)))
ada.feature_importances_
df.columns
result = pd.DataFrame(ada.feature_importances_, df.columns)
result.columns = ['feature']
result.sort_values(by='feature', ascending=False)

result.sort_values(by='feature', ascending=False).plot(kind='bar');
forest.feature_importances_

result = pd.DataFrame(forest.feature_importances_, df.columns)


result.columns = ['feature']
result.sort_values(by='feature', ascending=False).plot(kind='bar');
tree = DecisionTreeRegressor(max_depth=3)
tree.fit(X_train, y_train)

y_train_pred = tree.predict(X_train)
y_test_pred = tree.predict(X_test)

print("MSE train: {0:.4f}, test: {1:.4f}".\


format(mean_squared_error(y_train, y_train_pred),
mean_squared_error(y_test, y_test_pred)))

print("R^2 train: {0:.4f}, test: {1:.4f}".\


format(r2_score(y_train, y_train_pred),
r2_score(y_test, y_test_pred)))

result = pd.DataFrame(tree.feature_importances_, df.columns)


result.columns = ['feature']
result.sort_values(by='feature', ascending=False).plot(kind='bar');

4. Classification

a. Logistic (multinominal) Regression

'''
In this project we will be working with a fake advertising data set,
indicating whether or not a particular internet user clicked on an
Advertisement on a company website.
We will try to create a model that will predict whether or not
they will click on an ad based on the features of that user.

This data set contains the following features:


'Daily Time Spent on Site': consumer time on site in minutes
'Age': customer age in years
'Area Income': Avg. Income of geographical area of consumer
'Daily Internet Usage': Avg. minutes a day consumer is on the internet
'Ad Topic Line': Headline of the advertisement
'City': City of consumer
'Male': Whether or not consumer was male
'Country': Country of consumer
'Timestamp': Time at which consumer clicked on Ad or closed window
'Clicked on Ad': 0 or 1 indicating whether the user clicked on the Ad

'''

'''
Import Libraries
'''

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

'''
Get the Data
Read in the advertising.csv file and set it to a data frame called ad_data.
'''

ad_data = pd.read_csv('C:\\Urbino_MachineLearning\\3. Classification\\LogisticRegression\\advertising.csv')

'''
Check the head of ad_data
'''

ad_data.head()

'''
Use info and describe() on ad_data
'''

ad_data.info()
ad_data.describe()

'''
Exploratory Data Analysis
Let's use seaborn to explore the data!
Try recreating the plots shown below!
Create a histogram of the Age
'''

sns.set_style('whitegrid')
ad_data['Age'].hist(bins=30)
plt.xlabel('Age')
'''
Create a jointplot showing Area Income versus Age.
'''
sns.jointplot(x='Age',y='Area Income',data=ad_data)

'''
Create a jointplot showing the kde (kernel distribution estimates)
distributions of Daily Time spent on site vs. Age.
'''
sns.jointplot(x='Age',y='Daily Time Spent on Site',data=ad_data,color='red',kind='kde');

'''
Create a jointplot of 'Daily Time Spent on Site' vs. 'Daily Internet Usage'
'''
sns.jointplot(x='Daily Time Spent on Site',y='Daily Internet Usage',data=ad_data,color='green')
sns.pairplot(ad_data,hue='Clicked on Ad',palette='bwr')

'''
Logistic Regression
Now it's time to do a train test split, and train our model!
You'll have the freedom here to choose columns that you want to train on!
Split the data into training set and testing set using train_test_split
'''

from sklearn.model_selection import train_test_split


X = ad_data[['Daily Time Spent on Site', 'Age', 'Area Income', 'Daily Internet Usage', 'Male']]
y = ad_data['Clicked on Ad']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

'''
Train and fit a logistic regression model on the training set.
'''

from sklearn.linear_model import LogisticRegression


logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)
# Output (estimator with default parameters):
# LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
#                    intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
#                    penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
#                    verbose=0, warm_start=False)

'''
Predictions and Evaluations
Now predict values for the testing data.
'''

predictions = logmodel.predict(X_test)

'''
Create a classification report for the model.
'''

from sklearn.metrics import classification_report, confusion_matrix
cm = confusion_matrix(y_test, predictions)
print(classification_report(y_test,predictions))
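
'''
Optional sketch (not part of the original exercise): the confusion matrix stored in cm
above is computed but never displayed. A quick way to inspect it, reusing the libraries
already imported:
'''

print(cm)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()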

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

train = pd.read_csv('D:\\Urbino_MachineLearning\\Classification\\LogisticRegression\\titanic_train.csv')
train.head()

sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')

sns.set_style('whitegrid')
sns.countplot(x='Survived',data=train,palette='RdBu_r')

sns.set_style('whitegrid')
sns.countplot(x='Survived',hue='Sex',data=train,palette='RdBu_r')

sns.set_style('whitegrid')
sns.countplot(x='Survived',hue='Pclass',data=train,palette='rainbow')

sns.distplot(train['Age'].dropna(),kde=False,color='darkred',bins=30)

train['Age'].hist(bins=30,color='darkred',alpha=0.7)

sns.countplot(x='SibSp',data=train)

train['Fare'].hist(color='green',bins=40,figsize=(8,4))

import cufflinks as cf
cf.go_offline()
train['Fare'].iplot(kind='hist',bins=30,color='green')

'''
Data Cleaning
We want to fill in missing age data instead of just dropping the missing age
data rows.
One way to do this is by filling in the mean age of all the passengers
(imputation).
However we can be smarter about this and check the average age by passenger
class. For example:
'''

plt.figure(figsize=(12, 7))
sns.boxplot(x='Pclass',y='Age',data=train,palette='winter')

'''
We can see the wealthier passengers in the higher classes tend to be older,
which makes sense.
We'll use these average age values to impute based on Pclass for Age.
'''
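
'''
Optional check (not in the original notebook): the constants 37, 29 and 24 used in
impute_age below are approximately the median Age per passenger class, and can be
recomputed directly from the data before imputing.
'''

# median age by Pclass; on the usual Titanic training file this gives roughly 37, 29 and 24
print(train.groupby('Pclass')['Age'].median())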

def impute_age(cols):
    Age = cols[0]
    Pclass = cols[1]
    if pd.isnull(Age):
        if Pclass == 1:
            return 37
        elif Pclass == 2:
            return 29
        else:
            return 24
    else:
        return Age

train['Age'] = train[['Age','Pclass']].apply(impute_age,axis=1)

sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')
train.drop('Cabin',axis=1,inplace=True)
train.head()
train.dropna(inplace=True)

'''
Converting Categorical Features
We'll need to convert categorical features to dummy variables using pandas!
Otherwise our machine learning algorithm won't be able to directly take in
those features as inputs.
'''
train.info()

sex = pd.get_dummies(train['Sex'],drop_first=True)
embark = pd.get_dummies(train['Embarked'],drop_first=True)
train.drop(['Sex','Embarked','Name','Ticket'],axis=1,inplace=True)
train = pd.concat([train,sex,embark],axis=1)
train.head()

'''
Building a Logistic Regression model
Let's start by splitting our data into a training set and test set
(there is another test.csv file that you can play around with
in case you want to use all this data for training).
'''
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(train.drop('Survived',axis=1),
                                                     train['Survived'],
                                                     test_size=0.30,
                                                     random_state=101)

from sklearn.linear_model import LogisticRegression


logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)

# Output (estimator with default parameters):
# LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
#                    intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
#                    penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
#                    verbose=0, warm_start=False)
predictions = logmodel.predict(X_test)

from sklearn.metrics import classification_report


print(classification_report(y_test,predictions))

# Logistic Regression

# Importing the libraries


import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset


dataset = pd.read_csv('D:\\Urbino_MachineLearning\\Classification\\LogisticRegression\\Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Fitting Logistic Regression to the Training set


from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)

# Predicting the Test set results


y_pred = classifier.predict(X_test)

# Making the Confusion Matrix


from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

# Visualising the Training set results


from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Logistic Regression (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

# Visualising the Test set results


from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Logistic Regression (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

'''
Implementing Titanic Solution using LOGISTIC REGRESSION
'''

'''
About the features:

Here a small description for each features contained in the dataset:


– survival: Survival 0 = No, 1 = Yes (the feature that we are trying to
predict)
– pclass: A proxy for socio-economic status (1st = Upper, 2nd = Middle, 3rd =
Lower)
– Ticket class: 1 = 1st, 2 = 2nd, 3 = 3rd
– sibsp: number of siblings / spouses aboard the Titanic
– parch: number of parents / children aboard the Titanic (Some children
travelled only with a nanny, therefore parch=0 for them)
– ticket: Ticket number
– fare: Passenger fare
– cabin: Cabin number
– embarked: Port of Embarkation C = Cherbourg, Q = Queenstown, S = Southampton
'''

# Basics
import numpy as np
import pandas as pd

# Visualisation
import matplotlib.pyplot as plt
import seaborn as sns
# Preprocessing
import missingno as msno
from collections import OrderedDict
from sklearn.preprocessing import StandardScaler

# Sampling
from sklearn.model_selection import train_test_split

# Classifiier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Metrics
from sklearn.metrics import accuracy_score

sns.set(color_codes=True)
pal = sns.color_palette("Set2", 10)
sns.set_palette(pal)

TitanicTrain = pd.read_csv("C:\\Urbino_MachineLearning\\3. Classification\\LogisticRegression\\train.csv")
TitanicTrain.columns, TitanicTrain.shape

TitanicTrain.info()

'''
The dataset is composed of 2 float features, 5 integer features, and 6 object features.
We can also see that there are a few missing values in the “Age” column:
the describe function counts 714 values against 891 for the other columns.
Because we have few features we can use the missingno package, which displays
the completeness of the dataset. It looks like there are a lot of missing values
for “Age” and “Cabin” and only 2 for “Embarked”.
'''

msno.matrix(TitanicTrain)
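
'''
Optional check: besides the missingno matrix, a plain pandas count gives the number of
missing values per column numerically.
'''

print(TitanicTrain.isnull().sum())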

'''
Univariate Analysis
To get a better view of the data we are going to display our features with a
seaborn countplot, which shows the counts of observations in each categorical bin
using bars. The categorical features of our dataset are the integer and object columns.
We are going to separate our features into two lists: “categ” for the categorical
features and “conti” for the continuous features. “Age” and “Fare” are the only
two features that we can consider continuous. To plot the distribution of the
continuous features with seaborn we are going to use distplot. According to the
charts, there are no unreasonable values (above 100) for “Age”, but we can see that
“Fare” has a large scale and most of its values are between 0 and 100.
'''

categ = [ 'Pclass', 'Sex', 'SibSp', 'Parch', 'Embarked']


conti = ['Fare', 'Age']

#Distribution
fig = plt.figure(figsize=(30, 10))
for i in range(0, len(categ)):
    fig.add_subplot(3, 3, i+1)
    sns.countplot(x=categ[i], data=TitanicTrain);

for col in conti:
    fig.add_subplot(3, 3, i + 2)
    sns.distplot(TitanicTrain[col].dropna());
    i += 1

plt.show()
fig.clear()

'''
Bivariate Analysis
The next charts show the repartition of survival (and non-survival) for each of the
categ and conti features. We are going to use other kinds of charts to display the
relation between “Survived” and our features.
The 6th chart (the swarm plot) shows how survival splits by sex and age.
With the boxplot, we can see that there are no real outliers in the “Age” feature
(maybe 3-4 observations out of the frame, but nothing alarming).
Concerning the correlation between the features, the strongest correlations in
absolute value with “Survived” are “Fare” and “Pclass”.
The fact that “Fare” and “Pclass” are strongly correlated (in absolute value) is
consistent and shows that, a priori, people in the upper class spend more money
(to get a better place).
'''

'''
The next charts show the repartition of survival (and non-survival)
for each of the categ and conti features.
'''
fig = plt.figure(figsize=(30, 10))
i = 1
for col in categ:
    if col != 'Survived':
        fig.add_subplot(3, 3, i)
        sns.countplot(x=col, data=TitanicTrain, hue='Survived');
        i += 1

# Box plot survived x age


fig.add_subplot(3,3,6)
sns.swarmplot(x="Survived", y="Age", hue="Sex", data=TitanicTrain);
fig.add_subplot(3,3,7)
sns.boxplot(x="Survived", y="Age", data=TitanicTrain)

# fare and Survived


fig.add_subplot(3,3,8)
sns.violinplot(x="Survived", y="Fare", data=TitanicTrain)

# correlations with the new features


corr = TitanicTrain.drop(['PassengerId'], axis=1).corr()
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
cmap = sns.diverging_palette(220, 10, as_cmap=True)
fig.add_subplot(3,3,9)
sns.heatmap(corr, mask=mask, cmap=cmap, cbar_kws={"shrink": .5})
plt.show()
fig.clear()

'''
Feature engineering
We are going to create one new feature from the dataset.
It is possible to create a lot of new features with this data,
but here we are going to exploit the information we can find in the “Title”.
I advise you to implement a first model (a quick & dirty model) before
creating new features.
It is not always worthwhile to create a new feature, so you must quantify
(with the metrics) whether the creation has a positive impact on your model.
In this example we start from the assumption that the new feature has a
positive impact; we are not sure that the “Title” feature gives more
information than the Sex feature.
'''
title = ['Mlle', 'Mrs', 'Mr', 'Miss', 'Master', 'Don', 'Rev', 'Dr', 'Mme', 'Ms', 'Major', 'Col', 'Capt', 'Countess']

def ExtractTitle(name):
    tit = 'missing'
    for item in title:
        if item in name:
            tit = item
    if tit == 'missing':
        tit = 'Mr'
    return tit

TitanicTrain["Title"] = TitanicTrain.apply(lambda row:


ExtractTitle(row["Name"]),axis=1)
plt.figure(figsize=(13, 5))
fig.add_subplot(2,1,1)
sns.countplot(x='Title', data=TitanicTrain,hue='Survived');

'''
Impute missing value
In the first part we have already seen that there are missing values. As a
general rule, a good imputation for numerical features is the median.
Yet it is the same story as with feature engineering: it is more interesting
if you can test different imputations and find the values with the best impact
on your metrics. The median is a simple method; you could also implement a
machine learning method to impute the missing values. For the categorical
features we impute the missing values with the most frequent value.
'''

# Age
MedianAge = TitanicTrain.Age.median()
TitanicTrain.Age = TitanicTrain.Age.fillna(value=MedianAge)
# Embarked replace NaN with the mode value
ModeEmbarked = TitanicTrain.Embarked.mode()[0]
TitanicTrain.Embarked = TitanicTrain.Embarked.fillna(value=ModeEmbarked)
# Fare have 1 NaN missing value on the Submission dataset
MedianFare = TitanicTrain.Fare.median()
TitanicTrain.Fare = TitanicTrain.Fare.fillna(value=MedianFare)

'''
Encode Categorical features
Another important part of the preprocessing is the handling of categorical
features. There are several ways to do that, and some are better suited when a
feature has many categories (entity embeddings).
In our case we are going to use the get_dummies function, which transforms the
categorical features into binary features.
You can try sklearn's OneHotEncoder to do the same thing (a sketch follows below).
In our example the “Cabin” feature takes many distinct values,
so we have decided to binarize it before applying get_dummies.
get_dummies is not recommended if your categorical features have too many
categories; in that case you should look into the technique of entity embeddings.
'''
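
'''
Optional sketch (assuming scikit-learn >= 0.21 for the drop parameter): the same kind of
encoding with sklearn's OneHotEncoder, shown here on the already-imputed Embarked column
only, as an illustration; the pipeline below keeps using get_dummies.
'''

from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(drop='first')  # drop='first' mimics drop_first=True of get_dummies
embarked_dummies = ohe.fit_transform(TitanicTrain[['Embarked']]).toarray()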

# Cabin
TitanicTrain["Cabin"] = TitanicTrain.apply(lambda obs: "No" if
pd.isnull(obs['Cabin']) else "Yes", axis=1)
TitanicTrain =
pd.get_dummies(TitanicTrain,drop_first=True,columns=['Sex','Title','Cabin','Em
barked'])

'''
Scaling numerical features
The final part of our preprocessing!
The goal is to rescale the numerical features.
You can find many methods through this link:
http://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing.
We use a simple method for the features Fare and Age.
'''
scale = StandardScaler().fit(TitanicTrain[['Age', 'Fare']])
TitanicTrain[['Age', 'Fare']] = scale.transform(TitanicTrain[['Age', 'Fare']])
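
'''
Optional alternative (illustration only, not fed into the pipeline): MinMaxScaler, from
the same preprocessing module linked above, rescales each feature to the [0, 1] range
instead of standardizing it.
'''

from sklearn.preprocessing import MinMaxScaler

age_fare_01 = MinMaxScaler().fit_transform(TitanicTrain[['Age', 'Fare']])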

Target = TitanicTrain.Survived
Features = TitanicTrain.drop(['Survived', 'Name', 'Ticket', 'PassengerId'], axis=1)

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(Features, Target,
                                                     test_size=0.3, random_state=42)

MlRes = {}
def MlResult(model, score):
    MlRes[model] = score
    print(MlRes)

roc_curve_data = {}

def ConcatRocData(algoname, fpr, tpr, auc):
    data = [fpr, tpr, auc]
    roc_curve_data[algoname] = data

# Logistic Regression :
logi_reg = LogisticRegression()
# Fit the regressor to the training data
logi_reg.fit(X_train, y_train)
# Predict on the test data: y_pred
y_pred = logi_reg.predict(X_test)

# Score / Metrics

from sklearn.metrics import classification_report, confusion_matrix


cm = confusion_matrix(y_test, y_pred)

accuracy = logi_reg.score(X_test, y_test)


MlResult('Logistic Regression',accuracy)
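
'''
Optional sketch: the ConcatRocData helper defined above is not used in this snippet.
A minimal way to fill it, assuming the fitted logi_reg and the test split from above,
is to compute the ROC curve from the predicted probabilities.
'''

from sklearn.metrics import roc_curve, auc

y_proba = logi_reg.predict_proba(X_test)[:, 1]   # probability of the positive class (Survived = 1)
fpr, tpr, _ = roc_curve(y_test, y_proba)
ConcatRocData('Logistic Regression', fpr, tpr, auc(fpr, tpr))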

'''
The glass identification dataset has 7 different glass types as the target.
These glass types differ by their usage:
1.building_windows_float_processed
2.building_windows_non_float_processed
3.vehicle_windows_float_processed
4.vehicle_windows_non_float_processed (none in this database)
5.containers
6.tableware
7.headlamps
'''

import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn import metrics
from sklearn.model_selection import train_test_split
import plotly.graph_objs as go
import plotly.plotly as py
from plotly.graph_objs import *
py.sign_in('dragoscristea', 'XFCbmM9UbeJWVsYDMM9i')

# Dataset Path
DATASET_PATH = "C:\\Urbino_MachineLearning\\3.
Classification\\LogisticRegression\\glass.csv"

def scatter_with_color_dimension_graph(feature, target, layout_labels):
    """
    Scatter with color dimension graph to visualize the density of the
    given feature with the target
    :param feature:
    :param target:
    :param layout_labels:
    :return:
    """
    trace1 = go.Scatter(
        y=feature,
        mode='markers',
        marker=dict(
            size=16,
            color=target,
            colorscale='Viridis',
            showscale=True
        )
    )
    layout = go.Layout(
        title=layout_labels[2],
        xaxis=dict(title=layout_labels[0]),
        yaxis=dict(title=layout_labels[1]))
    data = [trace1]
    fig = Figure(data=data, layout=layout)
    # plot_url = py.plot(fig)
    py.image.save_as(fig, filename=layout_labels[1] + '_Density.png')

def create_density_graph(dataset, features_header, target_header):
    """
    Create density graph for each feature with the target
    :param dataset:
    :param features_header:
    :param target_header:
    :return:
    """
    for feature_header in features_header:
        print("Creating density graph for feature:: {} ".format(feature_header))
        layout_headers = ["Number of Observation", feature_header + " & " + target_header,
                          feature_header + " & " + target_header + " Density Graph"]
        scatter_with_color_dimension_graph(dataset[feature_header], dataset[target_header], layout_headers)

glass_data_headers = ["Id", "RI", "Na", "Mg", "Al", "Si", "K", "Ca", "Ba",
"Fe", "glass-type"]
glass_data = pd.read_csv(DATASET_PATH, names=glass_data_headers)

print ("Number of observations :: ", len(glass_data.index))


print ("Number of columns :: ", len(glass_data.columns))
print ("Headers :: ", glass_data.columns.values)
print ("Target :: ", glass_data[glass_data_headers[-1]])
print ("glass_data_RI :: ", list(glass_data["RI"][:10]))
print ("glass_data_target :: ", np.array([1, 1, 1, 2, 2, 3, 4, 5, 6, 7]))
graph_labels = ["Number of Observations", "RI & Glass Type", "Sample RI -
Glass Type Density Graph"]
# scatter_with_color_dimension_graph(list(glass_data["RI"][:10]),
# np.array([1, 1, 1, 2, 2, 3, 4, 5, 6,
7]), graph_labels)

# print "glass_data_headers[:-1] :: ", glass_data_headers[:-1]


# print "glass_data_headers[-1] :: ", glass_data_headers[-1]
# create_density_graph(glass_data, glass_data_headers[1:-1],
glass_data_headers[-1])

train_x, test_x, train_y, test_y = train_test_split(glass_data[glass_data_headers[:-1]],
                                                     glass_data[glass_data_headers[-1]],
                                                     train_size=0.7)
# Train multi-classification model with logistic regression
lr = linear_model.LogisticRegression()
lr.fit(train_x, train_y)

from sklearn.metrics import classification_report, confusion_matrix


# Train multinomial logistic regression model
mul_lr = linear_model.LogisticRegression(multi_class='multinomial',
solver='newton-cg').fit(train_x, train_y)
pred = mul_lr.predict(test_x)
print(confusion_matrix(test_y, pred))
print ("Logistic regression Train Accuracy :: ",
metrics.accuracy_score(train_y, lr.predict(train_x)))
print ("Logistic regression Test Accuracy :: ", metrics.accuracy_score(test_y,
lr.predict(test_x)))
print ("Multinomial Logistic regression Train Accuracy :: ",
metrics.accuracy_score(train_y, mul_lr.predict(train_x)))
print ("Multinomial Logistic regression Test Accuracy :: ",
metrics.accuracy_score(test_y, mul_lr.predict(test_x)))
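
'''
Optional sketch: a per-class view of the multinomial model, using the
classification_report already imported above.
'''

print(classification_report(test_y, pred))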

glass_data = pd.read_csv(DATASET_PATH, names=glass_data_headers)


glass_data_headers = ["Id", "RI", "Na", "Mg", "Al", "Si", "K", "Ca", "Ba",
"Fe", "glass-type"]
create_density_graph(glass_data, glass_data_headers[1:-1],
glass_data_headers[-1])
b. Decision trees and random forest classification

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

'''
Get the Data
'''

df = pd.read_csv('C:\\Urbino_MachineLearning\\3. Classification\\DecisionTrees&RandomForest\\kyphosis.csv')
df.head()

'''
EDA
We'll just check out a simple pairplot for this small dataset.
'''

sns.pairplot(df,hue='Kyphosis',palette='Set1')

'''
Train Test Split
Let's split up the data into a training set and a test set!
'''

from sklearn.model_selection import train_test_split


X = df.drop('Kyphosis',axis=1)
y = df['Kyphosis']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

'''
Decision Trees
We'll start just by training a single decision tree.
'''

from sklearn.tree import DecisionTreeClassifier


dtree = DecisionTreeClassifier()
dtree.fit(X_train,y_train)
# Output (classifier with default parameters):
# DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
#                        max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
#                        min_samples_split=2, min_weight_fraction_leaf=0.0,
#                        presort=False, random_state=None, splitter='best')

'''
Prediction and Evaluation
Let's evaluate our decision tree.
'''

predictions = dtree.predict(X_test)

from sklearn.metrics import classification_report,confusion_matrix

print(classification_report(y_test,predictions))

print(confusion_matrix(y_test,predictions))

'''
Random Forests
Now let's compare the decision tree model to a random forest.
'''

from sklearn.ensemble import RandomForestClassifier


rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X_train, y_train)

# Output:
# RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
#                        max_depth=None, max_features='auto', max_leaf_nodes=None,
#                        min_samples_leaf=1, min_samples_split=2,
#                        min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
#                        oob_score=False, random_state=None, verbose=0,
#                        warm_start=False)

rfc_pred = rfc.predict(X_test)
print(confusion_matrix(y_test,rfc_pred))
print(classification_report(y_test,rfc_pred))

'''
For this project we will be exploring publicly available data from
LendingClub.com.
Lending Club connects people who need money (borrowers) with people who have
money (investors).
Hopefully, as an investor you would want to invest in people
who showed a profile of having a high probability of paying you back.
We will try to create a model that will help predict this.
Lending club had a very interesting year in 2016,
so let's check out some of their data and keep the context in mind.
This data is from before they even went public.

We will use lending data from 2007-2010 and try to classify and predict
whether or not the borrower paid back their loan in full.
It's recommended you use the csv provided as it has been cleaned of NA values.

Here are what the columns represent:


- credit.policy: 1 if the customer meets the credit underwriting criteria of
LendingClub.com,
and 0 otherwise.
- purpose: The purpose of the loan (takes values "credit_card",
"debt_consolidation", "educational", "major_purchase", "small_business", and
"all_other").
- int.rate: The interest rate of the loan, as a proportion (a rate of 11%
would be stored as 0.11).
Borrowers judged by LendingClub.com to be more risky are assigned higher
interest rates.
-installment: The monthly installments owed by the borrower if the loan is
funded.
-log.annual.inc: The natural log of the self-reported annual income of the
borrower.
-dti: The debt-to-income ratio of the borrower (amount of debt divided by
annual income).
-fico: The FICO credit score of the borrower.
-days.with.cr.line: The number of days the borrower has had a credit line.
-revol.bal: The borrower's revolving balance (amount unpaid at the end of the
credit card billing cycle).
-revol.util: The borrower's revolving line utilization rate (the amount of the
credit line used relative to total credit available).
-inq.last.6mths: The borrower's number of inquiries by creditors in the last 6
months.
-delinq.2yrs: The number of times the borrower had been 30+ days past due on a
payment in the past 2 years.
-pub.rec: The borrower's number of derogatory public records (bankruptcy
filings, tax liens, or judgments).
'''
'''
Import Libraries
Import the usual libraries for pandas and plotting. You can import sklearn
later on.
'''

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
'''
Get the Data
'''

'''
Use pandas to read loan_data.csv as a dataframe called loans.
'''

loans = pd.read_csv('C:\\Urbino_MachineLearning\\3. Classification\\DecisionTrees&RandomForest\\loan_data.csv')

'''
Check out the info(), head(), and describe() methods on loans.
'''

loans.info()
loans.describe()
loans.head()

'''
Exploratory Data Analysis
'''

'''
Let's do some data visualization! We'll use seaborn and pandas built-in
plotting capabilities,
but feel free to use whatever library you want.
Don't worry about the colors matching, just worry about getting the main idea
of the plot.
Create a histogram of two FICO distributions on top of each other,
one for each credit.policy outcome.

'''

plt.figure(figsize=(10,6))

loans[loans['credit.policy']==1]['fico'].hist(alpha=0.5,color='blue',bins=30,label='Credit.Policy=1')
loans[loans['credit.policy']==0]['fico'].hist(alpha=0.5,color='red',bins=30,label='Credit.Policy=0')

plt.legend()
plt.xlabel('FICO')

'''
Create a similar figure, except this time select by the not.fully.paid column.
'''

plt.figure(figsize=(10,6))
loans[loans['not.fully.paid']==1]['fico'].hist(alpha=0.5,color='blue',bins=30,label='not.fully.paid=1')
loans[loans['not.fully.paid']==0]['fico'].hist(alpha=0.5,color='red',bins=30,label='not.fully.paid=0')
plt.legend()
plt.xlabel('FICO')

'''
Create a countplot using seaborn showing the counts of loans by purpose,
with the color hue defined by not.fully.paid.
'''

plt.figure(figsize=(11,7))
sns.countplot(x='purpose',hue='not.fully.paid',data=loans,palette='Set1')

'''
Let's see the trend between FICO score and interest rate. Recreate the
following jointplot.
'''

sns.jointplot(x='fico',y='int.rate',data=loans,color='purple')

'''
Create the following lmplots to see if the trend differed between
not.fully.paid
and credit.policy.
Check the documentation for lmplot()
if you can't figure out how to separate it into columns.
'''

plt.figure(figsize=(11,7))
sns.lmplot(y='int.rate',x='fico',data=loans,hue='credit.policy',
col='not.fully.paid',palette='Set1')

'''
Setting up the Data
Let's get ready to set up our data for our Random Forest Classification Model!
Check loans.info() again.
'''

loans.info()

'''
Categorical Features
'''
'''
Notice that the purpose column is categorical.
That means we need to transform it using dummy variables
so sklearn will be able to understand it.
Let's do this in one clean step using pd.get_dummies.
Let's show a way of dealing with these columns that can be expanded
to multiple categorical features if necessary.
'''

'''
Create a list of 1 element containing the string 'purpose'. Call this list
cat_feats.
'''

cat_feats = ['purpose']

'''
Now use pd.get_dummies(loans,columns=cat_feats,drop_first=True) to create a
fixed larger dataframe that has new feature columns with dummy variables.
Set this dataframe as final_data.
'''

final_data = pd.get_dummies(loans,columns=cat_feats,drop_first=True)

final_data.info()

'''
Train Test Split
'''
'''
Now its time to split our data into a training set and a testing set!
'''
'''
Use sklearn to split your data into a training set and a testing set as we've
done in the past.
'''
from sklearn.model_selection import train_test_split

X = final_data.drop('not.fully.paid',axis=1)
y = final_data['not.fully.paid']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30,
random_state=101)

'''
Training a Decision Tree Model
Let's start by training a single decision tree first!
'''
'''
Import DecisionTreeClassifier
'''

from sklearn.tree import DecisionTreeClassifier

'''
Create an instance of DecisionTreeClassifier() called dtree and fit it to the
training data.
'''

dtree = DecisionTreeClassifier()
dtree.fit(X_train,y_train)

# Output (classifier with default parameters):
# DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
#                        max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
#                        min_samples_split=2, min_weight_fraction_leaf=0.0,
#                        presort=False, random_state=None, splitter='best')

'''
Predictions and Evaluation of Decision Tree
'''

'''
Create predictions from the test set and create a classification report and a
confusion matrix.
'''

predictions = dtree.predict(X_test)
from sklearn.metrics import classification_report,confusion_matrix

print(classification_report(y_test,predictions))
print(confusion_matrix(y_test,predictions))

'''
Training the Random Forest model
'''
'''
Now its time to train our model!
'''
'''
Create an instance of the RandomForestClassifier class and fit it to our
training data from the previous step.
'''

from sklearn.ensemble import RandomForestClassifier


rfc = RandomForestClassifier(n_estimators=600)
rfc.fit(X_train,y_train)
# Output:
# RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
#                        max_depth=None, max_features='auto', max_leaf_nodes=None,
#                        min_samples_leaf=1, min_samples_split=2,
#                        min_weight_fraction_leaf=0.0, n_estimators=600, n_jobs=1,
#                        oob_score=False, random_state=None, verbose=0,
#                        warm_start=False)

'''
Predictions and Evaluation
Let's predict off the y_test values and evaluate our model.
Predict the class of not.fully.paid for the X_test data.
'''
predictions = rfc.predict(X_test)

'''
Now create a classification report from the results. Do you get anything
strange or some sort of warning?
'''

from sklearn.metrics import classification_report,confusion_matrix


print(classification_report(y_test,predictions))

'''
Show the Confusion Matrix for the predictions.
'''

print(confusion_matrix(y_test,predictions))

# Bagged Decision Trees for Classification


import pandas
from sklearn import model_selection
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-
indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age',
'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

seed = 7
kfold = model_selection.KFold(n_splits=10, random_state=seed)
cart = DecisionTreeClassifier()
num_trees = 100
model = BaggingClassifier(base_estimator=cart, n_estimators=num_trees,
random_state=seed)
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

# Random Forest Classification


import pandas
from sklearn import model_selection
from sklearn.ensemble import RandomForestClassifier
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-
indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age',
'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
seed = 7
num_trees = 100
max_features = 3
kfold = model_selection.KFold(n_splits=10, random_state=seed)
model = RandomForestClassifier(n_estimators=num_trees,
max_features=max_features)
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

# Extra Trees Classification


import pandas
from sklearn import model_selection
from sklearn.ensemble import ExtraTreesClassifier
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-
indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age',
'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
seed = 7
num_trees = 100
max_features = 7
kfold = model_selection.KFold(n_splits=10, random_state=seed)
model = ExtraTreesClassifier(n_estimators=num_trees,
max_features=max_features)
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
# AdaBoost Classification
import pandas
from sklearn import model_selection
from sklearn.ensemble import AdaBoostClassifier
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-
indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age',
'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
seed = 7
num_trees = 30
kfold = model_selection.KFold(n_splits=10, random_state=seed)
model = AdaBoostClassifier(n_estimators=num_trees, random_state=seed)
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

# Stochastic Gradient Boosting Classification


import pandas
from sklearn import model_selection
from sklearn.ensemble import GradientBoostingClassifier
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-
indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age',
'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
seed = 7
num_trees = 100
kfold = model_selection.KFold(n_splits=10, random_state=seed)
model = GradientBoostingClassifier(n_estimators=num_trees, random_state=seed)
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

# Voting Ensemble for Classification


import pandas
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-
indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age',
'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
seed = 7
kfold = model_selection.KFold(n_splits=10, random_state=seed)
# create the sub models
estimators = []
model1 = LogisticRegression()
estimators.append(('logistic', model1))
model2 = DecisionTreeClassifier()
estimators.append(('cart', model2))
model3 = SVC()
estimators.append(('svm', model3))
# create the ensemble model
ensemble = VotingClassifier(estimators)
results = model_selection.cross_val_score(ensemble, X, Y, cv=kfold)
print(results.mean())

import pandas as pd
import numpy as np
from plotnine import *
import matplotlib.pyplot as plt

from sklearn.preprocessing import LabelEncoder


from sklearn_pandas import DataFrameMapper
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

training_data = 'C:\\Urbino_MachineLearning\\3. Classification\\DecisionTrees&RandomForest\\adult-training.csv'
test_data = 'C:\\Urbino_MachineLearning\\3. Classification\\DecisionTrees&RandomForest\\adult-test.csv'

columns = ['Age', 'Workclass', 'fnlgwt', 'Education', 'EdNum', 'MaritalStatus',
           'Occupation', 'Relationship', 'Race', 'Sex', 'CapitalGain', 'CapitalLoss',
           'HoursPerWeek', 'Country', 'Income']

df_train_set = pd.read_csv(training_data, names=columns)


df_test_set = pd.read_csv(test_data, names=columns, skiprows=1)
df_train_set.drop('fnlgwt', axis=1, inplace=True)
df_test_set.drop('fnlgwt', axis=1, inplace=True)

'''
In the above code, we imported all needed modules,
loaded both test and training data as data-frames.
We also got rid of the fnlgwt column that is of no importance in our modeling
exercise.
Let us look at the first 5 rows of the training data:
'''

df_train_set.head()

'''
We also need to do some data cleanup.
First, I will be removing any special characters from all columns.
Furthermore, any space or “.” characters too will be removed from any str
data.
'''

#replace the special character to "Unknown"


for i in df_train_set.columns:
    df_train_set[i].replace(' ?', 'Unknown', inplace=True)
    df_test_set[i].replace(' ?', 'Unknown', inplace=True)

for col in df_train_set.columns:
    if df_train_set[col].dtype != 'int64':
        df_train_set[col] = df_train_set[col].apply(lambda val: val.replace(" ", ""))
        df_train_set[col] = df_train_set[col].apply(lambda val: val.replace(".", ""))
        df_test_set[col] = df_test_set[col].apply(lambda val: val.replace(" ", ""))
        df_test_set[col] = df_test_set[col].apply(lambda val: val.replace(".", ""))

'''

As you can see, there are two columns that describe education of individuals -
Education and EdNum.
I would assume both of these to be highly correlated and hence remove the
Education column.
The Country column too should not play a role in prediction of Income
and hence we would remove that as well.
'''

df_train_set.drop(["Country", "Education"], axis=1, inplace=True)


df_test_set.drop(["Country", "Education"], axis=1, inplace=True)

'''
Although the Age and EdNum columns are numeric,
they can be easily binned and be more effective.
We will bin age in bins of 10 and the number of years of education into bins of 5.
'''

colnames = list(df_train_set.columns)
colnames.remove('Age')
colnames.remove('EdNum')
colnames = ['AgeGroup', 'Education'] + colnames

labels = ["{0}-{1}".format(i, i + 9) for i in range(0, 100, 10)]


df_train_set['AgeGroup'] = pd.cut(df_train_set.Age, range(0, 101, 10),
right=False, labels=labels)
df_test_set['AgeGroup'] = pd.cut(df_test_set.Age, range(0, 101, 10),
right=False, labels=labels)

labels = ["{0}-{1}".format(i, i + 4) for i in range(0, 20, 5)]


df_train_set['Education'] = pd.cut(df_train_set.EdNum, range(0, 21, 5),
right=False, labels=labels)
df_test_set['Education'] = pd.cut(df_test_set.EdNum, range(0, 21, 5),
right=False, labels=labels)

df_train_set = df_train_set[colnames]
df_test_set = df_test_set[colnames]

'''
Now that we have cleaned the data, let us look at how balanced our data set is:
'''

df_train_set.Income.value_counts()

df_test_set.Income.value_counts()

'''
In both the training and the test data sets,
we find the <=50K class to be about 3 times larger than the >50K class,
so the data is quite imbalanced and this could be a problem.
However, for simplicity we will treat this exercise as a regular classification
problem; a quick check of the class proportions is sketched below.
'''
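
# Optional check (not in the original text): class proportions in each split.
# If the imbalance ever becomes a problem, tree-based classifiers in scikit-learn
# accept class_weight='balanced' to compensate.
print(df_train_set.Income.value_counts(normalize=True))
print(df_test_set.Income.value_counts(normalize=True))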

'''
EDA
Now, let us look at distribution and inter-dependence of different features
in the training data graphically.
'''
'''
Let us first see how Relationships and MaritalStatus features are
interrelated.
'''

(ggplot(df_train_set, aes(x = "Relationship", fill = "MaritalStatus"))


+ geom_bar(position="fill")
+ theme(axis_text_x = element_text(angle = 60, hjust = 1))
)

'''
Let us look at effect of Education
(measured in terms of bins of no. of years of education)
on Income for different Age groups.
'''

(ggplot(df_train_set, aes(x = "Education", fill = "Income"))


+ geom_bar(position="fill")
+ theme(axis_text_x = element_text(angle = 60, hjust = 1))
+ facet_wrap('~AgeGroup')
)

'''
Recently, there has been a lot of talk about the effect of gender-based bias/gap
on income.
We can look at the effect of Education and Race
for males and females separately.
'''

(ggplot(df_train_set, aes(x = "Education", fill = "Income"))


+ geom_bar(position="fill")
+ theme(axis_text_x = element_text(angle = -90, hjust = 1))
+ facet_wrap('~Sex')
)

(ggplot(df_train_set, aes(x = "Race", fill = "Income"))


+ geom_bar(position="fill")
+ theme(axis_text_x = element_text(angle = -90, hjust = 1))
+ facet_wrap('~Sex')
)

'''
Until now, we have only looked at the inter-dependence of non-numeric
features.
Let us now look at the effect of CapitalGain and CapitalLoss on income.
'''

(ggplot(df_train_set, aes(x="Income", y="CapitalGain"))


+ geom_jitter(position=position_jitter(0.1))
)
(ggplot(df_train_set, aes(x="Income", y="CapitalLoss"))
+ geom_jitter(position=position_jitter(0.1))
)

'''
Tree Classifier
Now that we understand some relationship in our data,
let us build a simple tree classifier model
using sklearn.tree.DecisionTreeClassifier.
However, in order to use this module,
we need to convert all of our non-numeric data to numeric ones.
This can be quite easily achieved using the sklearn.preprocessing.LabelEncoder
module
along with the sklearn_pandas module to apply this on pandas data-frames
directly.
'''

mapper = DataFrameMapper([
('AgeGroup', LabelEncoder()),
('Education', LabelEncoder()),
('Workclass', LabelEncoder()),
('MaritalStatus', LabelEncoder()),
('Occupation', LabelEncoder()),
('Relationship', LabelEncoder()),
('Race', LabelEncoder()),
('Sex', LabelEncoder()),
('Income', LabelEncoder())
], df_out=True, default=None)

cols = list(df_train_set.columns)
cols.remove("Income")
cols = cols[:-3] + ["Income"] + cols[-3:]

df_train = mapper.fit_transform(df_train_set.copy())
df_train.columns = cols

df_test = mapper.transform(df_test_set.copy())
df_test.columns = cols

cols.remove("Income")
x_train, y_train = df_train[cols].values, df_train["Income"].values
x_test, y_test = df_test[cols].values, df_test["Income"].values

'''
Now we have both training and testing data in the correct format to build
our first model!
'''
treeClassifier = DecisionTreeClassifier()
treeClassifier.fit(x_train, y_train)
treeClassifier.score(x_test, y_test)

'''
The simplest possible tree classifier model
with no optimization gave us an accuracy of 83.5%.
In the case of classification problems,
confusion matrix is a good way to judge the accuracy of models.
Using the following code we can plot the confusion matrix
for any of the tree-based models.
'''

import itertools
from sklearn.metrics import confusion_matrix
def plot_confusion_matrix(cm, classes, normalize=False):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    cmap = plt.cm.Blues
    title = "Confusion Matrix"
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        cm = np.around(cm, decimals=3)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

'''
Now, we can take a look at the confusion matrix of our first model:
'''

y_pred = treeClassifier.predict(x_test)
cfm = confusion_matrix(y_test, y_pred, labels=[0, 1])
plt.figure(figsize=(10,6))
plot_confusion_matrix(cfm, classes=["<=50K", ">50K"], normalize=True)

'''
We find that the majority class (<=50K Income) has an accuracy of 90.5%,
while the minority class (>50K Income) has an accuracy of only 60.8%.
Let us look at ways of tuning this simple classifier.
We can use GridSearchCV() with 5-fold cross-validation to tune various
important parameters of tree classifiers.
'''

from sklearn.model_selection import GridSearchCV


parameters = {
'max_features':(None, 9, 6),
'max_depth':(None, 24, 16),
'min_samples_split': (2, 4, 8),
'min_samples_leaf': (16, 4, 12)
}

clf = GridSearchCV(treeClassifier, parameters, cv=5, n_jobs=4)


clf.fit(x_train, y_train)
clf.best_score_, clf.score(x_test, y_test), clf.best_params_

'''
With the optimization, we find the accuracy to increase to 85.9%.
In the above, we can also look at the parameters of the best model.
Now, let us have a look at the confusion matrix of the optimized model.
'''

y_pred = clf.predict(x_test)
cfm = confusion_matrix(y_test, y_pred, labels=[0, 1])
plt.figure(figsize=(10,6))
plot_confusion_matrix(cfm, classes=["<=50K", ">50K"], normalize=True)
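
'''
Optional extension (a rough, untuned baseline, not part of the original walkthrough):
RandomForestClassifier was imported at the top of this script and can be fitted on the
same encoded data as a point of comparison with the tuned decision tree.
'''

rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
rf.fit(x_train, y_train)
print(rf.score(x_test, y_test))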

# Decision Tree Classification

# Importing the libraries


import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset


dataset = pd.read_csv('C:\\Urbino_MachineLearning\\3. Classification\\DecisionTrees&RandomForest\\DecisionTrees\\Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Fitting Decision Tree Classification to the Training set


from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)

# Predicting the Test set results


y_pred = classifier.predict(X_test)

# Making the Confusion Matrix


from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

# Visualising the Training set results


from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Decision Tree Classification (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

# Visualising the Test set results


from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Decision Tree Classification (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

# Random Forest Classification

# Importing the libraries


import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset


dataset = pd.read_csv('C:\\Urbino_MachineLearning\\3. Classification\\DecisionTrees&RandomForest\\RandomForest\\Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Fitting Random Forest Classification to the Training set


from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy',
random_state = 0)
classifier.fit(X_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Making the Confusion Matrix


from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

# Visualising the Training set results


from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Random Forest Classification (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

# Visualising the Test set results


from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Random Forest Classification (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

c. K Nearest Neighbour

# loading libraries
import pandas as pd

# define column names


#names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
#url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
# loading training data
#df = pd.read_csv(url, names=names)
df = pd.read_csv('C:\\Urbino_MachineLearning\\3. Classification\\KNearestNeighbour\\iris.csv')
df.head()

'''
scikit-learn requires that the design matrix X and target vector y be numpy
arrays
so let’s oblige. Furthermore, we need to split our data into training and test
sets.
The following code does just that.
'''

# loading libraries
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score

# create design matrix X and target vector y


X = np.array(df.iloc[:, 0:4])   # end index is exclusive
y = np.array(df['species'])     # another way of indexing a pandas df

# split into train and test


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,
random_state=42)

'''
Finally, following the above modeling pattern, we define our classifer,
in this case KNN, fit it to our training data and evaluate its accuracy.
We’ll be using an arbitrary K
but we will see later on how cross validation can be used to find its optimal
value.
'''

# loading library
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# instantiate learning model (k = 3)


knn = KNeighborsClassifier(n_neighbors=3)

# fitting the model


knn.fit(X_train, y_train)

# predict the response


pred = knn.predict(X_test)
print (accuracy_score(y_test, pred))
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))

# creating a list of candidate K values for KNN
myList = list(range(1, 50))

# keeping just the odd ones
neighbors = list(range(1, 50, 2))

# empty list that will hold cv scores


cv_scores = []

# perform 10-fold cross validation

'''
by default, cross_val_score uses K-fold cross-validation
(stratified K-fold when the estimator is a classifier).
This works by splitting the data set into K equal folds.
Say we have 3 folds (fold1, fold2, fold3); then the procedure works as follows:
use fold1 and fold2 as the training set and test performance on fold3,
use fold1 and fold3 as the training set and test performance on fold2,
use fold2 and fold3 as the training set and test performance on fold1.
So each fold is used for both training and testing.
'''

for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train, y_train, cv=10, scoring='accuracy')
    cv_scores.append(scores.mean())

# changing to misclassification error


MSE = [1 - x for x in cv_scores]
# determining best k
optimal_k = neighbors[MSE.index(min(MSE))]
print ("The optimal number of neighbors is %d" % optimal_k)

# plot misclassification error vs k


import matplotlib.pyplot as plt
plt.plot(neighbors, MSE)
plt.xlabel('Number of Neighbors K')
plt.ylabel('Misclassification Error')
plt.show()
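
'''
Optional alternative: the same search for the best K can be written with GridSearchCV
instead of the explicit loop above.
'''

from sklearn.model_selection import GridSearchCV

param_grid = {'n_neighbors': neighbors}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10, scoring='accuracy')
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)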

'''
Second Example
'''

# K-Nearest Neighbors (K-NN)

# Importing the libraries


import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset


dataset = pd.read_csv('C:\\Urbino_MachineLearning\\3. Classification\\KNearestNeighbour\\Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Fitting K-NN to the Training set


from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p =
2)
classifier.fit(X_train, y_train)

# Predicting the Test set results


y_pred = classifier.predict(X_test)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

# Visualising the Training set results


from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('K-NN (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

# Visualising the Test set results


from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:,
0].max() + 1, step = 0.01),
np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:,
1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(),
X2.ravel()]).T).reshape(X1.shape),
alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('K-NN (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
d. K-means (Clustering technique)

'''
For this project we will attempt to use KMeans Clustering to cluster
Universities into two groups,
Private and Public.

It is very important to note that
we actually have the labels for this data set,
but we will NOT use them for the KMeans clustering algorithm,
since KMeans is an unsupervised learning algorithm.

When using the KMeans algorithm under normal circumstances, it is because you
don't have labels.
In this case we will use the labels to try to get an idea of how well the
algorithm performed,
but you won't usually do this for KMeans,
so the classification report and confusion matrix at the end of this project
don't truly make sense in a real-world setting!
'''

'''
The Data
We will use a data frame with 777 observations on the following 18 variables.
- Private A factor with levels No and Yes indicating private or public
university
- Apps Number of applications received
- Accept Number of applications accepted
- Enroll Number of new students enrolled
- Top10perc Pct. new students from top 10% of H.S. class
- Top25perc Pct. new students from top 25% of H.S. class
- F.Undergrad Number of fulltime undergraduates
- P.Undergrad Number of parttime undergraduates
- Outstate Out-of-state tuition
- Room.Board Room and board costs
- Books Estimated book costs
- Personal Estimated personal spending
- PhD Pct. of faculty with Ph.D.’s
- Terminal Pct. of faculty with terminal degree
- S.F.Ratio Student/faculty ratio
- perc.alumni Pct. alumni who donate
- Expend Instructional expenditure per student
- Grad.Rate Graduation rate
'''

'''
Import Libraries
Import the libraries you usually use for data analysis.
'''

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

'''
Get the Data
Read in the College_Data file using read_csv. Figure out how to set the first
column as the index.
'''

df = pd.read_csv('D:\\Urbino_MachineLearning\\Classification\\KNearestNeighbour\\College_Data', index_col=0)

'''
Check the head of the data
'''

df.head()

''' 
Check the info() and describe() methods on the data.
'''

df.info()
df.describe()

'''
EDA
'''
'''
It's time to create some data visualizations!
'''

'''
Create a scatterplot of Grad.Rate versus Room.Board where the points are
colored by the Private column.
'''

sns.set_style('whitegrid')
sns.lmplot('Room.Board', 'Grad.Rate', data=df, hue='Private',
           palette='coolwarm', size=6, aspect=1, fit_reg=False)
'''
Create a scatterplot of F.Undergrad versus Outstate where the points are
colored by the Private column.
'''

sns.set_style('whitegrid')
sns.lmplot('Outstate', 'F.Undergrad', data=df, hue='Private',
           palette='coolwarm', size=6, aspect=1, fit_reg=False)

'''
Create a stacked histogram showing Out of State Tuition based on the Private
column. Try doing this using sns.FacetGrid. If that is too tricky, see if you
can do it just by using two instances of pandas.plot(kind='hist').
'''

sns.set_style('darkgrid')
g = sns.FacetGrid(df,hue="Private",palette='coolwarm',size=6,aspect=2)
g = g.map(plt.hist,'Outstate',bins=20,alpha=0.7)

'''
Create a similar histogram for the Grad.Rate column.
'''

sns.set_style('darkgrid')
g = sns.FacetGrid(df,hue="Private",palette='coolwarm',size=6,aspect=2)
g = g.map(plt.hist,'Grad.Rate',bins=20,alpha=0.7)

'''
Notice how there seems to be a private school with a graduation rate higher
than 100%. What is the name of that school?
'''

df[df['Grad.Rate'] > 100]

'''
Set that school's graduation rate to 100 so it makes sense. You may get a
warning (not an error) when doing this operation, so use dataframe operations
or just re-do the histogram visualization to make sure it actually went
through.
'''

df['Grad.Rate']['Cazenovia College'] = 100

df[df['Grad.Rate'] > 100]
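'''
Note (an addition to the original exercise): the chained assignment above is
what triggers the pandas SettingWithCopyWarning. An equivalent, warning-free
way to perform the same update is to use .loc with both labels at once:
'''
df.loc['Cazenovia College', 'Grad.Rate'] = 100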

    
sns.set_style('darkgrid')
g = sns.FacetGrid(df,hue="Private",palette='coolwarm',size=6,aspect=2)
g = g.map(plt.hist,'Grad.Rate',bins=20,alpha=0.7)

'''
K Means Cluster Creation
Now it is time to create the Cluster labels!
'''

'''
Import KMeans from SciKit Learn.
'''

from sklearn.cluster import KMeans

'''
Create an instance of a K Means model with 2 clusters.
'''

kmeans = KMeans(n_clusters=2)

'''
Fit the model to all the data except for the Private label.
'''

kmeans.fit(df.drop('Private',axis=1))
# Expected output of the fit call (default parameters reported by scikit-learn):
# KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=2, n_init=10,
#        n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
#        verbose=0)

'''
What are the cluster center vectors?
'''
kmeans.cluster_centers_

# Example output:
# array([[1.81323468e+03, 1.28716592e+03, 4.91044843e+02, 2.53094170e+01,
#         5.34708520e+01, 2.18854858e+03, 5.95458894e+02, 1.03957085e+04,
#         4.31136472e+03, 5.41982063e+02, 1.28033632e+03, 7.04424514e+01,
#         7.78251121e+01, 1.40997010e+01, 2.31748879e+01, 8.93204634e+03,
#         6.51195815e+01],
#        [1.03631389e+04, 6.55089815e+03, 2.56972222e+03, 4.14907407e+01,
#         7.02037037e+01, 1.30619352e+04, 2.46486111e+03, 1.07191759e+04,
#         4.64347222e+03, 5.95212963e+02, 1.71420370e+03, 8.63981481e+01,
#         9.13333333e+01, 1.40277778e+01, 2.00740741e+01, 1.41705000e+04,
#         6.75925926e+01]])
'''
Evaluation

There is no perfect way to evaluate clustering if you don't have the labels,
however since this is just an exercise,
we do have the labels,
so we take advantage of this to evaluate our clusters,
keep in mind, you usually won't have this luxury in the real world.

Create a new column for df called 'Cluster',


which is a 1 for a Private school, and a 0 for a public school.
'''

def converter(cluster):
    if cluster == 'Yes':
        return 1
    else:
        return 0

df['Cluster'] = df['Private'].apply(converter)

df.head()

'''
Create a confusion matrix and classification report to see how well the Kmeans
clustering worked
without being given any labels.
'''

from sklearn.metrics import confusion_matrix,classification_report


print(confusion_matrix(df['Cluster'],kmeans.labels_))
print(classification_report(df['Cluster'],kmeans.labels_))

e. Naïve Bayes

# -*- coding: utf-8 -*-


"""
Created on Thu Aug 30 10:27:46 2018

@author: Dragos
"""

import pandas as pd

df = pd.read_table('C:\\Urbino_MachineLearning\\3. Classification\\NaiveBayes\\SMSSpamCollection',
                   sep='\t',
                   header=None,
                   names=['label', 'message'])
'''
Pre-processing
Once we have our data ready, it is time to do some preprocessing.
We will focus on removing useless variance for our task at hand.
First, we have to convert the labels from strings to binary values for our
classifier:
'''
df['label'] = df.label.map({'ham': 0, 'spam': 1})

'''
Second, convert all characters in the message to lower case:
'''

df['message'] = df.message.map(lambda x: x.lower())

'''
Third, remove any punctuation:
'''

df['message'] = df.message.str.replace(r'[^\w\s]', '')

'''
Fourth, tokenize the messages into single words using nltk.
First, we have to import and download the tokenizer from the console:
'''

import nltk
nltk.download()

'''
An installation window will appear.
Go to the "Models" tab and select "punkt" from the "Identifier" column.
Then click "Download" and it will install the necessary files.
Then it should work! Now we can apply the tokenization:
'''

df['message'] = df['message'].apply(nltk.word_tokenize)

'''
Fifth, we will perform some word stemming.
The idea of stemming is to normalize our text so that all variations of a
word carry the same meaning,
regardless of the tense. One of the most popular stemming algorithms is the
Porter Stemmer:
'''
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

df['message'] = df['message'].apply(lambda x: [stemmer.stem(y) for y in x])

'''
Finally, we will transform the data into occurrences,
which will be the features that we will feed into our model:
'''

from sklearn.feature_extraction.text import CountVectorizer

# This converts the list of words into space-separated strings


df['message'] = df['message'].apply(lambda x: ' '.join(x))
count_vect = CountVectorizer()
counts = count_vect.fit_transform(df['message'])

'''
We could leave it as the simple word count per message,
but it is better to use Term Frequency - Inverse Document Frequency,
better known as tf-idf:
'''
'''
In information retrieval, tf–idf or TFIDF, short for term frequency–inverse
document frequency,
is a numerical statistic that is intended to reflect how important a word is
to a document in a collection
or corpus.[1] It is often used as a weighting factor in searches of
information retrieval,
text mining, and user modeling.
The tf–idf value increases proportionally to the number of times a word
appears
in the document and is offset by the number of documents in the corpus that
contain the word,
which helps to adjust for the fact that some words appear more frequently in
general.
Tf–idf is one of the most popular term-weighting schemes today;
83% of text-based recommender systems in digital libraries use tf–idf.[2]
'''
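'''
A tiny illustration (an addition, not part of the original script): on a toy
corpus of three hypothetical messages, TfidfVectorizer gives a lower weight
to a word that appears in every message ("now") than to a rare word
("prize"), which is exactly the idf effect described above.
'''
from sklearn.feature_extraction.text import TfidfVectorizer
toy_corpus = ["free prize now", "free entry now", "call me now"]
toy_vect = TfidfVectorizer()
toy_tfidf = toy_vect.fit_transform(toy_corpus)
print(sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get))  # column order
print(toy_tfidf.toarray().round(2))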

from sklearn.feature_extraction.text import TfidfTransformer


transformer = TfidfTransformer().fit(counts)
counts = transformer.transform(counts)

'''
Training the Model
Now that we have performed feature extraction from our data,
it is time to build our model.
We will start by splitting our data into training and test sets:
'''

from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(counts, df['label'],
test_size=0.1, random_state=69)

'''
Then, all that we have to do is initialize the Naive Bayes Classifier and fit
the data.
For text classification problems, the Multinomial Naive Bayes Classifier is
well-suited:
'''

from sklearn.naive_bayes import MultinomialNB


model = MultinomialNB().fit(X_train, y_train)

'''
Evaluating the Model
Once we have put together our classifier, we can evaluate its performance in
the testing set:
'''
import numpy as np
predicted = model.predict(X_test)
print(np.mean(predicted == y_test))
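'''
Accuracy alone can be misleading on an imbalanced ham/spam split, so as an
addition to the original script we can also look at the confusion matrix and
the per-class precision and recall:
'''
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test, predicted))
print(classification_report(y_test, predicted))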

'''
https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/
'''

# Naive Bayes

# Importing the libraries


import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset


dataset = pd.read_csv('C:\\Urbino_MachineLearning\\3. Classification\\NaiveBayes\\Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25,
                                                     random_state = 0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Fitting Naive Bayes to the Training set


from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

# Predicting the Test set results


y_pred = classifier.predict(X_test)

# Making the Confusion Matrix


from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

# Visualising the Training set results


from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:,
0].max() + 1, step = 0.01),
np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:,
1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(),
X2.ravel()]).T).reshape(X1.shape),
alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Naive Bayes (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

# Visualising the Test set results


from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:,
0].max() + 1, step = 0.01),
np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:,
1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(),
X2.ravel()]).T).reshape(X1.shape),
alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Naive Bayes (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

f. Support vector machines

# Support Vector Machine (SVM)

# Importing the libraries


import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset


dataset = pd.read_csv('C:\\Urbino_MachineLearning\\3. Classification\\SVM\\Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25,
                                                     random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Fitting SVM to the Training set


from sklearn.svm import SVC
classifier = SVC(kernel = 'linear', random_state = 0)
classifier.fit(X_train, y_train)

# Predicting the Test set results


y_pred = classifier.predict(X_test)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

# Visualising the Training set results


from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:,
0].max() + 1, step = 0.01),
np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:,
1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(),
X2.ravel()]).T).reshape(X1.shape),
alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('SVM (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

# Visualising the Test set results


from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:,
0].max() + 1, step = 0.01),
np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:,
1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(),
X2.ravel()]).T).reshape(X1.shape),
alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('SVM (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

'''
We will be using the famous Iris flower data set.
The Iris flower data set or Fisher's Iris data set is a multivariate data set
introduced by Sir Ronald Fisher in 1936 as an example of discriminant
analysis.
The data set consists of 50 samples from each of three species of Iris (Iris
setosa, Iris virginica and Iris versicolor), so 150 total samples. Four
features were measured from each sample: the length and the width of the
sepals and petals, in centimeters.
'''

'''
Here's a picture of the three different Iris types:
'''
# The Iris Setosa
from IPython.display import Image
url = 'http://upload.wikimedia.org/wikipedia/commons/5/56/Kosaciec_szczecinkowaty_Iris_setosa.jpg'
Image(url, width=300, height=300)

# The Iris Versicolor


from IPython.display import Image
url = 'http://upload.wikimedia.org/wikipedia/commons/4/41/Iris_versicolor_3.jpg'
Image(url, width=300, height=300)

# The Iris Virginica


from IPython.display import Image
url = 'http://upload.wikimedia.org/wikipedia/commons/9/9f/Iris_virginica.jpg'
Image(url,width=300, height=300)

'''
The iris dataset contains measurements for 150 iris flowers from three
different species.
'''

'''
The three classes in the Iris dataset:
Iris-setosa (n=50)
Iris-versicolor (n=50)
Iris-virginica (n=50)
'''

'''
The four features of the Iris dataset:
sepal length in cm
sepal width in cm
petal length in cm
petal width in cm
'''

'''
Get the data
'''

'''
Use seaborn to get the iris data by using: iris = sns.load_dataset('iris')
'''

import seaborn as sns


iris = sns.load_dataset('iris')

'''
Let's visualize the data and get started!
'''
'''
Exploratory Data Analysis
'''

import pandas as pd
import matplotlib.pyplot as plt

'''
Create a pairplot of the data set. Which flower species seems to be the most
separable?
'''
# Setosa is the most separable.

sns.pairplot(iris,hue='species',palette='Dark2')

'''
Create a kde plot of sepal_length versus sepal_width for the setosa species
of flower.
'''

setosa = iris[iris['species']=='setosa']
sns.kdeplot( setosa['sepal_width'], setosa['sepal_length'], cmap="plasma",
shade=True, shade_lowest=False)

'''
Train Test Split
'''
'''
Split your data into a training set and a testing set.
'''
from sklearn.model_selection import train_test_split
X = iris.drop('species',axis=1)
y = iris['species']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

'''
Train a Model
'''
'''
Now its time to train a Support Vector Machine Classifier.
Call the SVC() model from sklearn and fit the model to the training data.
'''

from sklearn.svm import SVC


svc_model = SVC()
svc_model.fit(X_train,y_train)
# Expected output of the fit call (default parameters reported by scikit-learn):
# SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
#     decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
#     max_iter=-1, probability=False, random_state=None, shrinking=True,
#     tol=0.001, verbose=False)

'''
Model Evaluation
'''
'''
Now get predictions from the model and create a confusion matrix
and a classification report.
'''

predictions = svc_model.predict(X_test)
from sklearn.metrics import classification_report,confusion_matrix
print(confusion_matrix(y_test,predictions))
print(classification_report(y_test,predictions))

'''
You should have noticed that your model was pretty good!
Let's see if we can tune the parameters to try to get even better results
(unlikely, and you would probably be satisfied with these results
in real life because the data set is quite small),
but I just want you to practice using GridSearch.

Gridsearch Practice
'''
'''
Import GridsearchCV from SciKit Learn.
'''

from sklearn.model_selection import GridSearchCV


'''
Create a dictionary called param_grid and fill out some parameters for C and
gamma.
'''

'''
C and Gamma are the parameters for a nonlinear support vector machine (SVM)
with a Gaussian radial basis function kernel.
A standard SVM seeks to find a margin that separates all positive and negative
examples.
However, this can lead to poorly fit models
if any examples are mislabeled or extremely unusual.
To account for this, in 1995,
Cortes and Vapnik proposed the idea of a "soft margin"
SVM that allows some examples to be "ignored" or placed on the wrong side of
the margin;
this innovation often leads to a better overall fit. C is the parameter for
the soft margin
cost function, which controls the influence of each individual support vector;
this process involves trading error penalty for stability.
Gamma is the free parameter of the Gaussian radial basis function.
'''
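'''
To get a feel for why gamma matters, here is a small illustrative check (an
addition, not part of the original exercise): with the train/test split
created above, a very large gamma tends to fit the training set almost
perfectly while usually generalising worse on the test set.
'''
for demo_gamma in [0.01, 1, 100]:
    demo_svc = SVC(C=1, gamma=demo_gamma).fit(X_train, y_train)
    print("gamma =", demo_gamma,
          " train acc = %.3f" % demo_svc.score(X_train, y_train),
          " test acc = %.3f" % demo_svc.score(X_test, y_test))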

param_grid = {'C': [0.1,1, 10, 100], 'gamma': [1,0.1,0.01,0.001]}

'''
Create a GridSearchCV object and fit it to the training data.
'''

grid = GridSearchCV(SVC(),param_grid,refit=True,verbose=2)
grid.fit(X_train,y_train)

# Expected output of the fit call:
# GridSearchCV(cv=None, error_score='raise',
#        estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
#            decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
#            max_iter=-1, probability=False, random_state=None, shrinking=True,
#            tol=0.001, verbose=False),
#        fit_params={}, iid=True, n_jobs=1,
#        param_grid={'gamma': [1, 0.1, 0.01, 0.001], 'C': [0.1, 1, 10, 100]},
#        pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=2)

'''
Now take that grid model and create some predictions using the test set
and create classification reports and confusion matrices for them.
Were you able to improve?
'''
print (grid)
grid_predictions = grid.predict(X_test)
print(grid.best_score_)
print(grid.best_estimator_.C)
print(grid.best_estimator_.gamma)

print(confusion_matrix(y_test,grid_predictions))
print(classification_report(y_test,grid_predictions))

import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import datasets
from sklearn import svm
import matplotlib.pyplot as plt

sns.set_style('whitegrid')

#Linear SVM Classification

df = sns.load_dataset('iris')
df.head()

col = ['petal_length', 'petal_width', 'species']


df.loc[:, col].head()  # access a group of rows and columns by label(s) or a boolean array

df.species.unique()

col = ['petal_length', 'petal_width']


X = df.loc[:, col]

species_to_num = {'setosa': 0,
'versicolor': 1,
'virginica': 2}

df['tmp'] = df['species'].map(species_to_num)
y = df['tmp']

C = 0.001
clf = svm.SVC(kernel='linear', C=C)
#clf = svm.LinearSVC(C=C, loss='hinge')
#clf = svm.SVC(kernel='poly', degree=3, C=C)
#clf = svm.SVC(kernel='rbf', gamma=0.7, C=C)
clf.fit(X, y)

print ( clf.predict([[6, 2]]) )


Xv = X.values.reshape(-1, 1)  # reshape with (-1, 1): one column, number of rows inferred

h = 0.02
x_min, x_max = Xv.min(), Xv.max() + 1
y_min, y_max = y.min(), y.max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))

z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
z = z.reshape(xx.shape)
fig = plt.figure(figsize=(16,10))
ax = plt.contourf(xx, yy, z, cmap = 'afmhot', alpha=0.3);
plt.scatter(X.values[:, 0], X.values[:, 1], c=y, s=80,
alpha=0.9, edgecolors='g');


#Linear SVM Implementation

df = sns.load_dataset('iris')
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score
col = ['petal_length', 'petal_width']
X = df.loc[:, col]
species_to_num = {'setosa': 0,
'versicolor': 1,
'virginica': 2}
df['tmp'] = df['species'].map(species_to_num)
y = df['tmp']
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size=0.8,
                                                    random_state=0)
sc_x = StandardScaler()
X_std_train = sc_x.fit_transform(X_train)

#Polynomial Kernel
C = 1.0
clf = svm.SVC(kernel='poly', degree=3, C=C)
clf.fit(X_std_train, y_train)

res = cross_val_score(clf, X_std_train, y_train, cv=10, scoring='accuracy')


print("Average Accuracy: \t {0:.4f}".format(np.mean(res)))
print("Accuracy SD: \t\t {0:.4f}".format(np.std(res)))
y_train_pred = cross_val_predict(clf, X_std_train, y_train, cv=3)
print(confusion_matrix(y_train, y_train_pred))
print("Precision Score: \t {0:.4f}".format(precision_score(y_train, y_train_pred, average='weighted')))
print("Recall Score: \t\t {0:.4f}".format(recall_score(y_train, y_train_pred, average='weighted')))
print("F1 Score: \t\t {0:.4f}".format(f1_score(y_train, y_train_pred, average='weighted')))

y_test_pred = cross_val_predict(clf, sc_x.transform(X_test), y_test, cv=3)

print(confusion_matrix(y_test, y_test_pred))
print("Precision Score: \t {0:.4f}".format(precision_score(y_test, y_test_pred, average='weighted')))
print("Recall Score: \t\t {0:.4f}".format(recall_score(y_test, y_test_pred, average='weighted')))
print("F1 Score: \t\t {0:.4f}".format(f1_score(y_test, y_test_pred, average='weighted')))

#Gaussian Radial Basis Function (rbf)

df = sns.load_dataset('iris')
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
col = ['petal_length', 'petal_width']
X = df.loc[:, col]
species_to_num = {'setosa': 0,
'versicolor': 1,
'virginica': 2}
df['tmp'] = df['species'].map(species_to_num)
y = df['tmp']
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size=0.8,
                                                    random_state=0)
sc_x = StandardScaler()
X_std_train = sc_x.fit_transform(X_train)
C = 1.0
clf = svm.SVC(kernel='rbf', gamma=0.7, C=C)
clf.fit(X_std_train, y_train)
res = cross_val_score(clf, X_std_train, y_train, cv=10, scoring='accuracy')
print("Average Accuracy: \t {0:.4f}".format(np.mean(res)))
print("Accuracy SD: \t\t {0:.4f}".format(np.std(res)))

y_train_pred = cross_val_predict(clf, X_std_train, y_train, cv=3)

print(confusion_matrix(y_train, y_train_pred))

print("Precision Score: \t {0:.4f}".format(precision_score(y_train, y_train_pred, average='weighted')))
print("Recall Score: \t\t {0:.4f}".format(recall_score(y_train, y_train_pred, average='weighted')))
print("F1 Score: \t\t {0:.4f}".format(f1_score(y_train, y_train_pred, average='weighted')))
#grid search
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
pipeline = Pipeline([('clf', svm.SVC(kernel='rbf', C=1, gamma=0.1))])
params = {'clf__C':(0.1, 0.5, 1, 2, 5, 10, 20),
'clf__gamma':(0.001, 0.01, 0.1, 0.25, 0.5, 0.75, 1)}
svm_grid_rbf = GridSearchCV(pipeline, params, n_jobs=-1,
cv=3, verbose=1, scoring='accuracy')
svm_grid_rbf.fit(X_train, y_train)
print(svm_grid_rbf.best_score_)
best = svm_grid_rbf.best_estimator_.get_params()
for k in sorted(params.keys()):
    print('\t{0}: \t {1:.2f}'.format(k, best[k]))
y_test_pred = svm_grid_rbf.predict(X_test)
print(confusion_matrix(y_test, y_test_pred))
print("Precision Score: \t {0:.4f}".format(precision_score(y_test, y_test_pred, average='weighted')))
print("Recall Score: \t\t {0:.4f}".format(recall_score(y_test, y_test_pred, average='weighted')))
print("F1 Score: \t\t {0:.4f}".format(f1_score(y_test, y_test_pred, average='weighted')))

#support vector regression

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

import pandas as pd
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

from sklearn.datasets import load_boston


boston_data = load_boston()
df = pd.DataFrame(boston_data.data, columns=boston_data.feature_names)
df.head()

y = boston_data.target
X = df[['LSTAT']].values

svr = SVR()
svr.fit(X, y)

sort_idx = X.flatten().argsort()
plt.figure(figsize=(10,8))
plt.scatter(X[sort_idx], y[sort_idx])
plt.plot(X[sort_idx], svr.predict(X[sort_idx]), color='k')

plt.xlabel('LSTAT')
plt.ylabel('MEDV');
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.3,
random_state=42)

#linear kernel
svr = SVR(kernel='linear')
svr.fit(X_train, y_train)
y_train_pred = svr.predict(X_train)
y_test_pred = svr.predict(X_test)
print("MSE train: {0:.4f}, test: {1:.4f}".\
format(mean_squared_error(y_train, y_train_pred),
mean_squared_error(y_test, y_test_pred)))
print("R^2 train: {0:.4f}, test: {1:.4f}".\
format(r2_score(y_train, y_train_pred),
r2_score(y_test, y_test_pred)))

#polynomial kernel

svr = SVR(kernel='poly', C=1e3, degree=2)


svr.fit(X_train, y_train)
y_train_pred = svr.predict(X_train)
y_test_pred = svr.predict(X_test)
print("MSE train: {0:.4f}, test: {1:.4f}".\
format(mean_squared_error(y_train, y_train_pred),
mean_squared_error(y_test, y_test_pred)))
print("R^2 train: {0:.4f}, test: {1:.4f}".\
format(r2_score(y_train, y_train_pred),
r2_score(y_test, y_test_pred)))

#rbf kernel

svr = SVR(kernel='rbf', C=1e3, gamma=0.1)


svr.fit(X_train, y_train)
y_train_pred = svr.predict(X_train)
y_test_pred = svr.predict(X_test)
print("MSE train: {0:.4f}, test: {1:.4f}".\
format(mean_squared_error(y_train, y_train_pred),
mean_squared_error(y_test, y_test_pred)))
print("R^2 train: {0:.4f}, test: {1:.4f}".\
format(r2_score(y_train, y_train_pred),
r2_score(y_test, y_test_pred)))

g. SVM Kernel

# -*- coding: utf-8 -*-


"""
Created on Tue Aug 28 15:21:25 2018

@author: Dragos
"""

'''
A support vector machine (SVM) is a type of
supervised machine learning classification algorithm.
SVMs were initially introduced in the 1960s and were later refined in the 1990s.
However, it is only now that they are becoming extremely popular,
owing to their ability to achieve brilliant results.
SVMs are implemented in a unique way when compared to other machine learning
algorithms.
'''

'''
Implementing SVM with Scikit-Learn
'''

'''
Importing libraries
The following script imports required libraries:
'''
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

bankdata = pd.read_csv("C:\\Urbino_MachineLearning\\3. Classification\\SVMKernel\\bill_authentication.csv")
bankdata.shape
bankdata.head()

'''
Data preprocessing involves
(1) Dividing the data into attributes and labels and
(2) dividing the data into training and testing sets.
To divide the data into attributes and labels, execute the following code:
'''
X = bankdata.drop('Class', axis=1)
y = bankdata['Class']

'''
In the first line of the script above,
all the columns of the bankdata dataframe are being stored in the X variable
except the "Class" column, which is the label column.
The drop() method drops this column.
In the second line,
only the class column is being stored in the y variable.
At this point of time X variable contains attributes
while y variable contains corresponding labels.
'''

from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)

'''
Since we are going to perform a classification task,
we will use the support vector classifier class,
which is written as SVC in the Scikit-Learn's svm library.
This class takes one parameter, which is the kernel type.
This is very important.
In the case of a simple SVM we simply set this parameter as "linear"
since simple SVMs can only classify linearly separable data
'''

'''
The fit method of SVC class is called to train the algorithm
on the training data, which is passed as a parameter to the fit method.
Execute the following code to train the algorithm:
'''
from sklearn.svm import SVC
svclassifier = SVC(kernel='linear')
svclassifier.fit(X_train, y_train)

'''
To make predictions, the predict method of the SVC class is used. Take a look
at the following code:
'''

y_pred = svclassifier.predict(X_test)

'''
Evaluating the Algorithm
Confusion matrix, precision, recall, and F1 measures are the most commonly
used metrics for classification tasks. Scikit-Learn's metrics library contains
the classification_report and confusion_matrix methods, which can be readily
used to find out the values for these important metrics.
'''

from sklearn.metrics import classification_report, confusion_matrix


print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))

'''
Kernel SVM
we saw how the simple SVM algorithm can be used to find decision boundary
for linearly separable data.
However, in the case of non-linearly separable data,
a straight line
cannot be used as a decision boundary.
'''
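'''
A quick illustration of this point (an addition to the original text): on a
synthetic, non-linearly separable data set such as two concentric circles, a
linear SVM scores close to chance while an RBF kernel separates the two
classes almost perfectly.
'''
from sklearn.datasets import make_circles
from sklearn.svm import SVC
X_circles, y_circles = make_circles(n_samples=200, factor=0.3, noise=0.05,
                                    random_state=0)
for demo_kernel in ['linear', 'rbf']:
    demo_clf = SVC(kernel=demo_kernel).fit(X_circles, y_circles)
    print(demo_kernel, "training accuracy:",
          demo_clf.score(X_circles, y_circles))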

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

url = "https://archive.ics.uci.edu/ml/machine-learning-
databases/iris/iris.data"

# Assign colum names to the dataset


#colnames = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
#pd.read_csv('pandas_dataframe_importing_csv/example.csv')
# Read dataset to pandas dataframe
#irisdata = pd.read_csv(url, names=colnames)
irisdata = pd.read_csv('C:\\Urbino_MachineLearning\\3. Classification\\SVMKernel\\iris.csv')
X = irisdata.drop('species', axis=1)
y = irisdata['species']

from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)

'''
Training the Algorithm
To train the kernel SVM,
we use the same SVC class of the Scikit-Learn's svm library.
The difference lies in the value for the kernel parameter of the SVC class.
In the case of the simple SVM we used "linear" as the value for the kernel
parameter.
'''

'''
We will implement polynomial, Gaussian, and sigmoid kernels to see which one
works better for our problem.
'''

'''
1. Polynomial Kernel
In the case of polynomial kernel, you also have to pass a value for the degree
parameter of the SVC class.
This basically is the degree of the polynomial.
Take a look at how we can use a polynomial kernel to implement kernel SVM:
'''
from sklearn.svm import SVC
svclassifier = SVC(kernel='poly', degree=8)
svclassifier.fit(X_train, y_train)

'''
Making Predictions
Now once we have trained the algorithm, the next step is to make predictions
on the test data.
'''

y_pred = svclassifier.predict(X_test)

'''
Evaluating the Algorithm
As usual, the final step of any machine learning algorithm is to make
evaluations for polynomial kernel. Execute the following script:
'''

from sklearn.metrics import classification_report, confusion_matrix


print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

'''
Now let's repeat the same steps for Gaussian and sigmoid kernels.
'''

'''
2. Gaussian Kernel
Take a look at how we can use the Gaussian (RBF) kernel to implement kernel SVM:
'''
'''
To use Gaussian kernel, you have to specify 'rbf' as value for the Kernel
parameter of the SVC class.
'''

from sklearn.svm import SVC


svclassifier = SVC(kernel='rbf')
svclassifier.fit(X_train, y_train)

'''
Prediction and Evaluation
'''
y_pred = svclassifier.predict(X_test)
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

'''
3. Sigmoid Kernel
Finally, let's use a sigmoid kernel for implementing Kernel SVM. Take a look
at the following script:
'''

from sklearn.svm import SVC


svclassifier = SVC(kernel='sigmoid')
svclassifier.fit(X_train, y_train)

'''
To use the sigmoid kernel, you have to specify 'sigmoid' as value for the
kernel parameter of the SVC class.
'''
'''
Prediction and Evaluation
'''

y_pred = svclassifier.predict(X_test)
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

'''
Comparison of Kernel Performance
If we compare the performance of the different types of kernels
we can see that the sigmoid kernel performs the worst.
This is because the sigmoid kernel is essentially suited to binary
classification problems, whereas in our case we have three output classes.
Comparing the Gaussian kernel and the polynomial kernel,
we can see that Gaussian kernel achieved a perfect 100% prediction rate
while polynomial kernel misclassified one instance.
Therefore the Gaussian kernel performed slightly better.
However, there is no hard and fast rule as to which kernel performs
best in every scenario. It is all about testing all the kernels and
selecting the one with the best results on your test dataset.
'''
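'''
As a compact way of doing that comparison (an addition to the original
script), the three kernels can be evaluated in a single loop on the same
train/test split used above:
'''
from sklearn.metrics import accuracy_score
for demo_kernel in ['poly', 'rbf', 'sigmoid']:
    if demo_kernel == 'poly':
        demo_clf = SVC(kernel='poly', degree=8)
    else:
        demo_clf = SVC(kernel=demo_kernel)
    demo_clf.fit(X_train, y_train)
    print(demo_kernel, "accuracy:",
          accuracy_score(y_test, demo_clf.predict(X_test)))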

# Kernel SVM

# Importing the libraries


import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset


dataset = pd.read_csv('C:\\Urbino_MachineLearning\\3. Classification\\SVMKernel\\Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25,
                                                     random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Fitting Kernel SVM to the Training set


from sklearn.svm import SVC
classifier = SVC(kernel = 'rbf', random_state = 0)
classifier.fit(X_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Making the Confusion Matrix


from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

# Visualising the Training set results


from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:,
0].max() + 1, step = 0.01),
np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:,
1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(),
X2.ravel()]).T).reshape(X1.shape),
alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Kernel SVM (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

# Visualising the Test set results


from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:,
0].max() + 1, step = 0.01),
np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:,
1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(),
X2.ravel()]).T).reshape(X1.shape),
alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Kernel SVM (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

5. Recommender system

'''
import Libraries
'''

import numpy as np
import pandas as pd

'''
Get the Data
'''

column_names = ['user_id', 'item_id', 'rating', 'timestamp']


df = pd.read_csv('C:\\Urbino_MachineLearning\\4. Recommender System\\u.data',
sep='\t', names=column_names)
df.head()

'''
Now let's get the movie titles:
'''

movie_titles = pd.read_csv("C:\\Urbino_MachineLearning\\4. Recommender System\\Movie_Id_Titles")
movie_titles.head()

'''
We can merge them together:
'''

df = pd.merge(df,movie_titles,on='item_id')
df.head()

'''
Let's explore the data a bit and get a look at some of the best rated movies.
Visualization Imports
'''

import matplotlib.pyplot as plt


import seaborn as sns
sns.set_style('white')

'''
Let's create a ratings dataframe with average rating and number of ratings:
'''

df.groupby('title')['rating'].mean().sort_values(ascending=False).head()
df.groupby('title')['rating'].count().sort_values(ascending=False).head()
ratings = pd.DataFrame(df.groupby('title')['rating'].mean())
ratings.head()

'''
Now set the number of ratings column:
'''

ratings['num of ratings'] = pd.DataFrame(df.groupby('title')['rating'].count())
ratings.head()

'''
Now a few histograms:
'''

plt.figure(figsize=(10,4))
ratings['num of ratings'].hist(bins=70)

plt.figure(figsize=(10,4))
ratings['rating'].hist(bins=70)

sns.jointplot(x='rating',y='num of ratings',data=ratings,alpha=0.5)

'''
Okay! Now that we have a general idea of what the data looks like,
let's move on to creating a simple recommendation system:
Recommending Similar Movies
Now let's create a matrix that has the user ids on one axis and the movie
title on another axis.
Each cell will then consist of the rating the user gave to that movie.
Note there will be a lot of NaN values,
because most people have not seen most of the movies.
'''

moviemat = df.pivot_table(index='user_id',columns='title',values='rating')
moviemat.head()

'''
Most rated movie:
'''

ratings.sort_values('num of ratings',ascending=False).head(10)

'''
Let's choose two movies: Star Wars, a sci-fi movie, and Liar Liar, a comedy.
'''

ratings.head()

'''
Now let's grab the user ratings for those two movies:
'''

starwars_user_ratings = moviemat['Star Wars (1977)']


liarliar_user_ratings = moviemat['Liar Liar (1997)']
starwars_user_ratings.head()

'''
We can then use the corrwith() method to get correlations between two pandas
Series:
'''

similar_to_starwars = moviemat.corrwith(starwars_user_ratings)
similar_to_liarliar = moviemat.corrwith(liarliar_user_ratings)

'''
Let's clean this by removing NaN values and using a DataFrame instead of a
series:
'''

corr_starwars = pd.DataFrame(similar_to_starwars,columns=['Correlation'])
corr_starwars.dropna(inplace=True)
corr_starwars.head()

'''
Now if we sort the dataframe by correlation, we should get the most similar
movies,
however note that we get some results that don't really make sense.
This is because there are a lot of movies only watched once by users who also
watched star wars
(it was the most popular movie).
'''

corr_starwars.sort_values('Correlation',ascending=False).head(10)

'''
Let's fix this by filtering out movies that have less than 100 reviews
(this value was chosen based off the histogram from earlier).
'''

corr_starwars = corr_starwars.join(ratings['num of ratings'])


corr_starwars.head()
'''
Now sort the values and notice how the titles make a lot more sense:
'''

corr_starwars[corr_starwars['num of ratings']>100].sort_values('Correlation',ascending=False).head()

'''
Now the same for the comedy Liar Liar:
'''

corr_liarliar = pd.DataFrame(similar_to_liarliar,columns=['Correlation'])
corr_liarliar.dropna(inplace=True)
corr_liarliar = corr_liarliar.join(ratings['num of ratings'])
corr_liarliar[corr_liarliar['num of ratings']>100].sort_values('Correlation',ascending=False).head()

# -*- coding: utf-8 -*-


"""
Created on Wed Sep 19 09:57:27 2018

@author: Dragos
"""

import numpy as np
import pandas as pd

column_names = ['user_id', 'item_id', 'rating', 'timestamp']


df = pd.read_csv('C:\\Urbino_MachineLearning\\4. Recommender System\\u.data',
sep='\t', names=column_names)

movie_titles = pd.read_csv("C:\\Urbino_MachineLearning\\4. Recommender System\\Movie_Id_Titles")
movie_titles.head()

df = pd.merge(df,movie_titles,on='item_id')
df.head()

n_users = df.user_id.nunique()
n_items = df.item_id.nunique()

print('Num. of Users: '+ str(n_users))


print('Num of Movies: '+str(n_items))

from sklearn.model_selection import train_test_split


train_data, test_data = train_test_split(df, test_size=0.25)
'''
Memory-Based Collaborative Filtering
Memory-Based Collaborative Filtering approaches can be divided into two main
sections:
user-item filtering and item-item filtering.
A user-item filtering will take a particular user, find users that are similar
to that user based
on similarity of ratings, and recommend items that those similar users liked.
In contrast, item-item filtering will take an item, find users who liked that
item,
and find other items that those users or similar users also liked. It takes
items and outputs other
items as recommendations.
Item-Item Collaborative Filtering: “Users who liked this item also liked …”
User-Item Collaborative Filtering: “Users who are similar to you also liked …”
'''
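'''
A toy illustration of the idea (an addition to the original text): with a
small hypothetical user-item ratings matrix, item-item similarity is simply
the cosine similarity between the item columns.
'''
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
toy_ratings = np.array([[5, 4, 0],
                        [4, 5, 1],
                        [0, 1, 5]])   # rows = users, columns = items
print(cosine_similarity(toy_ratings.T).round(2))  # 3 x 3 item-item similarity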

'''
In both cases, you create a user-item matrix which is built from the entire
dataset.
'''

#Create two user-item matrices, one for training and another for testing
train_data_matrix = np.zeros((n_users, n_items))
for line in train_data.itertuples():
    train_data_matrix[line[1]-1, line[2]-1] = line[3]

test_data_matrix = np.zeros((n_users, n_items))


for line in test_data.itertuples():
    test_data_matrix[line[1]-1, line[2]-1] = line[3]

'''
You can use the pairwise_distances function from sklearn to calculate the
cosine similarity.
Note, the output will range from 0 to 1 since the ratings are all positive.
'''
from sklearn.metrics.pairwise import pairwise_distances
user_similarity = pairwise_distances(train_data_matrix, metric='cosine')
item_similarity = pairwise_distances(train_data_matrix.T, metric='cosine')

'''
predictions
'''

'''
ISSUE:
Suppose user k gives 4 stars to his favourite movies and 3 stars to all other
good movies.
Suppose now that another user t rates movies that he/she likes with 5 stars,
and the movies he/she fell asleep over with 3 stars.
These two users could have a very similar taste but treat the rating system
differently.
'''

def predict(ratings, similarity, type='user'):
    if type == 'user':
        mean_user_rating = ratings.mean(axis=1)
        # use np.newaxis so that mean_user_rating has the same format as ratings
        ratings_diff = (ratings - mean_user_rating[:, np.newaxis])
        pred = mean_user_rating[:, np.newaxis] + \
            similarity.dot(ratings_diff) / np.array([np.abs(similarity).sum(axis=1)]).T
    elif type == 'item':
        pred = ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])
    return pred

item_prediction = predict(train_data_matrix, item_similarity, type='item')


user_prediction = predict(train_data_matrix, user_similarity, type='user')

from sklearn.metrics import mean_squared_error


from math import sqrt
def rmse(prediction, ground_truth):
    prediction = prediction[ground_truth.nonzero()].flatten()
    ground_truth = ground_truth[ground_truth.nonzero()].flatten()
    return sqrt(mean_squared_error(prediction, ground_truth))

print('User-based CF RMSE: ' + str(rmse(user_prediction, test_data_matrix)))


print('Item-based CF RMSE: ' + str(rmse(item_prediction, test_data_matrix)))

'''
Memory-based algorithms are easy to implement and produce reasonable
prediction quality.
The drawback of memory-based CF is that it doesn't scale to real-world
scenarios
and doesn't address the well-known cold-start problem,
that is, when a new user or a new item enters the system.
Model-based CF methods are scalable and can deal with a higher sparsity level
than memory-based models,
but they also suffer when new users or items that don't have any ratings
enter the system.
'''
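'''
For completeness (an addition, not part of the original script), one common
model-based CF approach is matrix factorization via a truncated SVD of the
user-item matrix; the reconstructed low-rank matrix is then used as the
prediction:
'''
from scipy.sparse.linalg import svds
u, s, vt = svds(train_data_matrix, k=20)
svd_prediction = np.dot(np.dot(u, np.diag(s)), vt)
print('SVD-based CF RMSE: ' + str(rmse(svd_prediction, test_data_matrix)))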
6. Feature Importance

# -*- coding: utf-8 -*-


"""
Created on Tue Sep 4 14:10:35 2018

@author: Dragos
"""

'''
Following is an example of features selection for the linear regression.
It is based on the Advertising Dataset,
taken from the book Introduction to Statistical Learning by Hastie,
Witten, Tibhirani, James.
The dataset contains statistics about the sales of a product in 200 different
markets,
together with advertising budgets in each of these markets for different media
channels:
TV, radio and newspaper.
Imagine being the person responsible for marketing: you need to prepare a new
advertising plan
for next year.
You may be interested in answering questions such as:
which media contribute to sales?
Do all three media (TV, radio, and newspaper) contribute to sales,
or do just one or two of the media contribute?
'''

import numpy as np # linear algebra


import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
ad = pd.read_csv("C:\\Urbino_MachineLearning\\6. FeatureImportance\\Advertising.csv", index_col=0)

new_ad = ad[['TV','Radio', 'Newspaper']]


sales = ad[['Sales']]

ad.info()
ad.describe()
ad.head()

import matplotlib.pyplot as plt

plt.scatter(ad.TV, ad.Sales, color='blue', label="TV")


plt.scatter(ad.Radio, ad.Sales, color='green', label='Radio')
plt.scatter(ad.Newspaper, ad.Sales, color='red', label='Newspaper')
plt.legend(loc="lower right")
plt.title("Sales vs. Advertising")
plt.xlabel("Advertising [1000 $]")
plt.ylabel("Sales [Thousands of units]")
plt.grid()
plt.show()

ad.corr()

plt.imshow(ad.corr(), cmap=plt.cm.Blues, interpolation='nearest')


plt.colorbar()
tick_marks = [i for i in range(len(ad.columns))]
plt.xticks(tick_marks, ad.columns, rotation='vertical')
plt.yticks(tick_marks, ad.columns)

#Is there a relationship between sales and advertising?


'''
The multiple linear regression model takes the form:
Sales = β0 + β1*TV + β2*Radio + β3*Newspaper + ε,
where the betas are the regression coefficients we want to find
and epsilon is the error that we want to minimise.
'''

import statsmodels.formula.api as sm
modelAll = sm.ols('Sales ~ TV + Radio + Newspaper', ad).fit()
modelAll.params

'''
We interpret these results as follows: for a given amount of TV
and newspaper advertising,
spending an additional 1000 dollars on radio advertising leads
to an increase in sales by approximately 189 units.
In contrast, the coefficient for newspaper represents the average effect
(negligible) of increasing newspaper
spending by 1000 dollars while holding TV and radio fixed.
'''

'''
An F statistic is a value you get when you run an ANOVA
test or a regression analysis to find out if the means
between two populations are significantly different.
It’s similar to a T statistic from a T-test:
a T-test will tell you if a single variable is statistically significant,
and an F-test will tell you if a group of variables are jointly significant.
'''

# we need first to calculate the Residual Sum of Squares (RSS)


y_pred = modelAll.predict(ad)
import numpy as np
RSS = np.sum((y_pred - ad.Sales)**2)
'''
Now we need the Total Sum of Squares (TSS): the total variance in the response
Y,
and can be thought of as the amount of variability inherent in the response
before the regression is performed.
The distance from any point in a collection of data, to the mean of the data,
is the deviation.
'''

y_mean = np.mean(ad.Sales) # mean of sales


TSS = np.sum((ad.Sales - y_mean)**2)

'''
The F-statistic is the ratio between (TSS-RSS)/p and RSS/(n-p-1):
'''

p=3 # we have three predictors: TV, Radio and Newspaper


n=200 # we have 200 data points (input samples)

F = ((TSS-RSS)/p) / (RSS/(n-p-1))

'''
When there is no relationship between the response and predictors,
one would expect the F-statistic to take on a value close to 1.
On the other hand, if Ha (the alternative hypothesis) is true, then we expect F to be greater than 1.
In this case, F is far larger than 1: at least one of the three advertising
media must be related to sales.
BUT WHICH ONE??
'''

modelAll.summary()

'''
The R2 statistic records the percentage of variability in the response that
is explained by the predictors.
The predictors explain almost 90% of the variance in sales.
One thing to note is that R-squared will always increase when more variables
are added to the model, even if those variables are only weakly associated
with the response.
Therefore an adjusted R-squared is provided,
which is R-squared adjusted by the number of predictors.

'''
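'''
As a quick check (an addition to the original text), the adjusted R-squared
can be recomputed by hand from the ordinary R-squared, the number of samples
n and the number of predictors p, and compared with the value statsmodels
reports:
'''
adj_r2 = 1 - (1 - modelAll.rsquared) * (n - 1) / (n - p - 1)
print(adj_r2, modelAll.rsquared_adj)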

'''
Another thing to note is that the summary table provides also a t-statistic
and a p-value for each single feature. These provide information about
whether
each individual predictor is related to the response (high t-statistic or low
p-value).
But be careful looking only at these individual p-values instead of looking
at the overall F-statistic.
It seems likely that if any one of the p-values for the individual features is
very small,
then at least one of the predictors is related to the response.
However, this logic is flawed, especially when you have many predictors;
statistically about 5 % of the p-values will be below 0.05 by chance
(this is the effect infamously leveraged by the so-called p-hacking).

The F-statistic does not suffer from this problem because it adjusts
for the number of predictors.
'''
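'''
A small simulation (an addition to the original text) makes this concrete:
regressing Sales on purely random, unrelated predictors still produces a
p-value below 0.05 roughly 5% of the time, just by chance.
'''
np.random.seed(0)
random_pvalues = []
for _ in range(200):
    noise_df = pd.DataFrame({'Sales': ad.Sales.values,
                             'Noise': np.random.randn(len(ad))})
    random_pvalues.append(sm.ols('Sales ~ Noise', noise_df).fit().pvalues['Noise'])
print("fraction below 0.05:", np.mean(np.array(random_pvalues) < 0.05))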

def evaluateModel(model):
    print("RSS = ", ((ad.Sales - model.predict())**2).sum())
    print("R2 = ", model.rsquared)

#WHICH MEDIA CONTRIBUTE TO SALES???


'''
Ideally, we could perform the variable selection by trying out a lot of
different models,
each containing a different subset of the features.
We can then select the best model out of all of the models
that we have considered (for example, the model with the smallest RSS and the
biggest R-squared).
Other commonly used metrics are the Akaike Information Criterion (AIC),
the Bayesian Information Criterion (BIC), and the adjusted R-squared.
All of them are visible in the model summary.
'''

#Forward selection vs backward selection

modelTV = sm.ols('Sales ~ TV', ad).fit()


modelTV.summary().tables[1]
evaluateModel(modelTV)

modelRadio = sm.ols('Sales ~ Radio', ad).fit()


modelRadio.summary().tables[1]
evaluateModel(modelRadio)

modelPaper = sm.ols('Sales ~ Newspaper', ad).fit()


modelPaper.summary().tables[1]
evaluateModel(modelPaper)
'''
The lowest RSS and the highest R2 are for the TV medium.
Now we have a best model M1 which contains TV advertising.
We then add to this M1 model
the variable that results in the lowest RSS for the new two-variable model.
This approach is continued until some stopping rule is satisfied.
'''

modelTVRadio = sm.ols('Sales ~ TV + Radio', ad).fit()


modelTVRadio.summary().tables[1]
evaluateModel(modelTVRadio)

modelTVPaper = sm.ols('Sales ~ TV + Newspaper', ad).fit()


modelTVPaper.summary().tables[1]
evaluateModel(modelTVPaper)

evaluateModel(modelAll)

'''
M3 is slightly better than M2
(but remember that R2 always increases when adding new variables),
so we call the approach completed and decide that
the M2 model with TV and Radio is the good compromise.
Adding the newspaper could possibly overfit on new test data.
Next year there will be no budget
for newspaper advertising, and that amount will be used for
TV and Radio instead.
'''

modelTVRadio.summary()

'''
Plotting the model
The M2 model has two variables therefore can be plotted as a plane in a 3D
chart.
'''

modelTVRadio.params

'''
The M2 model can be described by this equation:
Sales = 0.19*Radio + 0.05*TV + 2.9, which we can write (with x = Radio,
y = TV, z = Sales) as:
0.19x + 0.05y - z + 2.9 = 0
Its normal is (0.19, 0.05, -1)
and a point on the plane is (-2.9/0.19, 0, 0) = (-15.26, 0, 0)
'''

normal = np.array([0.19,0.05,-1])
point = np.array([-15.26,0,0])

# a plane is a*x + b*y +c*z + d = 0


# [a,b,c] is the normal. Thus, we have to calculate
# d and we're set

d = -np.sum(point*normal) # dot product

# create x,y
x, y = np.meshgrid(range(50), range(300))
# calculate corresponding z
z = (-normal[0]*x - normal[1]*y - d)*1./normal[2]

'''
Let's plot the actual values as red points and the model predictions as a cyan
plane:
'''

from mpl_toolkits.mplot3d import Axes3D


fig = plt.figure()
fig.suptitle('Regression: Sales ~ Radio + TV Advertising')
ax = Axes3D(fig)
ax.set_xlabel('Radio')
ax.set_ylabel('TV')
ax.set_zlabel('Sales')
ax.scatter(ad.Radio, ad.TV, ad.Sales, c='red')
ax.plot_surface(x,y,z, color='cyan', alpha=0.3)

#from sklearn.linear_model import LinearRegression


#model = LinearRegression()
#model.fit(new_ad,sales)
#LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
#result = pd.DataFrame(list(zip(model.coef_, new_ad.columns)), columns=['coefficient', 'name']).set_index('name')
#np.abs(result).sort_values(by='coefficient', ascending=False)

7. Model Selection

a. Grid search
import pandas as pd
import numpy as np

dataset = pd.read_csv("C:\\Urbino_MachineLearning\\7. Model_Selection\\wineQualityReds.csv", sep=',')
'''
Execute the following script to divide data into label and feature sets.
'''
X = dataset.iloc[:, 0:11].values
y = dataset.iloc[:, 12].values

'''
Since we are using cross validation,
we don't need to divide our data into training and test sets.
We want all of the data in the training set so that we can apply cross
validation on that.
The simplest way to do this is to set the value for the test_size parameter to
0.
This will return all the data in the training set as follows:
'''

from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0,
random_state=0)

'''
If you look at the dataset you'll notice that it is not scaled well.
For instance the "volatile acidity" and "citric acid" column have values
between 0 and 1,
while most of the rest of the columns have higher values.
Therefore, before training the algorithm, we will need to scale our data down.
Here we will use the StandardScalar class.
'''

from sklearn.preprocessing import StandardScaler


feature_scaler = StandardScaler()
train_features = feature_scaler.fit_transform(X_train)
test_features = feature_scaler.transform(X_test)

'''
The first step in the training and cross validation phase is simple.
You just have to import the algorithm class from the sklearn library as shown
below:
'''

from sklearn.ensemble import RandomForestClassifier


classifier = RandomForestClassifier(n_estimators=300, random_state=0)

'''
Next, to implement cross validation,
the cross_val_score function of the sklearn.model_selection module can be used.
cross_val_score returns the accuracy obtained on each fold.
Values for four parameters need to be passed to cross_val_score.
The first parameter is the estimator, which specifies the algorithm
that you want to use for cross validation.
The second and third parameters, X and y, contain the X_train and y_train data,
i.e. features and labels.
Finally, the number of folds is passed to the cv parameter, as shown in the
following code:
'''

from sklearn.model_selection import cross_val_score


all_accuracies = cross_val_score(estimator=classifier, X=X_train, y=y_train,
cv=5)

print(all_accuracies)

print(all_accuracies.mean())
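'''
By default cross_val_score uses the classifier's default scorer (accuracy).
Other metrics can be requested through the scoring argument; a small sketch
using the same classifier and folds (the f1_scores name is just illustrative):
'''

f1_scores = cross_val_score(estimator=classifier, X=X_train, y=y_train,
                            cv=5, scoring='f1_macro')
print(f1_scores.mean())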

#Grid Search for Parameter Selection


from sklearn.model_selection import GridSearchCV
'''
To implement the Grid Search algorithm we need to import the GridSearchCV class
from the sklearn.model_selection module.

The first step you need to perform is to create a dictionary of all the
parameters
and their corresponding set of values that you want to test for best
performance.
The name of the dictionary items corresponds to the parameter name
and the value corresponds to the list of values for the parameter.
Let's create a dictionary of parameters and their corresponding values
for our Random Forest algorithm. Details of all the parameters for the random
forest algorithm
are available in the Scikit-Learn docs.
'''
grid_param = {
'n_estimators': [100, 300, 500, 800, 1000],
'criterion': ['gini', 'entropy'],
'bootstrap': [True, False]
}

'''
Here we create grid_param dictionary with three parameters n_estimators,
criterion, and bootstrap. The parameter values that we want to try out are
passed in the list. For instance, in the above script we want to find which
value (out of 100, 300, 500, 800, and 1000) provides the highest accuracy.
Similarly, we want to find which value results in the highest performance for
the criterion parameter: "gini" or "entropy"? The Grid Search algorithm
basically tries all possible combinations of parameter values and returns the
combination with the highest accuracy. For instance, in the above case the
algorithm will check 20 combinations (5 x 2 x 2 = 20).
The Grid Search algorithm can be very slow, owing to the potentially huge
number of combinations to test. Furthermore, cross validation further
increases the execution time and complexity.
'''
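'''
As a quick sanity check on the size of the search space, the number of
combinations can be counted before running the search. A minimal sketch using
scikit-learn's ParameterGrid helper:
'''

from sklearn.model_selection import ParameterGrid

print(len(ParameterGrid(grid_param)))   # 5 x 2 x 2 = 20 combinations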

'''
Once the parameter dictionary is created, the next step is to create an
instance of the GridSearchCV class. You need to pass a value for the estimator
parameter, which basically is the algorithm that you want to execute. The
param_grid parameter takes the parameter dictionary that we just created, the
scoring parameter takes the performance metric, the cv parameter corresponds
to the number of folds, which is 5 in our case, and finally the n_jobs
parameter refers to the number of CPUs that you want to use for execution. A
value of -1 for the n_jobs parameter means that all available computing power
will be used. This can be handy if you have a large amount of data.
'''

gd_sr = GridSearchCV(estimator=classifier,
param_grid=grid_param,
scoring='accuracy',
cv=5,
n_jobs=-1)

'''
Once the GridSearchCV object is initialized,
the last step is to call its fit method and pass it the training features and
labels, as shown in the following code:
'''

gd_sr.fit(X_train, y_train)

'''
This method can take some time to execute because we have 20 combinations of
parameters
and a 5-fold cross validation. Therefore the algorithm will execute a total of
100 times.
'''
'''
Once the method completes execution,
the next step is to check which parameters return the highest accuracy.
To do so, print the gd_sr.best_params_ attribute of the GridSearchCV object, as
shown below:
'''
best_parameters = gd_sr.best_params_
print(best_parameters)

'''
The result shows that the highest accuracy is achieved when the n_estimators
are 1000,
bootstrap is True and criterion is "gini".
'''
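'''
Besides best_params_, the fitted GridSearchCV object also exposes the best
cross-validated score, the refitted estimator and the full per-combination
results. A short sketch (the best_model and results_df names are illustrative):
'''

print(gd_sr.best_score_)                      # best mean cross-validated accuracy
best_model = gd_sr.best_estimator_            # estimator refitted with the best parameters
results_df = pd.DataFrame(gd_sr.cv_results_)  # one row per parameter combination
print(results_df[['params', 'mean_test_score']].head())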

# Grid Search

# Importing the libraries


import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset


dataset = pd.read_csv('C:\\Urbino_MachineLearning\\7.Model_Selection\\Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25,
random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Fitting Kernel SVM to the Training set


from sklearn.svm import SVC
classifier = SVC(kernel = 'rbf', random_state = 0)
classifier.fit(X_train, y_train)

# Predicting the Test set results


y_pred = classifier.predict(X_test)

# Making the Confusion Matrix


from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

# Applying k-Fold Cross Validation


from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train,
cv = 10)
accuracies.mean()
accuracies.std()
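# In a plain script the two calls above do not display anything;
# a small illustrative way to report the cross-validation results:
print('Mean accuracy: {:.3f} (+/- {:.3f})'.format(accuracies.mean(), accuracies.std()))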

# Applying Grid Search to find the best model and the best parameters
from sklearn.model_selection import GridSearchCV
parameters = [{'C': [1, 10, 100, 1000], 'kernel': ['linear']},
{'C': [1, 10, 100, 1000], 'kernel': ['rbf'], 'gamma': [0.1, 0.2,
0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]}]
grid_search = GridSearchCV(estimator = classifier,
param_grid = parameters,
scoring = 'accuracy',
cv = 10,
n_jobs = -1)
grid_search = grid_search.fit(X_train, y_train)
best_accuracy = grid_search.best_score_
best_parameters = grid_search.best_params_

# Visualising the Training set results


from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:,
0].max() + 1, step = 0.01),
np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:,
1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(),
X2.ravel()]).T).reshape(X1.shape),
alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Kernel SVM (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

# Visualising the Test set results


from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:,
0].max() + 1, step = 0.01),
np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:,
1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(),
X2.ravel()]).T).reshape(X1.shape),
alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Kernel SVM (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

b. K-Fold cross-validation

# k-Fold Cross Validation

# Importing the libraries


import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset


dataset = pd.read_csv('C:\\Urbino_MachineLearning\\7.Model_Selection\\Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25,
random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Fitting Kernel SVM to the Training set


from sklearn.svm import SVC
classifier = SVC(kernel = 'rbf', random_state = 0)
classifier.fit(X_train, y_train)

# Predicting the Test set results


y_pred = classifier.predict(X_test)

# Making the Confusion Matrix


from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

# Applying k-Fold Cross Validation


from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train,
cv = 10)
accuracies.mean()
accuracies.std()

# Visualising the Training set results


from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:,
0].max() + 1, step = 0.01),
np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:,
1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(),
X2.ravel()]).T).reshape(X1.shape),
alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Kernel SVM (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

# Visualising the Test set results


from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:,
0].max() + 1, step = 0.01),
np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:,
1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(),
X2.ravel()]).T).reshape(X1.shape),
alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Kernel SVM (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

8. Time Series

# quandl for financial data


import quandl
# pandas for data manipulation
import pandas as pd

# Matplotlib for plotting


import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns

plt.style.use('fivethirtyeight')
matplotlib.rcParams['axes.labelsize'] = 14
matplotlib.rcParams['xtick.labelsize'] = 12
matplotlib.rcParams['ytick.labelsize'] = 12
matplotlib.rcParams['text.color'] = 'k'

quandl.ApiConfig.api_key = 'kJsMziyRdumxT89k2jyq'

# Retrieve TSLA data from Quandl


tesla = quandl.get('WIKI/TSLA')

# Retrieve the GM data from Quandl


gm = quandl.get('WIKI/GM')
gm.head(5)

# The adjusted close accounts for stock splits, so that is what we should graph
plt.plot(gm.index, gm['Adj. Close'])
plt.title('GM Stock Price')
plt.ylabel('Price ($)');
plt.show()

plt.plot(tesla.index, tesla['Adj. Close'], 'r')


plt.title('Tesla Stock Price')
plt.ylabel('Price ($)');
plt.show();

#In order to compare the companies, we need to compute their market capitalization.
#Quandl does not provide this data, but we can figure out the market cap ourselves
#by multiplying the average number of shares outstanding in each year by the share price.

# Yearly average number of shares outstanding for Tesla and GM


tesla_shares = {2018: 168e6, 2017: 162e6, 2016: 144e6, 2015: 128e6, 2014: 125e6,
                2013: 119e6, 2012: 107e6, 2011: 100e6, 2010: 51e6}

gm_shares = {2018: 1.42e9, 2017: 1.50e9, 2016: 1.54e9, 2015: 1.59e9, 2014: 1.61e9,
             2013: 1.39e9, 2012: 1.57e9, 2011: 1.54e9, 2010: 1.50e9}

# Create a year column


tesla['Year'] = tesla.index.year

# Take Dates from index and move to Date column


tesla.reset_index(level=0, inplace = True)
tesla['cap'] = 0

# Calculate market cap for all years


for i, year in enumerate(tesla['Year']):
    # Retrieve the shares for the year
    shares = tesla_shares.get(year)

    # Update the cap column to shares times the price (.loc replaces the deprecated .ix)
    tesla.loc[i, 'cap'] = shares * tesla.loc[i, 'Adj. Close']

# Create a year column


gm['Year'] = gm.index.year

# Take Dates from index and move to Date column


gm.reset_index(level=0, inplace = True)
gm['cap'] = 0

# Calculate market cap for all years


for i, year in enumerate(gm['Year']):
    # Retrieve the shares for the year
    shares = gm_shares.get(year)

    # Update the cap column to shares times the price (.loc replaces the deprecated .ix)
    gm.loc[i, 'cap'] = shares * gm.loc[i, 'Adj. Close']

# Merge the two datasets and rename the columns


cars = gm.merge(tesla, how='inner', on='Date')
cars.rename(columns={'cap_x': 'gm_cap', 'cap_y': 'tesla_cap'}, inplace=True)
# Select only the relevant columns
cars = cars.loc[:, ['Date', 'gm_cap', 'tesla_cap']]   # .loc replaces the deprecated .ix

# Divide to get market cap in billions of dollars


cars['gm_cap'] = cars['gm_cap'] / 1e9
cars['tesla_cap'] = cars['tesla_cap'] / 1e9

cars.head()

plt.figure(figsize=(10, 8))
plt.plot(cars['Date'], cars['gm_cap'], 'b-', label = 'GM')
plt.plot(cars['Date'], cars['tesla_cap'], 'r-', label = 'TESLA')
plt.xlabel('Date'); plt.ylabel('Market Cap (Billions $)'); plt.title('Market Cap of GM and Tesla')
plt.legend();

import numpy as np

# Find the first and last time Tesla was valued higher than GM
first_date = cars.loc[np.min(list(np.where(cars['tesla_cap'] > cars['gm_cap'])[0])), 'Date']
last_date = cars.loc[np.max(list(np.where(cars['tesla_cap'] > cars['gm_cap'])[0])), 'Date']

print("Tesla was valued higher than GM from {} to {}.".format(first_date.date(), last_date.date()))

import fbprophet

# Prophet requires columns ds (Date) and y (value)


gm = gm.rename(columns={'Date': 'ds', 'cap': 'y'})
# Put market cap in billions
gm['y'] = gm['y'] / 1e9

# Make the prophet models and fit on the data


# changepoint_prior_scale can be changed to achieve a better fit
gm_prophet = fbprophet.Prophet(changepoint_prior_scale=0.05)
gm_prophet.fit(gm)

# Repeat for the tesla data


tesla = tesla.rename(columns={'Date': 'ds', 'cap': 'y'})
tesla['y'] = tesla['y'] / 1e9
tesla_prophet = fbprophet.Prophet(changepoint_prior_scale=0.05,
n_changepoints=10)
tesla_prophet.fit(tesla);

# Make a future dataframe for 2 years


gm_forecast = gm_prophet.make_future_dataframe(periods=365 * 2, freq='D')
# Make predictions
gm_forecast = gm_prophet.predict(gm_forecast)

tesla_forecast = tesla_prophet.make_future_dataframe(periods=365*2, freq='D')


tesla_forecast = tesla_prophet.predict(tesla_forecast)

gm_prophet.plot(gm_forecast, xlabel = 'Date', ylabel = 'Market Cap (billions $)')
plt.title('Market Cap of GM');

tesla_prophet.plot(tesla_forecast, xlabel = 'Date', ylabel = 'Market Cap (billions $)')
plt.title('Market Cap of Tesla');

#Compare Forecasts
#We want to determine when Tesla will overtake GM in total market value.
#We already have the forecasts for two years into the future.
#We will now join them together and determine when the model predicts Tesla will pull ahead.

gm_names = ['gm_%s' % column for column in gm_forecast.columns]


tesla_names = ['tesla_%s' % column for column in tesla_forecast.columns]

# Dataframes to merge
merge_gm_forecast = gm_forecast.copy()
merge_tesla_forecast = tesla_forecast.copy()

# Rename the columns


merge_gm_forecast.columns = gm_names
merge_tesla_forecast.columns = tesla_names

# Merge the two datasets


forecast = pd.merge(merge_gm_forecast, merge_tesla_forecast, how = 'inner',
left_on = 'gm_ds', right_on = 'tesla_ds')

# Rename date column


forecast = forecast.rename(columns={'gm_ds': 'Date'}).drop('tesla_ds', axis=1)
forecast.head()

plt.figure(figsize=(10, 8))
plt.plot(forecast['Date'], forecast['gm_trend'], 'b-')
plt.plot(forecast['Date'], forecast['tesla_trend'], 'r-')
plt.legend(); plt.xlabel('Date'); plt.ylabel('Market Cap (billions $)')
plt.title('GM vs. Tesla Trend');

plt.figure(figsize=(10, 8))
plt.plot(forecast['Date'], forecast['gm_yhat'], 'b-')
plt.plot(forecast['Date'], forecast['tesla_yhat'], 'r-')
plt.legend(); plt.xlabel('Date'); plt.ylabel('Market Cap (billions $)')
plt.title('GM vs. Tesla Estimate');

overtake_date = min(forecast.loc[forecast['tesla_yhat'] > forecast['gm_yhat'], 'Date'])
print('Tesla overtakes GM on {}'.format(overtake_date))

#Trends and Patterns


#Now, we can use the Prophet Models to inspect different trends in the data.
gm_prophet.plot_components(gm_forecast);
tesla_prophet.plot_components(tesla_forecast)

#These graphs show that Tesla tends to increase during the summer and decrease during the winter,
#while GM plummets during the summer and increases during the winter.
#We could compare GM sales with these graphs to see if there is any correlation.

# Read in the sales data


gm_sales = pd.read_csv('C:\\Urbino_MachineLearning\\11.TimeSeries\\gm_sales.csv')
gm_sales.head(5)

# Melt the sales data and rename columns


gm_sales = gm_sales.melt(id_vars='Year', var_name = 'Month', value_name =
'Sales')
gm_sales.head(8)

# Format the data for plotting


gm_sales = gm_sales[gm_sales['Month'] != 'Total']
gm_sales = gm_sales[gm_sales['Year'] > 2010]
gm_sales['Date'] = ['-'.join([str(year), month]) for year, month in
zip(gm_sales['Year'], gm_sales['Month'])]
gm_sales['Date'] = pd.to_datetime(gm_sales['Date'], format = "%Y-%b")
gm_sales.sort_values(by = 'Date', inplace=True)
gm_sales['Month'] = [date.month for date in gm_sales['Date']]

# Plot the sales over the period


plt.plot(gm_sales['Date'], gm_sales['Sales'], 'r');
plt.title('GM Monthly Sales 2011-2017'); plt.ylabel('Sales');

gm_sales_grouped = gm_sales.groupby('Month').mean()
plt.plot(list(range(1, 13)), gm_sales_grouped['Sales']);
plt.xlabel('Month'); plt.ylabel('Sales'); plt.title('GM Average Monthly Sales 2011-2017');

gm_prophet.plot_yearly(); plt.title('GM Yearly Component of Market Cap');


#It does not appear as if there is much correlation between market capitalization
#(a proxy for share price) and sales over the course of a year.

# -*- coding: utf-8 -*-


"""
Created on Thu Aug 16 13:38:45 2018

@author: Dragos
"""

#US vs. China Gross Domestic Product


import quandl
# pandas for data manipulation
import pandas as pd
# Matplotlib for plotting
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns

# My personal api key, use your own


quandl.ApiConfig.api_key = 'kJsMziyRdumxT89k2jyq'

# Get data from quandl for US and China GDP


us_gdp = quandl.get('FRED/GDP', collapse='quarterly', start_date='1950-12-31', end_date='2017-12-31')
china_gdp = quandl.get('ODA/CHN_NGDPD', collapse='yearly', start_date='1950-12-31', end_date='2017-12-31')

us_gdp.plot(title = 'US Gross Domestic Product', legend=None);


plt.ylabel('Billion $');
china_gdp.plot(title = 'China Gross Domestic Product', color = 'r',
legend=None);
plt.ylabel('Billion $');
## Change index to date column
us_gdp = us_gdp.reset_index(level=0)
us_gdp.head(5)

china_gdp = china_gdp.reset_index(level=0)
china_gdp.head(5)

# Merge the two gdp data frames and rename columns


gdp = us_gdp.merge(china_gdp, on = 'Date', how =
'left').rename(columns={'Value_x': 'US', 'Value_y': 'China'})
gdp.head(5)
round(gdp.describe(), 2)

# Fill in missing China observations using backward fill


gdp = gdp.fillna(method='bfill')

plt.plot(gdp['Date'], gdp['US'], label = 'US', color = 'b')


plt.plot(gdp['Date'], gdp['China'], label = 'China', color = 'r')
plt.ylabel('Billions $'); plt.title('US and China GDP'); plt.xlabel('Date');

import fbprophet

# Create a prophet object for each dataframe


us_prophet = fbprophet.Prophet(changepoint_prior_scale=0.2)
china_prophet = fbprophet.Prophet(changepoint_prior_scale=0.2)

# Prophet needs dataframes with a ds (date) and y (variable) column


# Use pandas rename functionality (format is dictionary with {'old': 'new'})
us_gdp = us_gdp.rename(columns={'Date': 'ds', 'Value': 'y'})
china_gdp = china_gdp.rename(columns={'Date': 'ds', 'Value': 'y'})

us_prophet.fit(us_gdp);
china_prophet.fit(china_gdp);

#Compare US changepoints to recessions


#The prophet object only selects changepoints from the first 80% of the data
#which is why the recent recession does not appear.
#We can try and correlate the identified changepoints with actual recessions.
recessions = pd.read_csv('C:\\Urbino_MachineLearning\\11.TimeSeries\\recessions.csv',
                         encoding='latin')
recessions[6:]
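'''
If you want the changepoint search to cover more than the default first 80% of
the history, newer fbprophet releases expose a changepoint_range argument.
A hedged sketch (check that your installed version supports it; us_prophet_wide
is just an illustrative name):
'''

#us_prophet_wide = fbprophet.Prophet(changepoint_prior_scale=0.2, changepoint_range=0.95)
#us_prophet_wide.fit(us_gdp)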
# Make a future dataframe with 50 years of observations
# US dataframe and predictions
us_forecast = us_prophet.make_future_dataframe(periods = 50, freq = 'Y')
us_forecast = us_prophet.predict(us_forecast)

# China dataframe and predictions


china_forecast = china_prophet.make_future_dataframe(periods = 50, freq = 'Y')
china_forecast = china_prophet.predict(china_forecast)

us_prophet.plot(us_forecast)
china_prophet.plot(china_forecast)

#When will China Overtake the United States?

us_names = ['us_%s' % column for column in us_forecast.columns]


china_names = ['china_%s' % column for column in china_forecast.columns]

# Dataframes to merge
merge_us_forecast = us_forecast.copy()
merge_china_forecast = china_forecast.copy()

# Rename the columns


merge_us_forecast.columns = us_names
merge_china_forecast.columns = china_names

# Merge the two datasets


gdp_forecast = pd.merge(merge_us_forecast, merge_china_forecast, how =
'inner', left_on = 'us_ds', right_on = 'china_ds')

# Rename date column


gdp_forecast = gdp_forecast.rename(columns={'us_ds': 'Date'}).drop('china_ds',
axis=1)
gdp_forecast.head()

fig, ax = plt.subplots(1, 1, figsize=(10, 8));


ax.plot(gdp_forecast['Date'], gdp_forecast['us_yhat'], label = 'us prediction');
ax.fill_between(gdp_forecast['Date'].dt.to_pydatetime(),
                gdp_forecast['us_yhat_upper'], gdp_forecast['us_yhat_lower'],
                alpha=0.6, edgecolor = 'k');
ax.plot(gdp_forecast['Date'], gdp_forecast['china_yhat'], 'r', label = 'china prediction');
ax.fill_between(gdp_forecast['Date'].dt.to_pydatetime(),
                gdp_forecast['china_yhat_upper'], gdp_forecast['china_yhat_lower'],
                alpha=0.6, edgecolor = 'k');
plt.legend();
plt.xlabel('Date'); plt.ylabel('Billions $'); plt.title('GDP Prediction for US and China');

first_pass = min(gdp_forecast.loc[gdp_forecast['us_yhat'] < gdp_forecast['china_yhat'], 'Date'])
print('China will overtake the US in GDP on {}.'.format(first_pass))

import warnings
import itertools
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Defaults
plt.rcParams['figure.figsize'] = (20.0, 10.0)
plt.rcParams.update({'font.size': 12})
plt.style.use('ggplot')
# Load the data
data = pd.read_csv('C:\\Urbino_MachineLearning\\11.TimeSeries\\international-airline-passengers.csv',
                   engine='python', skipfooter=3)
# A bit of pre-processing to make it nicer
data['Month']=pd.to_datetime(data['Month'], format='%Y-%m-%d')
data.set_index(['Month'], inplace=True)

# Plot the data


data.plot()
plt.ylabel('Monthly airline passengers (x1000)')
plt.xlabel('Date')
plt.show()

'''
Two obvious patterns appear in the data: an overall increase in the number of
passengers over time, and a 12-month seasonality with peaks corresponding to
the northern hemisphere summer period.
'''
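'''
To make these two patterns explicit, the series can be decomposed into trend,
seasonal and residual components. A hedged sketch using statsmodels
(older versions use freq=12 instead of period=12):
'''

from statsmodels.tsa.seasonal import seasonal_decompose

# Multiplicative decomposition: the seasonal swing grows with the trend
decomposition = seasonal_decompose(data, model='multiplicative', period=12)
decomposition.plot()
plt.show()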

'''
Here we use grid search over all possible combinations of parameter values
within a predefined range of values (heavily inspired by
https://www.digitalocean.com/community/tutorials/a-guide-to-time-series-forecasting-with-arima-in-python-3).
statsmodels.tsa.statespace.sarimax.SARIMAXResults
returns values for AIC (Akaike Information Criterion) and BIC (Bayesian
Information Criterion) that can be minimized to select the best fitting model.
We use the AIC value, which estimates the information lost
when a given model is used to represent the process that generates the data.
In doing so, it deals with the trade-off between the goodness of fit of the model
and the complexity of the model itself.
'''
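'''
For reference, the AIC is computed as 2*k - 2*ln(L_hat), where k is the number
of estimated parameters and L_hat the maximized likelihood; lower is better.
A tiny illustrative helper (the numbers below are made-up examples, not results):
'''

def aic(log_likelihood, n_params):
    # AIC = 2*k - 2*ln(L_hat); log_likelihood is already ln(L_hat)
    return 2 * n_params - 2 * log_likelihood

print(aic(-508.2, 5))   # e.g. 1026.4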

# Define the d and q parameters to take any value between 0 and 1


q = d = range(0, 2)
# Define the p parameters to take any value between 0 and 3
p = range(0, 4)

# Generate all different combinations of p, d and q triplets


pdq = list(itertools.product(p, d, q))

# Generate all different combinations of seasonal p, d and q triplets


seasonal_pdq = [(x[0], x[1], x[2], 12) for x in list(itertools.product(p, d,
q))]

print('Examples of parameter combinations for Seasonal ARIMA...')


print('SARIMAX: {} x {}'.format(pdq[1], seasonal_pdq[1]))
print('SARIMAX: {} x {}'.format(pdq[1], seasonal_pdq[2]))
print('SARIMAX: {} x {}'.format(pdq[2], seasonal_pdq[3]))
print('SARIMAX: {} x {}'.format(pdq[2], seasonal_pdq[4]))

'''
We select a subset of the data series as training data,
say the first 11 years.
Our goal is to predict the last year of the series based on this input.
'''

train_data = data['1949-01-01':'1959-12-01']
test_data = data['1960-01-01':'1960-12-01']

warnings.filterwarnings("ignore") # specify to ignore warning messages

AIC = []
SARIMAX_model = []
for param in pdq:
    for param_seasonal in seasonal_pdq:
        try:
            mod = sm.tsa.statespace.SARIMAX(train_data,
                                            order=param,
                                            seasonal_order=param_seasonal,
                                            enforce_stationarity=False,
                                            enforce_invertibility=False)

            results = mod.fit()

            print('SARIMAX{}x{} - AIC:{}'.format(param, param_seasonal, results.aic), end='\r')
            AIC.append(results.aic)
            SARIMAX_model.append([param, param_seasonal])
        except:
            continue

print('The smallest AIC is {} for model SARIMAX{}x{}'.format(
    min(AIC), SARIMAX_model[AIC.index(min(AIC))][0], SARIMAX_model[AIC.index(min(AIC))][1]))

# Let's fit this model


mod = sm.tsa.statespace.SARIMAX(train_data,
                                order=SARIMAX_model[AIC.index(min(AIC))][0],
                                seasonal_order=SARIMAX_model[AIC.index(min(AIC))][1],
                                enforce_stationarity=False,
                                enforce_invertibility=False)

results = mod.fit()

'''
Once the model has been fitted, we can check whether it does what we expect
and whether the assumptions we made are violated.
To do this, we can use the plot_diagnostics method.
'''

results.plot_diagnostics(figsize=(20, 14))
plt.show()

'''
Results
Now let's create some predictions. We will use three methods:
'''
'''
1) In-sample prediction with 1-step ahead forecasting of the last part of the
   training data (1958-1959). In this case the model is used to predict data
   that the model was built on. 1-step ahead forecasting means that each
   prediction is based on the observed values up to that point, so only one
   step is ever forecast at a time.
'''

pred0 = results.get_prediction(start='1958-01-01', dynamic=False)


pred0_ci = pred0.conf_int()

'''
2) In-sample prediction with dynamic forecasting of the same period (1958-1959).
   Again, the model is used to predict data that the model was built on, but
   here each forecasted point is fed back in to predict the following one.
'''

pred1 = results.get_prediction(start='1958-01-01', dynamic=True)


pred1_ci = pred1.conf_int()

'''
3) "True" forecasting of out-of-sample data. In this case the model is asked
   to predict data it has not seen before.
'''
pred2 = results.get_forecast('1962-12-01')
pred2_ci = pred2.conf_int()
print(pred2.predicted_mean['1960-01-01':'1960-12-01'])

ax = data.plot(figsize=(20, 16))
pred0.predicted_mean.plot(ax=ax, label='1-step-ahead Forecast (get_prediction, dynamic=False)')
pred1.predicted_mean.plot(ax=ax, label='Dynamic Forecast (get_prediction, dynamic=True)')
pred2.predicted_mean.plot(ax=ax, label='Dynamic Forecast (get_forecast)')
ax.fill_between(pred2_ci.index, pred2_ci.iloc[:, 0], pred2_ci.iloc[:, 1],
                color='k', alpha=.1)
plt.ylabel('Monthly airline passengers (x1000)')
plt.xlabel('Date')
plt.legend()
plt.show()

prediction = pred2.predicted_mean['1960-01-01':'1960-12-01'].values
# flatten nested list
truth = list(itertools.chain.from_iterable(test_data.values))
# Mean Absolute Percentage Error
MAPE = np.mean(np.abs((truth - prediction) / truth)) * 100

print('The Mean Absolute Percentage Error for the forecast of year 1960 is {:.2f}%'.format(MAPE))
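'''
Recent scikit-learn releases (0.24+) also provide a ready-made implementation
of this metric, returning a fraction rather than a percentage. A hedged sketch,
assuming such a version is installed:
'''

#from sklearn.metrics import mean_absolute_percentage_error
#print(mean_absolute_percentage_error(truth, prediction) * 100)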
