Introduction To Datascience (R20DS501)
Regression Analysis
Introduction
Fig. 6.1 Illustration of different simple linear regression models. Blue points correspond to a set
of random points sampled from a univariate normal (Gaussian) distribution. Red, green and
yellow lines are three different simple linear regression models
Linear Regression
Which is the best model (best parameters) for this particular set of samples? See the three different models (straight lines in different colors) in Fig. 6.1.
Ordinary least squares (OLS) is the simplest and most common estimator, in which the parameters (the a's) are chosen to minimize the squared distance between the predicted values and the actual values with respect to a0 and a1:
$$\|a_0 + a_1 x - y\|_2^2 = \sum_{j=1}^{n} (a_0 + a_1 x_j - y_j)^2.$$
We are concerned here only with the y-axis (vertical) distance, so this criterion does not take errors in the explanatory variable into account. This error expression is often called the sum of squared errors of prediction (SSE). The SSE function is quadratic in the parameters, $\mathbf{w} = (a_0, a_1)$, with a positive-definite Hessian, and therefore it possesses a unique global minimum at $\hat{\mathbf{w}} = (\hat{a}_0, \hat{a}_1)$. The resulting model is represented as follows: $\hat{y} = \hat{a}_0 + \hat{a}_1 x$, where the hats on the variables represent the fact that they are estimated from the data available.
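As a quick numerical illustration, the OLS estimates for a simple linear regression can also be computed directly with NumPy using the standard closed-form formulas; a minimal sketch on synthetic data (the sample points below are only for demonstration and are not part of the text):

import numpy as np

# Sketch: closed-form OLS estimates for y = a0 + a1 * x on synthetic data
rng = np.random.RandomState(0)
x = rng.normal(size=100)                                  # illustrative sample points
y = 0.3 + 1.5 * x + rng.normal(scale=0.5, size=100)       # noisy linear response

a1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a0 = y.mean() - a1 * x.mean()
print('Estimated a0, a1: %.3f, %.3f' % (a0, a1))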
OLS is a popular approach for several reasons. It is computationally cheap to calculate the coefficients, and it is easier to interpret than other, more sophisticated models. In situations where the goal is to understand a simple model
in detail, rather than to estimate the response well, it can provide insight into what
the model captures. Finally, in situations where there is a lot of noise, as in many
real scenarios, it may be hard to find the true functional form, so a constrained
model can perform quite well compared to a complex model which can be more
affected by noise.
Practical Case: Sea Ice Data and Climate Change
In this practical case, we pose the question: Is the climate really changing? More concretely, we want to show the effect of climate change by determining whether the sea ice area (or extent) has decreased over the years. Sea ice area refers to the total area covered by ice, whereas sea ice extent is the area of ocean with at least 15% sea ice. Reliable measurement of sea ice edges began with the satellite era in the late 1970s. Before then, sea ice area and extent were monitored less precisely by a combination of ships, buoys, and aircraft.
We will use the sea ice data from the National Snow & Ice Data Center,1 which provides measurements of the area and extent of sea ice at the poles over the last 36 years. The center has given access to the archived monthly Sea Ice Index images and data since 1979 [2]. The archived data reside at an FTP location2 (the web-page instructions can be followed easily to access and download the files). The ASCII data files tabulate sea ice extent and area (in millions of square kilometers) by year for a given month.
In order to check whether there is an anomaly in the evolution of sea ice extent over recent years, we want to build a simple linear regression model and analyze the fitting; but before that, we need to perform several processing steps.
Fig. 6.2 Ice extent data by month
First, we read the data, previously downloaded, and create a DataFrame (Pandas) as follows:

In [1]:
import pandas as pd   # assumed import; pd is used below
ice = pd.read_csv('files/ch06/SeaIce.txt', delim_whitespace=True)
print 'shape:', ice.shape
Next, we visualize the data. The lmplot() function from the Seaborn toolbox is intended for exploring linear relationships of different forms in multidimensional datasets. For instance, we can illustrate the relationship between the month of the year (variable) and the extent (response) as follows:
In [3]:
import seaborn as sns
# ice2: a cleaned version of the ice DataFrame (prepared in a cell not shown here)
sns.lmplot("mo", "extent", ice2)
This outputs Fig. 6.2. We can observe a monthly fluctuation of the sea ice extent, as would be expected for the different seasons of the year.
We should normalize the data before performing the regression analysis to
avoid this fluctuation and be able to study the evolution of the extent over the
years. To capture the variation for a given interval of time (month), we can compute the mean for the i-th interval of time (using the period from 1979 through 2014 for the mean extent), $\mu_i$, and subtract it from the set of extent values for that month, $\{e_i^j\}$. This value can be converted to a relative percentage difference by dividing it by the total average (1979–2014), $\mu$, and then multiplying by 100:

$$\tilde{e}_i^j = 100 * \frac{e_i^j - \mu_i}{\mu}, \quad i = 1, \ldots, 12.$$
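The per-month means (the month_means variable used in the next cell) are not computed anywhere in this excerpt; a minimal sketch of how they could be obtained with a Pandas groupby follows (an assumption, not necessarily the original cell):

# Sketch: mean extent for each month over the whole period, indexed by month number
grouped = ice2.groupby('mo')
month_means = grouped.extent.mean()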
We implement this normalization and plot the relationship again as follows:
In [4]:
for i in range(12):
    ice2.extent[ice2.mo == i+1] = 100 * (ice2.extent[ice2.mo == i+1]
        - month_means[i+1]) / month_means.mean()

sns.lmplot("mo", "extent", ice2)
Fig. 6.3 Ice extent data by month after the normalization

The new output is in Fig. 6.3. We now observe a comparable range of values for all months.
Next, the normalized values can be plotted for the entire time series to analyze the
tendency. We compute the trend as a simple linear regression. We use the lmplot()
function for visualizing linear relationships between the year (variable) and the extent
(response).
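The corresponding code cell is not included in this excerpt; the described plot can be produced with a call such as the following sketch:

# Sketch: linear trend of the normalized extent against the year (Fig. 6.4)
sns.lmplot("year", "extent", ice2)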
This outputs Fig. 6.4 showing the regression model fitting the extent data.

Fig. 6.4 Regression model fitting sea ice extent data for all months by year using lmplot

This plot has two main components. The first is a scatter plot, showing the observed data points. The second is a regression line, showing the estimated linear model relating
the two variables. The regression line is plotted with a 95% confidence band to give an impression of the uncertainty in the model.
In this figure, we can observe that the data show a long-term negative trend over the years. The negative trend can be attributed to global warming, although there is also a considerable amount of variation from year to year.
Up to this point, we have shown the linear regression qualitatively, using a useful visualization tool. We can also analyze the linear relationship in the data using the Scikit-learn library, which allows a quantitative evaluation. As was explained in the previous chapter, Scikit-learn provides an object-oriented interface centered around the concept of an estimator. The sklearn.linear_model.LinearRegression estimator sets the state of the estimator based on the training data using the function fit. Moreover, it allows the user to specify whether to fit an intercept term in the object construction. This is done by setting the corresponding constructor arguments of the estimator object as follows:
In [6]:
from sklearn.linear_model import LinearRegression
est = LinearRegression(fit_intercept=True)
During the fitting process, the state of the estimator is stored in instance attributes that have a trailing underscore ('_'). For example, the coefficients of a LinearRegression estimator are stored in the attribute coef_. We fit a regression model using the years as the variable (x) and the extent values as the response (y).
In [7]:
x = ice2[['year']]
y = ice2[['extent']]
est.fit(x, y)
print 'Coefficients:', est.coef_
print 'Intercept:', est.intercept_
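The input cells that produce the next output are not shown in this excerpt. As an illustration only (not the book's code), a prediction for a future year can be obtained from a fitted LinearRegression estimator with predict:

# Illustration (not the original cell): predict the response for the year 2025.
# The printed figure below suggests the estimator used for this prediction was
# fitted on the raw January extent values (millions of square km); this is an assumption.
X_future = [[2025]]
y_future = est.predict(X_future)
print('Prediction of extent for January 2025 (in millions of square km): %s' % y_future)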
Out[9]: Prediction of extent for January 2025 (in millions of square km): [12.93603933].
Sparse Model
Often, in real problems, there are uninformative variables in the data which prevent proper modeling of the problem and thus the building of a correct regression model. In such cases, a feature selection process is crucial to select only the informative features and discard the non-informative ones. This can be achieved by sparse methods, which use a penalization approach such as LASSO (least absolute shrinkage and selection operator) to set some model coefficients to zero (thereby discarding those variables). Sparsity can be seen as an application of Occam's razor: prefer simpler models to complex ones.
Given the set of samples (X, y), the objective of a sparse model is to minimize
the SSE through a restriction (or penalty):
$$\frac{1}{2n} \|Xw - y\|_2^2 + \alpha \|w\|_1,$$

where $\|w\|_1$ is the L1-norm of the parameter vector $w = (a_0, \ldots, a_d)$.
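Scikit-learn provides this penalized estimator as sklearn.linear_model.Lasso; a minimal sketch of its use follows (the value of alpha is illustrative, and X and y are an assumed feature matrix and response, not objects defined in the text):

from sklearn.linear_model import Lasso

# Sketch: L1-penalized regression; alpha controls the strength of the penalty.
# X (feature matrix) and y (response) are assumed to be available.
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)
print(lasso.coef_)   # with the L1 penalty, several coefficients are driven to exactly zero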
Practical Case: Prediction of the Price of a New Housing Market
In this practical case we want to solve the question: Can we predict the price of a
new market given any of its attributes?
We will use the Boston housing dataset from Scikit-learn, which provides recorded
measurements of 13 attributes of housing markets around Boston, as well as the
median house price.3 Once we load the dataset (506 instances), the description
of the dataset can easily be shown by printing the field DESCR. The data (x),
feature names, and target (y) are stored in other fields of the dataset.
We first consider the task of predicting median house values in the Boston area using as the variable one of the attributes, for instance, LSTAT, defined as the "proportion of lower status of the population".
Seaborn visualization can be used to show this linear relationship easily:
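The corresponding cells are not reproduced above. A sketch of loading the dataset and plotting the LSTAT versus price relationship follows; the DataFrame name df_boston is an assumption for illustration, and note that load_boston has been removed from recent Scikit-learn releases:

from sklearn import datasets
import pandas as pd
import seaborn as sns

boston = datasets.load_boston()             # 506 instances, 13 attributes
print(boston.DESCR[:200])                   # short preview of the dataset description

# Build a DataFrame with the features and the target price for plotting
df_boston = pd.DataFrame(boston.data, columns=boston.feature_names)
df_boston['price'] = boston.target
sns.lmplot('LSTAT', 'price', df_boston)     # linear fit of price against LSTAT (Fig. 6.5)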
Fig. 6.5 Scatter plot of Boston data (LSTAT versus price) and their linear relationship (using lmplot)

Fig. 6.6 Scatter plot of Boston data (LSTAT versus price) and their polynomial relationship (using lmplot with order 2)
To study the relation among multiple variables in a dataset, there are different options. We can study the relationship between several variables by using the functions corr and heatmap, which allow us to calculate a correlation matrix for a dataset and draw a heat map with the correlation values. The heat map is a matrix image which helps to interpret the correlations among variables. For the sake of visualization, we do not consider all 13 variables in the Boston housing data, but only six: CRIM, per capita crime rate by town; INDUS, proportion of non-retail business acres per town; NOX, nitric oxide concentrations (parts per 10 million); RM, average number of rooms per dwelling; AGE, proportion of owner-occupied units built prior to 1940; and LSTAT. These variables are indicated by their indexes in the following code:
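The cell with the indexes is not included in the excerpt; a sketch of computing and drawing such a correlation heat map follows (the index values assume the standard column ordering of the Boston dataset, and df2 is an illustrative name):

import pandas as pd
import seaborn as sns

# Sketch: correlation heat map for the six variables above plus the price (Fig. 6.7)
indexes = [0, 2, 4, 5, 6, 12]                      # CRIM, INDUS, NOX, RM, AGE, LSTAT
df2 = pd.DataFrame(boston.data[:, indexes],
                   columns=[boston.feature_names[i] for i in indexes])
df2['price'] = boston.target
corrmat = df2.corr()                               # correlation matrix
sns.heatmap(corrmat, vmax=0.8, square=True)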
Figure 6.7 shows a heat map representing the correlation between pairs of variables; specifically, the six variables selected and the price of houses. The color bar shows the range of values used in the matrix. This plot is a useful way of summarizing the correlation of several variables. It can be seen that LSTAT and RM are the variables that are most correlated with price.
Another good way to explore multiple variables is the scatter plot from Pandas. The scatter plot is a grid of plots of multiple variables, one against the others, illustrating the relationship of each variable with the rest. For the sake of visualization, we do not consider all the variables, but just three: RM, AGE, and LSTAT, defined by their indexes in the following code:
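Again the cell is not shown; a sketch using the Pandas scatter-matrix plot follows (the column selection matches the three variables named; in recent Pandas versions the function lives in pandas.plotting.scatter_matrix):

import pandas as pd

# Sketch: scatter matrix of RM, AGE, LSTAT and the price (Fig. 6.8);
# the kernel density estimates appear in the diagonal.
indexes = [5, 6, 12]                               # RM, AGE, LSTAT
df3 = pd.DataFrame(boston.data[:, indexes],
                   columns=[boston.feature_names[i] for i in indexes])
df3['price'] = boston.target
pd.scatter_matrix(df3, figsize=(12, 12), diagonal='kde')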
This code outputs Fig. 6.8, where we obtain visual information concerning the
density function for every variable, in the diagonal, as well as the scatter plots of
the data points for pairs of variables. In the last column, we can appreciate the
relation between the three variables selected and house prices. It can be seen that
RM follows a linear relation with price, whereas AGE does not. LSTAT follows a higher-order relation with price. This plot gives us an indication of how good or bad
every attribute would be as a variable in a linear model.
For the evaluation of the prediction power of the model with new samples, we
split the data into a training set and a testing set, and we compute the linear
regression score, which returns the coefficient of determination, R², of the prediction. We can also calculate the MSE (mean squared error).
In [14]:
from sklearn import linear_model
import numpy as np   # assumed import; np is used for the MSE computations below

# X_boston, y_boston: Boston feature matrix and target prices, assumed to have
# been prepared in a cell not shown in this excerpt
train_size = X_boston.shape[0] / 2
X_train = X_boston[:train_size]
X_test = X_boston[train_size:]
y_train = y_boston[:train_size]
y_test = y_boston[train_size:]
print 'Training and testing set sizes', X_train.shape, X_test.shape
regr = LinearRegression()
regr.fit(X_train, y_train)
print 'Coeff and intercept:', regr.coef_, regr.intercept_
print 'Testing Score:', regr.score(X_test, y_test)
print 'Training MSE:', np.mean((regr.predict(X_train) - y_train) ** 2)
print 'Testing MSE:', np.mean((regr.predict(X_test) - y_test) ** 2)
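The cell that produced the next output is not included in the excerpt. One plausible reconstruction, offered as an assumption rather than the book's exact code, is to rank the features by the absolute value of the coefficients of a linear model fitted on standardized features:

import numpy as np

# Sketch (assumption): order features by coefficient magnitude after standardizing
# each column, so that the magnitudes are comparable across features.
X_std = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)
regr_std = LinearRegression()
regr_std.fit(X_std, y_train)
order = np.argsort(np.abs(regr_std.coef_.ravel()))
print(np.array(boston.feature_names)[order])       # from less to more important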
Out[16]: Ordered variable (from less to more important): ['CRIM' 'INDUS' 'CHAS' 'NOX' 'TAX' 'B' 'ZN' 'AGE' 'RAD' 'LSTAT' 'PTRATIO' 'DIS' 'RM']
There are also other strategies for feature selection. For instance, we can select the k = 5 best features, according to the k highest scores, using the function SelectKBest from Scikit-learn:
In [17]:
import sklearn.feature_selection as fs
selector = fs.SelectKBest(score_func=fs.f_regression, k=5)
selector.fit_transform(X_train, y_train)
selector.fit(X_train, y_train)
print 'Selected features:', zip(selector.get_support(), boston.feature_names)
Fig. 6.9 Relation between true (x-axis) and predicted (y-axis) prices
In [18]:
import matplotlib.pyplot as plt   # assumed import; plt is used below

clf = LinearRegression()
clf.fit(boston.data, boston.target)
predicted = clf.predict(boston.data)
plt.scatter(boston.target, predicted, alpha=0.3)
plt.plot([0, 50], [0, 50], '--k')
plt.axis('tight')
plt.xlabel('True price ($1000s)')
plt.ylabel('Predicted price ($1000s)')
The output is shown in Fig. 6.9, where we can observe that the original prices are properly estimated by the predicted ones, except for the higher values, around $50,000 (points in the top right corner).
Finally, it is worth noting that we can perform a statistical evaluation of a linear regression with the OLS tools of the StatsModels toolbox.4 This toolbox is useful for studying several statistics concerning the regression model. To know more about the toolbox, see the StatsModels documentation.
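For instance, a minimal sketch of fitting a regression with StatsModels and inspecting its statistics is shown below; X and y stand for an assumed design matrix and response (this is an illustration, not code from the text):

import statsmodels.api as sm

# Sketch: OLS fit with a full statistical summary (coefficients, t-tests,
# confidence intervals, R-squared)
X_const = sm.add_constant(X)          # add the intercept column
ols_model = sm.OLS(y, X_const).fit()
print(ols_model.summary())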
Logistic Regression
Fig. 6.11 Linear regression (blue) versus logistic regression (red) for fitting a set of data (black points)
normally distributed across the 0 and 1 y-values
Figure 6.10 illustrates the logistic function with different values of λ. This function is useful because it can take as its input any value from negative infinity to positive infinity, whereas the output is restricted to values between 0 and 1 and hence can be interpreted as a probability.
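The logistic function referred to above is commonly written as f(x) = 1 / (1 + e^{-λx}). A minimal sketch reproducing a figure in the spirit of Fig. 6.10 follows (the λ values are illustrative choices, not those used in the book):

import numpy as np
import matplotlib.pyplot as plt

def logistic(t, lam):
    # Logistic (sigmoid) function with steepness parameter lambda
    return 1.0 / (1.0 + np.exp(-lam * t))

t = np.linspace(-10, 10, 200)
for lam in [0.5, 1, 2, 4]:                 # illustrative values of lambda
    plt.plot(t, logistic(t, lam), label='lambda = %s' % lam)
plt.legend()
plt.show()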
The set of samples (X, y), illustrated as black points in Fig. 6.11, defines a fitting
problem suitable for a logistic regression. The blue and red lines show the fitting
result for linear and logistic models, respectively. In this case, a logistic model can
clearly explain the data; whereas a linear model cannot.
Practical Case: Winning or Losing Football Team
Now, we pose the question: What number of goals makes a football team the
winner or the loser? More concretely, we want to predict victory or defeat in a
football match when we are given the number of goals a team scores. To do this, we consider the set of results of the football matches from the Spanish league5 and we build a
classification model with it.
We first read the data file into a DataFrame and select the following columns in a new DataFrame: HomeTeam, AwayTeam, FTHG (home team goals), FTAG (away team goals), and FTR (H = home win, D = draw, A = away win). We then build a d-dimensional vector of variables with all the scores, x, and a binary response indicating victory or defeat, y. For that, we create two extra columns containing W, the number of goals of the winning team, and L, the number of goals of the losing team, and we concatenate these data. Finally, we can compute and visualize a logistic regression model to predict the discrete value (victory or defeat) using these data.
In [19]:
from sklearn.linear_model import LogisticRegression

data = pd.read_csv('files/ch06/SP1.csv')
s = data[['HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR']]

def my_f1(row):
    return max(row['FTHG'], row['FTAG'])

def my_f2(row):
    return min(row['FTHG'], row['FTAG'])

s['W'] = s.apply(my_f1, axis=1)
s['L'] = s.apply(my_f2, axis=1)
x1 = s['W'].values
y1 = np.ones(len(x1), dtype=np.int)
x2 = s['L'].values
y2 = np.zeros(len(x2), dtype=np.int)
x = np.concatenate([x1, x2])
x = x[:, np.newaxis]
y = np.concatenate([y1, y2])
logreg = LogisticRegression()
logreg.fit(x, y)
X_test = np.linspace(-5, 10, 300)

def lr_model(x):
    return 1 / (1 + np.exp(-x))

loss = lr_model(X_test * logreg.coef_ + logreg.intercept_).ravel()
X_test2 = X_test[:, np.newaxis]
losspred = logreg.predict(X_test2)
plt.scatter(x.ravel(), y, color='black', s=100, zorder=20, alpha=0.03)
plt.plot(X_test, loss, color='blue', linewidth=3)
plt.plot(X_test, losspred, color='red', linewidth=3)
Figure 6.12 shows a scatter plot with transparency so we can appreciate the overlapping in the discrete positions of the total numbers of victories and defeats. It also shows the fitting of the logistic regression model, in blue, and the prediction of the logistic regression model, in red, for the Spanish football league results. With this information we can estimate that the cutoff value is 1. This means that a team, in general, has to score more than one goal to win.
5http://www.football-data.co.uk/mmz4281/1213/SP1.csv.
Fig. 6.12 Fitting of the logistic regression model (blue) and prediction of the logistic regression model
(red) for the Spanish football league results
Unsupervised Learning
Introduction
• Examples within a cluster are similar (in this case, we speak of high intraclass
similarity).
• Examples in different clusters are different (in this case, we speak of low interclass
similarity).
When we denote data as similar and dissimilar, we should define a measure for this similarity/dissimilarity. Note that grouping similar data together can help in discovering new categories in an unsupervised manner, even when no sample category labels are provided. Moreover, two kinds of inputs can be used for grouping: a matrix of pairwise similarities (or distances) between samples, or a feature matrix describing each sample. When grouping data, several questions arise:
• What is a natural grouping among the objects? We need to define the “groupness”
and the “similarity/distance” between data.
• How can we group samples? What are the best procedures? Are they efficient?
Are they fast? Are they deterministic?
• How many clusters should we look for in the data? Shall we state this number a priori? Should the process be completely data driven, or can the user guide the grouping process? How can we avoid "trivial" clusters? Should we allow final clustering results to have very large or very small clusters? Which methods work when the number of samples is large? Which methods work when the number of classes is large?
• What constitutes a good grouping? What objective measures can be defined to
evaluate the quality of the clusters?