Introduction To Datascience (R20DS501)

UNIT-4

Regression analysis: linear regression, simple linear regression, multiple and polynomial regression, sparse models.
Unsupervised learning: clustering, similarity and distances, quality measures of clustering, case study.

Regression Analysis

Introduction

In this chapter, we introduce regression analysis and some of its applications in data science. Regression is related to how to make predictions about real-world quantities such as, for instance, the predictions alluded to in the following questions. How does sales volume change with changes in price? How is sales volume affected by the weather? How does the title of a book affect its sales? How does the amount of a drug absorbed vary with the patient's body weight; and does this relationship depend on blood pressure? How many customers can I expect today? At what time should I go home to avoid traffic jams? What is the chance of rain on the next two Mondays; and what is the expected temperature?
All these questions have a common structure: they ask for a response that can be expressed as a combination of one or more (independent) variables (also called covariates or predictors). The role of regression is to build a model to predict the response from the variables. This process involves the transition from data to model. More specifically, the model can be useful in different tasks, such as the following: (1) analyzing the behavior of data (the relation between the response and the variables), (2) predicting data values (whether continuous or discrete), and (3) finding important variables for the model.
In order to understand how a regression model can be suitable for tackling
these tasks, we will introduce three practical cases for which we use three real
datasets and solve different questions. These practical cases will motivate simple
linear regression, multiple linear regression, and logistic regression, as
presented in the following sections.

Fig. 6.1 Illustration of different simple linear regression models. Blue points correspond to a set
of random points sampled from a univariate normal (Gaussian) distribution. Red, green and
yellow lines are three different simple linear regression models

Linear Regression

The objective of performing a regression is to build a model to express the relation between the response y ∈ R^n and a combination of one or more (independent) variables xi ∈ R^n [1]. The model allows us to predict the response y from the variables. The simplest model which can be considered is a linear model, where the response y depends linearly on the d variables xi:

y = a1 x1 + · · · + ad xd.    (6.1)

The parameters ai are termed the coefficients of the model.
This equation can be rewritten in a more compact matrix form: y = Xw, where

y = (y1, y2, . . . , yn)^T,    w = (a1, a2, . . . , ad)^T,

and X is the n × d matrix whose j-th row, (xj1, . . . , xjd), contains the values of the d variables for the j-th sample.
Linear regression is the technique for creating these linear models.
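To fix notation, here is a tiny sketch (a made-up example of our own, not taken from the chapter) of the compact matrix form y = Xw, with n = 3 samples and d = 2 variables:

import numpy as np

# n = 3 samples, d = 2 variables: each row of X holds one sample (toy values)
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.0]])
w = np.array([0.5, -1.0])   # model parameters (a1, a2)

y = X.dot(w)                # the linear model y = Xw
print(y)                    # [-1.5  0.5  0.5]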

Simple Linear Regression

Simple linear regression considers n samples of a single variable x ∈ R^n and describes the relationship between the variable and the response with the model:

y = a0 + a1 x,    (6.2)
where the parameter a0 is called the intercept or the constant term.
Given a set of samples (x, y), such as the set illustrated in Fig. 6.1, we can create
a linear model to explain the data, as in Eq. (6.2). But how do we know which is the

best model (best parameters) for this particular set of samples? See the three
different models (straight lines in different colors) in Fig. 6.1.
Ordinary least squares (OLS) is the simplest and most common estimator, in which the parameters (the a's) are chosen to minimize the square of the distance between the predicted values and the actual values with respect to a0, a1:

||a0 + a1 x − y||₂² = (a0 + a1 x1 − y1)² + · · · + (a0 + a1 xn − yn)².
We are concerned here with the y-axis distance, since it does not consider the error in the variables. This error expression is often called the sum of squared errors of prediction (SSE). The SSE function is quadratic in the parameters, w, with positive-definite Hessian, and therefore this function possesses a unique global minimum at ŵ = (â0, â1). The resulting model is represented as ŷ = â0 + â1 x, where the hats on the variables represent the fact that they are estimated from the data available.
OLS is a popular approach for several reasons. It makes it computationally cheap to calculate the coefficients. It is also easier to interpret than other, more sophisticated models. In situations where the goal is to understand a simple model in detail, rather than to estimate the response well, it can provide insight into what the model captures. Finally, in situations where there is a lot of noise, as in many real scenarios, it may be hard to find the true functional form, so a constrained model can perform quite well compared to a complex model which can be more affected by noise.
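To make the estimation above concrete, the following minimal sketch (our own illustration on synthetic data, not part of the chapter) computes the OLS parameters of Eq. (6.2) in closed form by minimizing the SSE:

import numpy as np

# Synthetic sample: a noisy linear relation (illustrative assumption only)
rng = np.random.RandomState(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(scale=0.5, size=x.shape)

# OLS closed-form solution for y = a0 + a1*x:
# a1 = cov(x, y) / var(x),  a0 = mean(y) - a1 * mean(x)
a1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a0 = y.mean() - a1 * x.mean()

# The SSE evaluated at the estimated parameters
sse = np.sum((a0 + a1 * x - y) ** 2)
print('a0 = %.3f, a1 = %.3f, SSE = %.3f' % (a0, a1, sse))

The same estimates are what a fitted sklearn.linear_model.LinearRegression returns for a single variable, as used in the practical case below.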
Practical Case: Sea Ice Data and Climate Change
In this practical case, we pose the question: Is the climate really changing? More concretely, we want to show the effect of climate change by determining whether the sea ice area (or extent) has decreased over the years. Sea ice area refers to the total area covered by ice, whereas sea ice extent is the area of ocean with at least 15% sea ice. Reliable measurement of sea ice edges began with the satellite era in the late 1970s. Before then, sea ice area and extent were monitored less precisely by a combination of ships, buoys, and aircraft.
We will use the sea ice data from the National Snow & Ice Data Center,1 which provides measurements of the area and extent of sea ice at the poles over the last 36 years. The center has given access to the archived monthly Sea Ice Index images and data since 1979 [2]. The archived data reside at an FTP location2 (the web-page instructions can be followed easily to access and download the files). The ASCII data files tabulate sea ice extent and area (in millions of square kilometers) by year for a given month.
In order to check whether there is an anomaly in the evolution of sea ice extent over recent years, we want to build a simple linear regression model and analyze the fitting; but before that we need to perform several processing steps.

Fig. 6.2 Ice extent data by month

First, we read the data, previously downloaded, and create a DataFrame (Pandas) as follows:

In [1]:
import pandas as pd   # numpy (np) and matplotlib.pyplot (plt) are assumed imported as usual
ice = pd.read_csv('files/ch06/SeaIce.txt', delim_whitespace=True)
print 'shape:', ice.shape

Out[1]: shape: (424, 6)


For data cleaning, we check the values of all the fields to detect any potential error. We find that there is a '-9999' value in the data_type field, which should contain 'Goddard' or 'NRTSI-G' (the type of the input dataset). So we can easily clean the data, removing these instances.

In [2]:
ice2 = ice[ice.data_type != '-9999']

Next, we visualize the data. The lmplot() function from the Seaborn toolbox is intended for exploring linear relationships of different forms in multidimensional datasets. For instance, we can illustrate the relationship between the month of the year (variable) and the extent (response) as follows:

In [3]:
import seaborn as sns
sns.lmplot("mo", "extent", ice2)

This outputs Fig. 6.2. We can observe a monthly fluctuation of the sea ice extent, as would be expected for the different seasons of the year.
We should normalize the data before performing the regression analysis to avoid this fluctuation and be able to study the evolution of the extent over the years. To capture the variation for a given interval of time (month), we can compute the mean for the i-th interval of time (using the period from 1979 through 2014 for the mean extent), μi, and subtract it from the set of extent values for that month, {ei^j}. This value can be converted to a relative percentage difference by dividing it by the total average (1979–2014), μ, and then multiplying by 100:

ẽi^j = 100 · (ei^j − μi) / μ,    i = 1, . . . , 12.

Fig. 6.3 Ice extent data by month after the normalization
We implement this normalization and plot the relationship again as follows:
In [4]:
for i in r a n g e (1 2 ) :
ic e 2 . e x t e n t [ i c e 2 . mo == i + 1 ] =
1 0 0 *( i c e 2 . e x t e n t [ i c e 2 . mo == i +1]
- month_means [ i +1])
/ m o n t h _ m e a n s . m e an ()
sns . l m p l o t ( " m o " , " e x t e n t " , i c e2 )

The new output is in Fig. 6.3. We now observe a comparable range of values for all months.
Next, the normalized values can be plotted for the entire time series to analyze the
tendency. We compute the trend as a simple linear regression. We use the lmplot()
function for visualizing linear relationships between the year (variable) and the extent
(response).

In [5]:
sns.lmplot("year", "extent", ice2)

This outputs Fig. 6.4, showing the regression model fitting the extent data. This plot has two main components. The first is a scatter plot, showing the observed data points. The second is a regression line, showing the estimated linear model relating the two variables. The regression line is plotted with a 95% confidence band to give an impression of the uncertainty in the model.

Fig. 6.4 Regression model fitting sea ice extent data for all months by year using lmplot
In this figure, we can observe that the data show a long-term negative trend over the years. The negative trend can be attributed to global warming, although there is also a considerable amount of variation from year to year.
Up until here, we have qualitatively shown the linear regression using a useful visualization tool. We can also analyze the linear relationship in the data using the Scikit-learn library, which allows a quantitative evaluation. As was explained in the previous chapter, Scikit-learn provides an object-oriented interface centered around the concept of an estimator. The sklearn.linear_model.LinearRegression estimator sets the state of the estimator based on the training data using the function fit. Moreover, it allows the user to specify whether to fit an intercept term in the object construction. This is done by setting the corresponding constructor arguments of the estimator object as follows:
In [6]:
from sklearn.linear_model import LinearRegression
est = LinearRegression(fit_intercept=True)

During the fitting process, the state of the estimator is stored in instance attributes that have a trailing underscore ('_'). For example, the coefficients of a LinearRegression estimator are stored in the attribute coef_. We fit a regression model using the year as the variable (x) and the extent values as the response (y).

In [7]:
x = ice2[['year']]
y = ice2[['extent']]
est.fit(x, y)
print "Coefficients:", est.coef_
print "Intercept:", est.intercept_

Out[7]: Coefficients: [[-0.45275459]]


Intercept: [ 903.71640207]
Estimators that can generate predictions provide an Estimator.predict method.
In the case of regression, Estimator.predict will return the predicted regression
values. We can evaluate the model fitting by computing the mean squared error (MSE) and the coefficient of determination (R²) of the model. The coefficient R² is defined as (1 − u/v), with u = Σ(y − ŷ)² the residual sum of squares and v = Σ(y − ȳ)² the total sum of squares, where ȳ is the mean of the observed response. The best possible score for R² is 1.0; lower values are worse (it can also be negative).
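As a quick numerical illustration of these definitions (a sketch with made-up values of our own, not part of the case study), the MSE and R² can be computed directly from the formulas and checked against sklearn.metrics:

import numpy as np
from sklearn import metrics

# Toy observed and predicted responses (illustrative values only)
y = np.array([4.0, 2.5, 3.0, 5.5])
y_hat = np.array([3.8, 2.9, 3.2, 5.0])

mse = np.mean((y - y_hat) ** 2)        # mean squared error
u = np.sum((y - y_hat) ** 2)           # residual sum of squares
v = np.sum((y - y.mean()) ** 2)        # total sum of squares
r2 = 1 - u / v                         # coefficient of determination

# These match metrics.mean_squared_error(y, y_hat) and metrics.r2_score(y, y_hat)
print(mse)
print(r2)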
These measures can provide a quantitative answer to the question we are facing:
Is there a negative trend in the evolution of sea ice extent over recent years? We
can perform this analysis for a particular month or for all months together, as
done in the following lines:
In [8]:
from sklearn import metrics
y_hat = est.predict(x)
print "MSE:", metrics.mean_squared_error(y_hat, y)
print "R^2:", metrics.r2_score(y_hat, y)
print 'var:', y.var()

Out[8]: MSE: 10.5391316398


R2: 0.50678703821
var: 31.98324
The negative trend seen in Fig. 6.4 is validated by the MSE value which is small,
0.1%, and the R2 value which is acceptable, given the variance of the data, 0.3%.
Given the model, we can also predict the extent value for the coming years. For instance, the predicted extent for January 2025 can be computed as follows:
In [9]:
x = [[2025]]                 # year to predict (2D input for the estimator)
y_hat = est.predict(x)       # predicted (normalized) extent for that year
m = 1                        # January
y_hat = (y_hat * month_means.mean() / 100) + month_means[m]
print "Prediction of extent for January 2025 (in millions of square km):", y_hat

Out[9]: Prediction of extent for January 2025 (in millions of square km): [12.93603933].

Multiple Linear Regression and Polynomial Regression

As we have seen in the previous section, with simple linear regression we describe the relationship between the variable and the response with a straight line. In the case of multiple linear regression, we extend this idea by fitting a d-dimensional hyperplane to our d variables, as defined in Eq. (6.1).
Multiple linear regression may seem a very simple model, but even when the response depends on the variables in nonlinear ways, this model can still be used by considering nonlinear transformations φ(·) of the variables:

y = a1 φ(x1) + · · · + ad φ(xd).

This model is called polynomial regression and it is a popular nonlinear regression technique which models the relationship between the response and the variables as a p-th order polynomial. The higher the order of the polynomial, the more complex the functions you can fit. However, using higher-order polynomials can involve computational complexity and overfitting. Overfitting occurs when a model fits the characteristics of the training data so closely that it loses the capacity to generalize from the seen data to predict the unseen.
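As a sketch of how this can be implemented (our own example on synthetic data; the practical case later in this section uses the order argument of lmplot for the same purpose), a polynomial regression can be fitted by expanding the variables with sklearn.preprocessing.PolynomialFeatures and then applying an ordinary linear model:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data following a quadratic relation plus noise (illustrative only)
rng = np.random.RandomState(0)
x = np.linspace(-3, 3, 100)[:, np.newaxis]
y = 1.0 - 2.0 * x.ravel() + 0.5 * x.ravel() ** 2 + rng.normal(scale=0.3, size=100)

# Expand x into [1, x, x^2] and fit a linear model on the expanded features
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(x)
est = LinearRegression(fit_intercept=False)   # the expansion already contains the constant term
est.fit(X_poly, y)
print(est.coef_)    # approximately [1.0, -2.0, 0.5]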

Sparse Model

Often, in real problems, there are uninformative variables in the data which prevent proper modeling of the problem and thus, the building of a correct regression model. In such cases, a feature selection process is crucial to select only the informative features and discard non-informative ones. This can be achieved by sparse methods, which use a penalization approach such as LASSO (least absolute shrinkage and selection operator) to set some model coefficients to zero (thereby discarding those variables). Sparsity can be seen as an application of Occam's razor: prefer simpler models to complex ones.
Given the set of samples (X, y), the objective of a sparse model is to minimize the SSE through a restriction (or penalty):

(1/2n) ||Xw − y||₂² + α ||w||₁,

where ||w||₁ is the L1-norm of the parameter vector w = (a0, . . . , ad).
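Before turning to the practical case, the following toy sketch (on synthetic data of our own, not from the chapter) illustrates the typical effect of the LASSO penalty: the coefficients of uninformative variables are driven to zero:

import numpy as np
from sklearn.linear_model import Lasso

# 100 samples, 5 variables; only the first two are informative (our own toy setup)
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1)
lasso.fit(X, y)
print(lasso.coef_)   # coefficients of the three uninformative variables are (near) zero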
Practical Case: Prediction of the Price of a New Housing Market
In this practical case we want to solve the question: Can we predict the price of a
new market given any of its attributes?
We will use the Boston housing dataset from Scikit-learn, which provides recorded
measurements of 13 attributes of housing markets around Boston, as well as the
median house price.3 Once we load the dataset (506 instances), the description
of the dataset can easily be shown by printing the field DESCR. The data (x),
feature names, and target (y) are stored in other fields of the dataset.
We first consider the task of predicting median house values in the Boston area using as the variable one of the attributes, for instance, LSTAT, defined as the "proportion of lower status of the population".
Seaborn visualization can be used to show this linear relationship easily:

3Copy of UCI ML housing dataset: http://archive.ics.uci.edu/ml/datasets/Housing.



Fig. 6.5 Scatter plot of Boston data (LSTAT versus price) and their linear relationship (using
lmplot)

In [10]:
from sklearn import datasets
boston = datasets.load_boston()
X_boston, y_boston = boston.data, boston.target
print 'Shape of data:', X_boston.shape, y_boston.shape
print 'Feature names:', boston.feature_names
df_boston = pd.DataFrame(boston.data, columns=boston.feature_names)
df_boston['price'] = boston.target
sns.lmplot("price", "LSTAT", df_boston)

Out[10]:Shape of data: (506L, 13L) (506L,)


Feature names: [’CRIM’ ’ZN’ ’INDUS’ ’CHAS’ ’NOX’ ’RM’ ’AGE’
’DIS’ ’RAD’ ’TAX’ ’PTRATIO’ ’B’ ’LSTAT’]
In Fig. 6.5, we can clearly see that the relationship between price and LSTAT is nonlinear, since the straight line is a poor fit. We can examine whether a better fit can be obtained by including higher-order terms. For example, a quadratic model:

yi ≈ a0 + a1 xi + a2 xi².

The lmplot function allows us to easily change the order of the model, as is done in the next code, which outputs Fig. 6.6, where we observe a better fit.
In [11]:
sns.lmplot("price", "LSTAT", df_boston, order=2)

Fig. 6.6 Scatter plot of Boston data (LSTAT versus price) and their polynomial relationship (using lmplot with order 2)

To study the relation among multiple variables in a dataset, there are different options. We can study the relationship between several variables in a dataset by using the functions corr and heatmap, which calculate a correlation matrix for a dataset and draw a heat map with the correlation values. The heat map is an image of the correlation matrix which helps to interpret the correlations among variables. For the sake of visualization, we do not consider all 13 variables in the Boston housing data, but six: CRIM, per capita crime rate by town; INDUS, proportion of non-retail business acres per town; NOX, nitric oxide concentration (parts per 10 million); RM, average number of rooms per dwelling; AGE, proportion of owner-occupied units built prior to 1940; and LSTAT. These variables are indicated by their indexes in the following code:

In [12]:
indexes = [0, 2, 4, 5, 6, 12]
df2 = pd.DataFrame(boston.data[:, indexes], columns=boston.feature_names[indexes])
df2['price'] = boston.target
corrmat = df2.corr()
sns.heatmap(corrmat, vmax=.8, square=True)

Figure 6.7 shows a heat map representing the correlation between pairs of variables; specifically, the six variables selected and the price of houses. The color bar shows the range of values used in the matrix. This plot is a useful way of summarizing the correlation of several variables. It can be seen that LSTAT and RM are the variables that are most correlated with price.
Another good way to explore multiple variables is the scatter plot from Pandas. The scatter plot is a grid of plots of multiple variables one against the others, illustrating the relationship of each variable with the rest. For the sake of visualization, we do not consider all the variables, but just three: RM, AGE, and LSTAT, defined by indexes in the following code:

In [13]:
indexes = [5, 6, 12]
df2 = pd.DataFrame(boston.data[:, indexes], columns=boston.feature_names[indexes])
df2['price'] = boston.target
pd.scatter_matrix(df2, figsize=(12.0, 12.0))

Fig. 6.7 Correlation plot: heat map representing the correlation between pairs of seven variables in the Boston housing dataset

This code outputs Fig. 6.8, where we obtain visual information concerning the density function of every variable, on the diagonal, as well as the scatter plots of the data points for pairs of variables. In the last column, we can appreciate the relation between the three variables selected and house prices. It can be seen that RM follows a linear relation with price, whereas AGE does not. LSTAT follows a higher-order relation with price. This plot gives us an indication of how good or bad every attribute would be as a variable in a linear model.
For the evaluation of the prediction power of the model with new samples, we
split the data into a training set and a testing set, and we compute the linear
regression score, which returns the coefficient of determination R2 of the
prediction. We can also calculate the MSE.
In [14]:
from sklearn import linear_model
train_size = X_boston.shape[0]/2
X_train = X_boston[:train_size]
X_test = X_boston[train_size:]
y_train = y_boston[:train_size]
y_test = y_boston[train_size:]
print 'Training and testing set sizes', X_train.shape, X_test.shape
regr = LinearRegression()
regr.fit(X_train, y_train)
print 'Coeff and intercept:', regr.coef_, regr.intercept_
print 'Testing Score:', regr.score(X_test, y_test)
print 'Training MSE:', np.mean((regr.predict(X_train) - y_train)**2)
print 'Testing MSE:', np.mean((regr.predict(X_test) - y_test)**2)

Fig. 6.8 Scatter plot of Boston housing dataset

Out[14]:Training and testing set sizes (253, 13) (253, 13)


Coeff and intercept: [ 1.20133313 0.02449686 0.00999508
0.42548672 -8.44272332 8.87767164 -0.04850422 -1.11980855
0.20377571 -0.01597724 -0.65974775 0.01777057 -0.11480104]
-10.0174305829
Testing Score: -2.24420202674
Training MSE: 9.98751732546
Testing MSE: 302.64091133
We can see that all the coefficients obtained are different from zero, meaning
that no variable is discarded. Next, we try to build a sparse model to predict the
price using the most important factors and discarding the non-informative ones. To
do this, we can create a LASSO regressor, forcing zero coefficients.
In [15]:
regr_lasso = linear_model.Lasso(alpha=.3)
regr_lasso.fit(X_train, y_train)
print 'Coeff and intercept:', regr_lasso.coef_, regr_lasso.intercept_
print 'Testing Score:', regr_lasso.score(X_test, y_test)
print 'Training MSE:', np.mean((regr_lasso.predict(X_train) - y_train)**2)
print 'Testing MSE:', np.mean((regr_lasso.predict(X_test) - y_test)**2)

Out[15]: Coeff and intercept: [ 0. 0.01996512 -0. 0. -0. 7.69894744


-0.03444803 -0.79380636 0.0735163 -0.0143421 -0.66768539
0.01547437 -0.22181817] -6.18324183615
Testing Score: 0.501127529021
Training MSE: 10.7343110095
Testing MSE: 46.5381680949
It can now be seen that the result of the model fitting for a set of sparse coefficients is much better than before (using all the variables), with the score increasing from −2.24 to 0.5. This demonstrates that four of the initial variables are not important for the prediction and, in fact, they confuse the regressor.
With the LASSO result, we can also emphasize the most important factors for
determining the price of a new market, based on the coefficient values:
In [16]:
ind = np.argsort(np.abs(regr_lasso.coef_))
print 'Ordered variable (from less to more important):', boston.feature_names[ind]

Out[16]: Ordered variable (from less to more important): [’CRIM’ ’INDUS’ ’CHAS’ ’NOX’ ’TAX’ ’B’ ’ZN’ ’AGE’ ’RAD’ ’LSTAT’ ’PTRATIO’ ’DIS’ ’RM’]
There are also other strategies for feature selection. For instance, we can select the k = 5 best features, according to the k highest scores, using the function SelectKBest from Scikit-learn:
In [17]:
import sklearn.feature_selection as fs
selector = fs.SelectKBest(score_func=fs.f_regression, k=5)
selector.fit_transform(X_train, y_train)
selector.fit(X_train, y_train)
print 'Selected features:', zip(selector.get_support(), boston.feature_names)

Out[17]:Selected features: [(False, ’CRIM’), (False, ’ZN’), (True,


’INDUS’), (False, ’CHAS’), (False, ’NOX’), (True, ’RM’), (True,
’AGE’), (False, ’DIS’), (False, ’RAD’), (False, ’TAX’), (True,’PTRATIO’), (False, ’B’), (True, ’LSTAT’)]
The set of selected features is now different, since the criterion has changed. However, three of the most important features, RM, PTRATIO, and LSTAT, are selected by both approaches.
In order to evaluate the prediction, it could be interesting to visualize the target and predicted responses in a scatter plot, as is done in the next code:

Fig. 6.9 Relation between true (x-axis) and predicted (y-axis) prices

In [18]:
clf = LinearRegression()
clf.fit(boston.data, boston.target)
predicted = clf.predict(boston.data)
plt.scatter(boston.target, predicted, alpha=0.3)
plt.plot([0, 50], [0, 50], '--k')
plt.axis('tight')
plt.xlabel('True price ($1000s)')
plt.ylabel('Predicted price ($1000s)')

The output is shown in Fig. 6.9, where we can observe that the original prices are properly estimated by the predicted ones, except for the higher values, around $50,000 (points in the top right corner).
Finally, it is worth noting that we can work with a statistical evaluation of a linear regression with the OLS toolbox of the StatsModels library.4 This toolbox is useful to study several statistics concerning the regression model. To know more about the toolbox, see the documentation related to StatsModels.

Logistic Regression

Logistic regression is a type of probabilistic statistical classification model. It is used as a binary model to predict a binary response, the outcome of a categorical dependent variable (i.e., a class label), based on one or more variables.
The form of the logistic function is:

f(x) = 1 / (1 + e^(−λx))

Fig. 6.10 Logistic function for different lambda values

Fig. 6.11 Linear regression (blue) versus logistic regression (red) for fitting a set of data (black points)
normally distributed across the 0 and 1 y-values

Figure 6.10 illustrates the logistic function for different values of λ. This function is useful because it can take as its input any value from negative infinity to positive infinity, whereas the output is restricted to values between 0 and 1 and hence can be interpreted as a probability.
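The curves in Fig. 6.10 can be reproduced with a few lines of code; the following sketch (with λ values chosen by us for illustration, not taken from the chapter) evaluates and plots the logistic function:

import numpy as np
import matplotlib.pyplot as plt

def logistic(x, lam):
    # Logistic function f(x) = 1 / (1 + exp(-lambda * x))
    return 1.0 / (1.0 + np.exp(-lam * x))

x = np.linspace(-6, 6, 200)
for lam in [0.5, 1, 2, 5]:              # illustrative lambda values
    plt.plot(x, logistic(x, lam), label='lambda = %g' % lam)
plt.legend()
plt.xlabel('x')
plt.ylabel('f(x)')
plt.show()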
The set of samples (X, y), illustrated as black points in Fig. 6.11, defines a fitting
problem suitable for a logistic regression. The blue and red lines show the fitting
result for linear and logistic models, respectively. In this case, a logistic model can
clearly explain the data; whereas a linear model cannot.
Practical Case: Winning or Losing Football Team
Now, we pose the question: What number of goals makes a football team the winner or the loser? More concretely, we want to predict victory or defeat in a football match when we are given the number of goals a team scores. To do this we consider the set of results of the football matches from the Spanish league5 and we build a classification model with it.
We first read the data file into a DataFrame and select the following columns in a new DataFrame: HomeTeam, AwayTeam, FTHG (home team goals), FTAG (away team goals), and FTR (H = home win, D = draw, A = away win). We then build a d-dimensional vector of variables with all the scores, x, and a binary response indicating victory or defeat, y. For that, we create two extra columns: W, the number of goals of the winning team, and L, the number of goals of the losing team, and we concatenate these data. Finally, we can compute and visualize a logistic regression model to predict the discrete value (victory or defeat) using these data.
In [19]:
from sklearn.linear_model import LogisticRegression
data = pd.read_csv('files/ch06/SP1.csv')
s = data[['HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR']]
def my_f1(row):
    return max(row['FTHG'], row['FTAG'])
def my_f2(row):
    return min(row['FTHG'], row['FTAG'])
s['W'] = s.apply(my_f1, axis=1)
s['L'] = s.apply(my_f2, axis=1)
x1 = s['W'].values
y1 = np.ones(len(x1), dtype=np.int)
x2 = s['L'].values
y2 = np.zeros(len(x2), dtype=np.int)
x = np.concatenate([x1, x2])
x = x[:, np.newaxis]
y = np.concatenate([y1, y2])
logreg = LogisticRegression()
logreg.fit(x, y)
X_test = np.linspace(-5, 10, 300)
def lr_model(x):
    return 1 / (1 + np.exp(-x))
loss = lr_model(X_test * logreg.coef_ + logreg.intercept_).ravel()
X_test2 = X_test[:, np.newaxis]
loss_pred = logreg.predict(X_test2)
plt.scatter(x.ravel(), y, color='black', s=100, zorder=20, alpha=0.03)
plt.plot(X_test, loss, color='blue', linewidth=3)
plt.plot(X_test, loss_pred, color='red', linewidth=3)

Figure 6.12 shows a scatter plot with transparency so we can appreciate the overlapping in the discrete positions of the total numbers of victories and defeats. It also shows the fitting of the logistic regression model, in blue, and the prediction of the logistic regression model, in red, for the Spanish football league results. With this information we can estimate that the cutoff value is 1. This means that a team, in general, has to score more than one goal to win.

5http://www.football-data.co.uk/mmz4281/1213/SP1.csv.
Fig. 6.12 Fitting of the logistic regression model (blue) and prediction of the logistic regression model
(red) for the Spanish football league results
Unsupervised Learning
Introduction

In machine learning, the problem of unsupervised learning is that of trying to find hidden structure in unlabeled data. Since the examples given to the learner are unlabeled, there is no error or reward signal to evaluate the goodness of a potential solution. This distinguishes unsupervised from supervised learning. Unsupervised learning is defined as the task performed by algorithms that learn from a training set of unlabeled or unannotated examples, using the features of the inputs to categorize them according to some geometric or statistical criteria.
Unsupervised learning encompasses many techniques that seek to summarize and explain key features or structures of the data. Many methods employed in unsupervised learning are based on data mining methods used to preprocess data. Most unsupervised learning techniques can be summarized as those that tackle the following four groups of problems:

• Clustering: has as a goal to partition the set of examples into groups.
• Dimensionality reduction: aims to reduce the dimensionality of the data. Here, we encounter techniques such as Principal Component Analysis (PCA), independent component analysis, and nonnegative matrix factorization.
• Outlier detection: has as a purpose to find unusual events (e.g., a malfunction) that distinguish part of the data from the rest according to certain criteria.
• Novelty detection: deals with cases when changes occur in the data (e.g., in streaming data).

The most common unsupervised task is clustering, which we focus on in this chapter.
Clustering

Clustering is a process of grouping similar objects together; i.e., to partition unlabeled examples into disjoint subsets of clusters, such that:

• Examples within a cluster are similar (in this case, we speak of high intraclass
similarity).
• Examples in different clusters are different (in this case, we speak of low interclass
similarity).

When we denote data as similar and dissimilar, we should define a measure for this similarity/dissimilarity. Note that grouping similar data together can help in discovering new categories in an unsupervised manner, even when no sample category labels are provided. Moreover, two kinds of inputs can be used for grouping (a minimal sketch contrasting the two kinds of input is given after this list):

(a) in similarity-based clustering, the input to the algorithm is an n × n dissimilarity matrix or distance matrix;
(b) in feature-based clustering, the input to the algorithm is an n × D feature matrix or design matrix, where n is the number of examples in the dataset and D the dimensionality of each sample.
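The following minimal sketch (with a toy feature matrix of our own, not from the chapter) shows the connection between the two kinds of input: an n × D feature matrix can be converted into the n × n distance matrix required by similarity-based clustering, here using the Euclidean distance from sklearn.metrics.pairwise_distances:

import numpy as np
from sklearn.metrics import pairwise_distances

# Feature-based input: an n x D design matrix (n = 4 samples, D = 2 features; toy values)
X = np.array([[0.0, 0.0],
              [0.0, 1.0],
              [5.0, 5.0],
              [5.0, 6.0]])

# Similarity-based input: the corresponding n x n distance matrix
D = pairwise_distances(X, metric='euclidean')
print(D.shape)          # (4, 4)
print(np.round(D, 2))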

Similarity-based clustering allows easy inclusion of domain-specific similarity, while feature-based clustering has the advantage that it is applicable to potentially noisy data.
Therefore, several questions regarding the clustering process arise.

• What is a natural grouping among the objects? We need to define the “groupness” and the “similarity/distance” between data.
• How can we group samples? What are the best procedures? Are they efficient? Are they fast? Are they deterministic?
• How many clusters should we look for in the data? Shall we state this number a priori? Should the process be completely data driven or can the user guide the grouping process? How can we avoid “trivial” clusters? Should we allow final clustering results to have very large or very small clusters? Which methods work when the number of samples is large? Which methods work when the number of classes is large?
• What constitutes a good grouping? What objective measures can be defined to evaluate the quality of the clusters? A small sketch of one such measure is given after this list.
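As a small illustration of the last question (a sketch under our own assumptions, not part of the chapter's case study), a clustering produced by K-means can be scored with the silhouette coefficient, one commonly used quality measure:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy data: two well-separated groups of points (our own illustrative example)
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
               rng.normal(loc=5.0, scale=0.5, size=(50, 2))])

# Group the samples into k = 2 clusters
labels = KMeans(n_clusters=2, random_state=0).fit_predict(X)

# The silhouette coefficient is one objective quality measure:
# values close to 1 indicate compact, well-separated clusters
print(silhouette_score(X, labels))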

There is not always a single or optimal answer to these questions. It used to be said that clustering is a “subjective” issue. Clustering will help us to describe, analyze, and gain insight into the data, but the quality of the partition depends to a great extent on the application and the analyst.
