Unit 5
Data Analysis
1. Define problem or question
2. Specify model
3. Collect data
4. Do descriptive data analysis
5. Estimate unknown parameters
6. Evaluate model
7. Use model for prediction
Simple vs. Multiple
Simple:
• β₁ represents the unit change in Y per unit change in X.
• Does not take into account any other variable besides the single independent variable.
Multiple:
• βᵢ represents the unit change in Y per unit change in Xᵢ.
• Takes into account the effect of the other Xᵢ's.
• "Net regression coefficient."
Linearity: the Y variable is linearly related to the value of the X variable.
y = β₀ + β₁x + ε
where:
β₀ and β₁ are called parameters of the model,
ε is a random variable called the error term.
Simple Linear Regression Equation
E(y) = β₀ + β₁x
[Figures: the regression line E(y) = β₀ + β₁x plotted against x, showing the intercept β₀ and slope β₁ in three cases: positive slope, negative slope, and no relationship (β₁ = 0).]
Least Squares Method
• Least Squares Criterion
min Σ(yᵢ − ŷᵢ)²
where:
yᵢ = observed value of the dependent variable for the ith observation
ŷᵢ = estimated value of the dependent variable for the ith observation
Least Squares Method
• Slope for the Estimated Regression Equation
b₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²
where:
xᵢ = value of the independent variable for the ith observation
yᵢ = value of the dependent variable for the ith observation
x̄ = mean value of the independent variable
ȳ = mean value of the dependent variable
Least Squares Method
• y-Intercept for the Estimated Regression Equation
b₀ = ȳ − b₁x̄
Example data:
x: 2, 5, 3, 5, 1, 6
y: 4, 7, 6, 8, 4, 9
Simple Linear Regression
Number of TV Ads (x) | Number of Cars Sold (y)
1 | 14
3 | 24
2 | 18
1 | 17
3 | 27
Σx = 10, Σy = 100
x̄ = 2, ȳ = 20
Estimated Regression Equation
Slope for the Estimated Regression Equation:
b₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² = 20/4 = 5
y-Intercept: b₀ = ȳ − b₁x̄ = 20 − 5(2) = 10, so the estimated regression equation is ŷ = 10 + 5x.
Relationship Among SST, SSR, SSE
Σ(yᵢ − ȳ)² = Σ(ŷᵢ − ȳ)² + Σ(yᵢ − ŷᵢ)²
that is, SST = SSR + SSE
where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error
Coefficient of Determination
r2 = SSR/SST
where:
SSR = sum of squares due to regression
SST = total sum of squares
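Continuing the same sketch (again an assumed illustration, not from the slides), the sums of squares and r² for the TV-ads example, using the fitted line ŷ = 10 + 5x:

```python
import numpy as np

x = np.array([1, 3, 2, 1, 3])
y = np.array([14, 24, 18, 17, 27])
y_hat = 10 + 5 * x                       # fitted values from b0 = 10, b1 = 5

sst = ((y - y.mean()) ** 2).sum()        # total sum of squares: 114.0
ssr = ((y_hat - y.mean()) ** 2).sum()    # regression sum of squares: 100.0
sse = ((y - y_hat) ** 2).sum()           # error sum of squares: 14.0

assert np.isclose(sst, ssr + sse)        # SST = SSR + SSE
print(ssr / sst)                         # r^2 = 100/114 ≈ 0.877
```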
Supervised vs. unsupervised
• Supervised: e.g., a fruit classifier trained on examples with known labels (oranges, apples, bananas).
• Unsupervised: e.g., a fruit classifier that has seen lots of examples but no proper labels; it groups similar fruits, like clustering "red fruits" or "fruit with soft skin".
Supervised learning
• Classification: draw conclusions such as spam or not, red or blue.
  Algorithms: Naïve Bayes, Decision tree, SVM.
• Regression: the output variable is real or continuous values, such as marks or weight.
  Algorithms: Linear regression, Polynomial regression, SVM Regression.
What Is Naive Bayes?
Medical Diagnosis
• Given a list of symptoms, predict whether a patient has disease X or not
Weather
• Based on temperature, humidity, etc., predict if it will rain tomorrow
[Diagram: features x₁ and x₂ feed into a label Y.]
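For reference, the decision rule behind these examples is the standard Naive Bayes formula (stated here for completeness; it assumes the features are conditionally independent given the label):

```latex
P(Y \mid x_1, \dots, x_n) \;\propto\; P(Y) \prod_{i=1}^{n} P(x_i \mid Y)
```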
NAÏVE BAYES EXAMPLE:
To predict days suitable for a football match based on weather conditions.
Smaller circle: low probability of play (P < 0.5).
Bigger circle: high probability of play (P > 0.5).
Combining both conditions gives the probability of the class: a prior probability of 0.60.
Predict the likelihood of playing football given (Season = Winter, Sunny = No, Windy = Yes).
What is the probability of the match not being played?
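A minimal sketch of this computation, assuming hypothetical conditional probabilities (the slide's actual tables are not reproduced here, so every number below is a placeholder to be replaced with values estimated from the dataset):

```python
# HYPOTHETICAL values, for illustration only.
priors = {"play": 0.60, "no_play": 0.40}          # prior probability of each class

cond = {  # P(feature value | class), hypothetical placeholders
    "play":    {"season=winter": 0.20, "sunny=no": 0.40, "windy=yes": 0.30},
    "no_play": {"season=winter": 0.50, "sunny=no": 0.60, "windy=yes": 0.70},
}

evidence = ["season=winter", "sunny=no", "windy=yes"]

# Naive Bayes: score each class as prior * product of conditionals,
# then normalize the scores so they sum to 1.
scores = {}
for c in priors:
    score = priors[c]
    for e in evidence:
        score *= cond[c][e]
    scores[c] = score

total = sum(scores.values())
posteriors = {c: s / total for c, s in scores.items()}
print(posteriors)   # P(play | evidence) and P(no_play | evidence)
```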
Face recognition
Mail classification
Handwriting analysis
Salary prediction
Statistical Learning: Bayesian Network
Bayesian Network
A simple graphical representation for a joint probability distribution.
• Nodes are random variables
• Directed edges between nodes reflect dependence
Syntax:
– a set of nodes, one per variable
– a directed, acyclic graph (link ≈ "directly influences")
– if there is a link from x to y, x is said to be a parent of y
– a conditional distribution for each node given its parents:
P(Xᵢ | Parents(Xᵢ))
Find the probability that John calls and Mary calls and the alarm went off, and no burglary and no earthquake happened:
P(J, M, A, ¬B, ¬E)
= P(J | A) · P(M | A) · P(A | ¬B, ¬E) · P(¬B) · P(¬E)
= 0.90 × 0.70 × 0.001 × 0.999 × 0.998
≈ 0.00063
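A small sketch reproducing the chain-rule computation above in Python (the CPT values are the ones used on the slide):

```python
# Classic burglary/earthquake alarm network, values as on the slide.
p_j_given_a   = 0.90    # P(John calls | alarm)
p_m_given_a   = 0.70    # P(Mary calls | alarm)
p_a_given_nbe = 0.001   # P(alarm | no burglary, no earthquake)
p_not_b       = 0.999   # P(no burglary)
p_not_e       = 0.998   # P(no earthquake)

# Chain rule over the network: each node conditioned only on its parents.
p_joint = p_j_given_a * p_m_given_a * p_a_given_nbe * p_not_b * p_not_e
print(p_joint)   # ≈ 0.00063
```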
Inference and Bayesian Networks
• Bayesian networks are a type of probabilistic
graphical model that uses Bayesian inference for
probability computations.
• Bayesian networks aim to model conditional
dependence, and therefore causation, by
representing conditional dependence by edges in a
directed graph.
• Through these relationships, one can efficiently
conduct inference on the random variables in the
graph through the use of factors.
Inference and Bayesian Networks
• A Bayesian network is a directed acyclic graph in
which each edge corresponds to a conditional
dependency, and each node corresponds to a unique
random variable.
• Formally, if an edge (A, B) exists in the graph
connecting random variables A and B, it means that
P(B|A) is a factor in the joint probability distribution,
so we must know P(B|A) for all values of B and A in
order to conduct inference.
• In the above example, since Rain has an edge going
into WetGrass, it means that P(WetGrass|Rain) will be
a factor, whose probability values are specified next to
the WetGrass node in a conditional probability table.
• Support Vector Machine, abbreviated as SVM, can be used for both regression and classification tasks.
• In logistic regression, we take the output of the linear function and squash the value into the range [0, 1] using the sigmoid function.
• If the squashed value is greater than a threshold value (0.5), we assign it the label 1; otherwise we assign it the label 0.
• In SVM, we take the output of the linear function directly: if that output is greater than 1, we identify the point with one class, and if the output is less than −1, we identify it with the other class.
• Since the threshold values are changed to 1 and −1 in SVM, we obtain a reinforcement range of values ([−1, 1]) which acts as the margin.
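A toy sketch of the two decision rules just described (the function names are illustrative, not from any library):

```python
import numpy as np

def logistic_label(w, b, x):
    # Logistic regression: squash w.x + b through a sigmoid, threshold at 0.5.
    p = 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))
    return 1 if p > 0.5 else 0

def svm_side(w, b, x):
    # SVM: use the raw score; |score| >= 1 means the point lies at or
    # beyond the margin, scores in (-1, 1) fall inside the margin.
    score = np.dot(w, x) + b
    if score >= 1:
        return 1
    if score <= -1:
        return -1
    return 0   # inside the margin

w, b = np.array([2.0, -1.0]), 0.5
print(logistic_label(w, b, np.array([1.0, 1.0])))   # sigmoid(1.5) > 0.5 -> 1
print(svm_side(w, b, np.array([1.0, 1.0])))         # score 1.5 >= 1 -> 1
```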
[Figure: two candidate linear separators; one maximizes the margin between the classes, the other has a narrower margin.]
SVM
Geometric Margin
• Distance from an example to the separator is r = y(wᵀx + b)/|w|.
• Examples closest to the hyperplane are support vectors.
• Margin ρ of the separator is the width of separation between support vectors of the classes.
Derivation of finding r:
The segment from x to its projection x′ on the decision boundary is perpendicular to the boundary, so it is parallel to w. The unit vector in that direction is w/|w|, so the segment is rw/|w|, and
x′ = x − yrw/|w|.
x′ satisfies wᵀx′ + b = 0.
So wᵀ(x − yrw/|w|) + b = 0.
Recall that |w| = √(wᵀw).
So wᵀx − yr|w| + b = 0.
Solving for r gives:
r = y(wᵀx + b)/|w|
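A quick numeric sanity check of this result (w, b, x, and y are made-up values, assuming numpy):

```python
import numpy as np

# Check r = y (w^T x + b) / |w| on a hand-picked example.
w = np.array([3.0, 4.0])    # |w| = 5
b = -5.0
x = np.array([3.0, 4.0])    # w^T x + b = 25 - 5 = 20
y = 1                       # assume x is on the positive side

r = y * (w @ x + b) / np.linalg.norm(w)       # distance: 20 / 5 = 4
x_proj = x - y * r * w / np.linalg.norm(w)    # projection x' onto the boundary
print(r, w @ x_proj + b)                      # 4.0, 0.0 (x' lies on the boundary)
```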
• For each training example (xᵢ, yᵢ):
wᵀxᵢ + b ≥ 1 if yᵢ = 1
wᵀxᵢ + b ≤ −1 if yᵢ = −1
• Hyperplane: wᵀx + b = 0. For support vectors on the two margin boundaries:
wᵀxa + b = 1
wᵀxb + b = −1
• This implies:
wᵀ(xa − xb) = 2
ρ = ||xa − xb||₂ = 2/||w||₂
Solving the Optimization Problem
Find w and b such that
Φ(w) =½ wTw is minimized;
and for all {(xi ,yi)}: yi (wTxi + b) ≥ 1
f(x) = ΣαiyixiTx + b
• Notice that it relies on an inner product between the test point x and the support vectors xi
• We will return to this later.
• Also keep in mind that solving the optimization problem involved computing the inner products xiTxj
between all pairs of training points.
78
Classification with SVMs
• The most “important” training points are the support vectors; they define the hyperplane.
• Quadratic optimization algorithms can identify which training points xi are support vectors with
non-zero Lagrangian multipliers αi.
• Both in the dual formulation of the problem and in the solution, training points appear only inside inner products: maximize Σαᵢ − ½ΣᵢΣⱼ αᵢαⱼyᵢyⱼ(xᵢᵀxⱼ) subject to Σαᵢyᵢ = 0 and αᵢ ≥ 0, and classify with f(x) = Σαᵢyᵢ(xᵢᵀx) + b.
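As a hedged scikit-learn sketch of these ideas (assumes scikit-learn is installed; the toy points are made up), the fitted model exposes the support vectors and their αᵢyᵢ coefficients directly:

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, linearly separable 2-D points for illustration.
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ≈ hard-margin SVM

print(clf.support_vectors_)   # the "important" training points
print(clf.dual_coef_)         # alpha_i * y_i for each support vector
```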
Non-linear SVMs
• Datasets that are linearly separable (with some noise) work out great:
[Figures: one-dimensional datasets plotted along the x-axis: a linearly separable case and a harder case that is not linearly separable.]
Non-linear SVMs: Feature spaces
General idea: map the original feature space to some higher-dimensional feature space where the training set is separable: Φ: x → φ(x)
The “Kernel Trick”
• The linear classifier relies on an inner product between vectors: K(xᵢ,xⱼ) = xᵢᵀxⱼ.
• If every datapoint is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the inner product becomes:
K(xᵢ,xⱼ) = φ(xᵢ)ᵀφ(xⱼ)
• A kernel function is some function that corresponds to an inner product in some expanded feature space.
• Example:
2-dimensional vectors x = [x₁ x₂]; let K(xᵢ,xⱼ) = (1 + xᵢᵀxⱼ)².
Need to show that K(xᵢ,xⱼ) = φ(xᵢ)ᵀφ(xⱼ):
K(xᵢ,xⱼ) = (1 + xᵢᵀxⱼ)² = 1 + xᵢ₁²xⱼ₁² + 2xᵢ₁xⱼ₁xᵢ₂xⱼ₂ + xᵢ₂²xⱼ₂² + 2xᵢ₁xⱼ₁ + 2xᵢ₂xⱼ₂
= [1, xᵢ₁², √2 xᵢ₁xᵢ₂, xᵢ₂², √2 xᵢ₁, √2 xᵢ₂]ᵀ [1, xⱼ₁², √2 xⱼ₁xⱼ₂, xⱼ₂², √2 xⱼ₁, √2 xⱼ₂]
= φ(xᵢ)ᵀφ(xⱼ), where φ(x) = [1, x₁², √2 x₁x₂, x₂², √2 x₁, √2 x₂]
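A small numeric check of this identity (arbitrary test vectors, assuming numpy):

```python
import numpy as np

def phi(x):
    # Explicit feature map for the quadratic kernel (1 + x.z)^2 in 2-D.
    x1, x2 = x
    return np.array([1, x1**2, np.sqrt(2)*x1*x2, x2**2,
                     np.sqrt(2)*x1, np.sqrt(2)*x2])

xi = np.array([1.0, 2.0])    # arbitrary test vectors
xj = np.array([3.0, -1.0])

k_direct = (1 + xi @ xj) ** 2    # kernel evaluated directly
k_mapped = phi(xi) @ phi(xj)     # inner product in the expanded feature space

print(k_direct, k_mapped)        # both equal: (1 + 1)^2 = 4.0
```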
Kernels
Common kernels:
• Linear: K(x,z) = xᵀz
• Polynomial: K(x,z) = (1 + xᵀz)^d (gives feature conjunctions)
• Radial basis function: K(x,z) = exp(−||x − z||²/(2σ²)) (infinite-dimensional feature space)
Objectives:
• To understand how a time series works and what factors affect a certain variable (or variables) at different points in time.
• Time series analysis provides insights into features of the given dataset that change over time.
• To support predicting future values of the time series variable.
• Assumption: there is one and only one assumption, stationarity, meaning that shifting the origin of time does not affect the statistical properties of the process.
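Since stationarity is the key assumption, here is a hedged sketch of one common check (assumes numpy and statsmodels are installed; the series below is simulated, not real data):

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=200))    # a random walk: non-stationary

# Augmented Dickey-Fuller test: small p-value suggests stationarity.
stat, pvalue = adfuller(series)[:2]
print(pvalue)          # large p-value: cannot reject non-stationarity

diff = np.diff(series)                      # first difference of the series
stat, pvalue = adfuller(diff)[:2]
print(pvalue)          # small p-value: differencing often restores stationarity
```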
How to analyze Time Series?
Classical set
1. A classical set is a collection of distinct objects, for example, the set of students with passing grades.
2. Each individual entity in a set is called a member or an element of the set.
3. A classical set is defined in such a way that the universe of discourse is split into two groups: members and non-members. Hence, in the case of classical sets, no partial membership exists.
4. Let A be a given set. The membership function used to define A is given by:
μ_A(x) = 1 if x ∈ A, and μ_A(x) = 0 if x ∉ A.
Fuzzy Logic: Extracting Fuzzy Models from Data
Fuzzy set:
1. A fuzzy set is a set whose members have degrees of membership between 0 and 1. Fuzzy sets are represented with a tilde character (~). For example, out of all cars present, the number of cars following traffic signals at a particular time will have a membership value in [0, 1].
2. Partial membership exists when a member of one fuzzy set can also be a part of other fuzzy sets in the same universe.
3. The degree of membership or truth is not the same as probability; fuzzy truth represents membership in vaguely defined sets.
4. A fuzzy set Ã in the universe of discourse U can be defined as a set of ordered pairs:
Ã = {(x, μ_Ã(x)) | x ∈ U}, where μ_Ã(x) is the degree of membership of x in Ã.
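A minimal sketch contrasting crisp and fuzzy membership (the "tall" set and its thresholds below are made-up illustrations):

```python
def crisp_member(height_cm):
    # Classical set: membership is 0 or 1, no partial membership.
    return 1 if height_cm >= 180 else 0

def fuzzy_member(height_cm):
    # Fuzzy set: degree of membership in [0, 1], rising from 160 to 190 cm.
    if height_cm <= 160:
        return 0.0
    if height_cm >= 190:
        return 1.0
    return (height_cm - 160) / 30.0

for h in (150, 170, 185, 195):
    print(h, crisp_member(h), round(fuzzy_member(h), 2))
```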