
A Preliminary Idea On Machine Learning


ATAL COURSE ON MACHINE LEARNING

AVIJIT BOSE
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
MCKVIE, LILUAH, HOWRAH-711204
WHAT IS MACHINE LEARNING-
BACKGROUND
• In 1997, IBM's Deep Blue, a supercomputer, defeated world chess
champion Garry Kasparov using AI.
• It was the first time that the name "AI" became widely
popular outside academic domains as well.
• Some concepts of AI were then combined with portions of
inferential statistics and mathematical modeling,
which gave rise to Machine Learning.
• A pioneer in the area of Machine Learning is Tom Mitchell.
• Some of the areas Machine Learning can cover:
classification problems, clustering problems, prediction,
cognitive modeling.
• The next slide covers human learning.
Human Modeling/Learning- Inspiration
for Machine Learning
• Learning under expert guidance: directly from experts.
• Learning guided by knowledge gained from experts: we build
our own notions based on what we have learnt from
experts in the past.
• Learning by self/self-learning: doing a certain job by
ourselves, possibly multiple times, before being successful.
Definition of Machine Learning
• Tom M. Mitchell, mentioned on the first slide, defines machine
learning as: "A computer program is said to learn from experience E
with respect to some class of tasks T and performance measure P if
its performance at tasks in T, as measured by P, improves with
experience E."
• How do machines learn?
(a) Data input
(b) Data abstraction
(c) Generalization
Whatever data comes in as input must first be processed well, which
brings us to the concept of feature engineering. The data is then
passed through the underlying algorithm, the second phase, and finally
the abstracted representation is generalized to form a framework for
making decisions (a minimal sketch of these three phases follows below).
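The sketch below walks through the three phases on a toy problem, assuming scikit-learn and its bundled iris data set (the slides name no library or data; both are illustrative choices): scaling stands in for feature engineering, logistic regression for the underlying algorithm, and held-out accuracy for generalization.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)                         # (a) data input
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = make_pipeline(StandardScaler(),                   # feature engineering
                      LogisticRegression(max_iter=200))   # (b) abstraction
model.fit(X_train, y_train)

# (c) generalization: how well the abstracted model handles unseen data
print("held-out accuracy:", model.score(X_test, y_test))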
Different Types of Learning
• Supervised Learning: a machine predicts the class of
unknown objects based on prior class-related information about
similar objects (predictive learning).
• Unsupervised Learning: a machine finds patterns in unknown
objects by grouping similar objects together (descriptive
learning).
• Reinforcement Learning: a machine learns to act on its own
to achieve a given goal (still largely in the research phase,
though success has been achieved by Google).
Difference Between Classification and
Regression
• Classification: when we try to predict a categorical or
nominal variable, it is known as a classification problem.
• Regression: when we try to predict a real-valued variable, the
problem falls under the category of regression.
When will supervised learning fail?
Supervised learning will fail when the quantity and quality of the
data are not up to the mark.
Image Showing Supervised Learning
Supervised Learning
• Naïve Bayes
• Decision Tree
• K –Nearest Neighbor
• It has been claimed that ML can help save the lives of up to 52%
of patients suffering from breast cancer.
Linear Regression
• The objective is to predict numerical features like real-estate
value, stock price, temperature, etc.
• It follows the statistical least-squares method.
• A straight-line relationship is fitted between the predictor
variable and the target variable.
• In simple linear regression only one predictor variable is used,
whereas in multiple linear regression (MLR) multiple predictor
variables are used.
What does Linear Regression look like?
A typical linear regression:
• Y = β + λX
• Y = target output
• X = predictor variable
• β = intercept, λ = slope
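A small sketch of the least-squares fit using only NumPy; the sample numbers are made up for illustration, and β and λ follow the slide's notation for intercept and slope.

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # predictor variable
Y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])   # target output

# Least-squares estimates for the straight line Y = beta + lam * X
lam = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
beta = Y.mean() - lam * X.mean()

print(f"Y = {beta:.3f} + {lam:.3f} * X")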

Unsupervised Learning can be classified as:
1. Clustering
2. Association Analysis
Clustering
Clustering- Different Types
• 1. Partitioning Methods
• 2. Hierarchical methods
• 3. Density Based Methods
1. Partitioning Methods
• Use a mean or medoid (etc.) to represent the cluster center.
• Adopt a distance-based approach to refine clusters.
• Find mutually exclusive clusters of spherical or nearly spherical shape.
• Effective for data sets of small or medium size (see the sketch below).
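As a concrete instance of a partitioning method, here is a minimal k-means sketch (scikit-learn is an assumed choice; the slide names no algorithm or library). Cluster centers are means, and points are assigned and refined by distance.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two roughly spherical blobs, the shape partitioning methods handle well
data = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
                  rng.normal(4.0, 0.5, size=(50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
print("cluster centers (means):\n", km.cluster_centers_)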
2. Hierarchical Methods
• Create a hierarchical or tree-like structure through decomposition or
merger.
• Use the distance between the nearest or furthest points in neighbouring
clusters as a guideline for refinement.
• Erroneous merges or splits cannot be corrected at subsequent levels.
Clustering
3. Density-Based Methods
• Useful for identifying arbitrarily shaped clusters.
• The guiding principle of cluster creation is the identification of
dense regions of objects in the data space, separated by
low-density regions.
• May filter out outliers (see the sketch below).
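One well-known density-based method is DBSCAN; the sketch below (scikit-learn assumed, toy data made up) shows it finding a ring-shaped cluster and filtering a lone point as an outlier. eps and min_samples define what counts as a dense region.

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
angles = rng.uniform(0.0, 2.0 * np.pi, 100)
ring = np.c_[np.cos(angles), np.sin(angles)]   # arbitrarily shaped cluster
outlier = np.array([[5.0, 5.0]])               # lone point in a low-density region

labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(np.vstack([ring, outlier]))
print("outlier labelled as noise (-1):", labels[-1] == -1)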
Association Analysis
A methodology useful for discovering interesting relationships
hidden in large data sets is known as Association Analysis.
Association Analysis
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example rules:
{Bread} → {Milk}
{Milk, Bread} → {Eggs, Coke}
{Diaper} → {Beer}
{Beer, Bread} → {Milk}
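To make the rules concrete, this sketch computes support and confidence for the {Diaper} → {Beer} rule over the five transactions above (plain Python; the helper functions are my own illustration).

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    # Fraction of transactions containing every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    # Of the transactions containing lhs, the fraction also containing rhs
    return support(lhs | rhs) / support(lhs)

print("support({Diaper, Beer}) =", support({"Diaper", "Beer"}))          # 3/5 = 0.6
print("confidence(Diaper -> Beer) =", confidence({"Diaper"}, {"Beer"}))  # 0.75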
Reinforcement Learning
• Think of a situation as follows
Reinforcement Learning(Contd..)
• As shown on the previous slide, the machine starts learning
on its own; it may or may not succeed, and once it succeeds it
is rewarded based on some reward function.
• This process of completing subtasks by learning from mistakes is
called reinforcement learning (a toy sketch follows below).
• It is still in the research phase; however, companies like Google
have already achieved the target.
• For real-time systems it is risky.
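A toy sketch of that trial-and-error loop: a minimal Q-learning agent walking a five-cell corridor, rewarded only when it reaches the right end. The environment, reward and constants are my own illustrative assumptions, not from the slides.

import random

random.seed(0)
N_STATES, GOAL = 5, 4
q = [[0.0, 0.0] for _ in range(N_STATES)]   # Q[state][action]; 0 = left, 1 = right
alpha, gamma, eps = 0.5, 0.9, 0.2           # learning rate, discount, exploration

for episode in range(200):
    s = 0
    while s != GOAL:
        # Epsilon-greedy: mostly exploit what was learned, sometimes explore
        a = random.randrange(2) if random.random() < eps else max((0, 1), key=lambda x: q[s][x])
        s2 = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
        r = 1.0 if s2 == GOAL else 0.0      # reward arrives only on success
        q[s][a] += alpha * (r + gamma * max(q[s2]) - q[s][a])  # learn from the step
        s = s2

print("learned to move right:", all(q[s][1] > q[s][0] for s in range(GOAL)))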
Parametric vs Non-Parametric

Machine Learning
├─ Supervised
│   ├─ Parametric
│   └─ Non-parametric
└─ Unsupervised
Difference Between Parametric and Non-Parametric

Parametric                                | Non-Parametric
Have a fixed number of parameters         | Do not use a fixed number of parameters
Make some assumptions about the model     | Do not make any assumptions
Faster computation                        | Relatively slower than parametric algorithms
Examples: Linear Regression, Logistic     | Examples: KNN, SVM, Decision Tree
Regression, Linear Discriminant Analysis  |
Bias Vs. Variance
• Bias is how far the predicted value differs from the actual
value.
• Variance occurs when the model performs very well on the
training data but does not perform well on the test data.
What Do Bias and Variance Imply?
High bias means:
• High training error.
• Validation/test error about the same as the training error.

High variance means:
• Low training error.
• High validation error.

• Parametric machine learning models often have high bias and low
variance.
• So what is the solution?
• Decision trees, ensemble learning, bagging, boosting and stacking
(research phase).
Ensemble Learning
• Uses multiple learning algorithms together for the same task.
• It gives better predictions than individual learners.
Why use Ensemble Learning?
1. Better accuracy
2. Higher consistency
3. Reduces bias and variance
Bagging, boosting and stacking are the methods (a bagging sketch
follows below).
Bagging: takes homogeneous weak learners, trains them
independently from each other in parallel, and combines them
following some kind of deterministic averaging process.
Boosting: considers homogeneous weak learners, trains them
sequentially, and combines them with a deterministic strategy.
Stacking: often considers heterogeneous weak learners, trains
them in parallel, and combines them by training a meta-model to
output a prediction based on the different weak models' predictions.
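A hedged sketch of bagging with scikit-learn (the slides name no library or data set; both choices here are illustrative): many decision trees are trained in parallel on bootstrap samples and combined by voting, which typically scores higher and more consistently than a single tree.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

single = DecisionTreeClassifier(random_state=0)
bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                           random_state=0)   # 50 trees on bootstrap samples

print("single tree accuracy:", cross_val_score(single, X, y).mean())
print("bagged trees accuracy:", cross_val_score(bagged, X, y).mean())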
How to Avoid High Bias and Variance
• Bagging is the method whereby repeated data sets are drawn
with replacement (bootstrap samples).
• The ultimate value taken is the mean of the individual sample
means: a mean of means.
[Figure slides: Single Random Forest; Ensemble Learning]
Probabilistic Reasoning & Markov
Decision Process
• In Bayesian statistics we want to learn about the
probability distribution of the parameters of interest given
the data, i.e. the posterior:
P(θ|D) = P(θ) P(D|θ) / P(D)
• P(θ|D) = posterior probability
• P(D|θ) = likelihood
• P(θ) = prior probability
• P(D) = normalizing constant (the evidence)
Need of Probabilistic Reasoning
There are several leading causes of uncertainty:
• Information obtained from unreliable sources.
• Experimental errors.
• Equipment faults.
• Temperature variation.
• Climate change.
Probabilistic Reasoning with Uncertainty
Probabilistic reasoning is a way of knowledge representation where we
apply the concept of probability to indicate the uncertainty in
knowledge. In probabilistic reasoning, we combine probability theory
with logic to handle the uncertainty.
Probabilistic Reasoning
Need of probabilistic reasoning in AI
• When there are unpredictable outcomes.
• When the specifications or possibilities of predicates become too
large to handle.
• When an unknown error occurs during an experiment.
• Prior probability: the prior probability of an event is the
probability computed before observing new information.
• Posterior probability: the probability calculated after all
evidence or information has been taken into account. It is a
combination of the prior probability and the new information:

P(A|B) = P(A∧B) / P(B)

P(A∧B) = joint probability of A and B
P(B) = marginal probability of B.
In this notation we calculate the probability of event A under the
influence of B, meaning that B has already occurred and we need to
calculate the probability of event A.
An Example
In a class, 70% of the students like English and 40% of the students
like both English and Mathematics. What percentage of the students
who like English also like Mathematics?
Solution:
Let A be the event that a student likes Mathematics,
and B the event that a student likes English.
P(A|B) = P(A∧B) / P(B) = 0.4 / 0.7 ≈ 0.57
Hence, 57% of the students who like English also like Mathematics.
Applying Bayes Theorem
• P(Cause|Effects) = P(Effects|Cause) P(Cause) / P(Effects)
• Question: what is the probability that a patient with a stiff neck
has meningitis?
• Given data:
A doctor is aware that the disease meningitis causes a patient to
have a stiff neck 80% of the time. He is also aware of some more
facts, which are given as follows:
• The known probability that a patient has meningitis is 1/30,000.
• The known probability that a patient has a stiff neck is 2%.
Ans: Let a be the proposition that the patient has a stiff neck and
b the proposition that the patient has meningitis. We can then
calculate:
• P(a|b) = 0.8
• P(b) = 1/30,000
• P(a) = 0.02
P(b|a) = P(a|b) P(b) / P(a) = 0.8 × (1/30000) / 0.02 ≈ 0.00133
Hence, we can assume that about 1 in 750 patients with a stiff neck
has meningitis.
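The same calculation as a small Bayes-rule helper (plain Python; the function name is my own):

def posterior(likelihood, prior, evidence):
    # Bayes' theorem: P(b|a) = P(a|b) * P(b) / P(a)
    return likelihood * prior / evidence

p = posterior(likelihood=0.8, prior=1 / 30000, evidence=0.02)
print(p)        # 0.001333..., i.e. about 1 in 750
print(1 / p)    # ~750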
Bayesian Network Graph
Joint Probability Distribution
• If we have variables x1, x2, x3, ..., xn, then the probabilities of
the different combinations of x1, x2, x3, ..., xn are known as the
joint probability distribution.
• P[x1, x2, x3, ..., xn] can be expanded in the following way
(the chain rule):
= P[x1 | x2, x3, ..., xn] P[x2, x3, ..., xn]
= P[x1 | x2, x3, ..., xn] P[x2 | x3, ..., xn] ... P[xn-1 | xn] P[xn]
• In a Bayesian network, for each variable Xi the equation
simplifies to:
P(Xi | Xi-1, ..., X1) = P(Xi | Parents(Xi))
(a small sketch follows below).
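A tiny illustration of that factorization, using a made-up two-node network Rain → WetGrass (the variables and numbers are illustrative assumptions): the joint probability is the product of each variable's probability given only its parents.

p_rain = {True: 0.2, False: 0.8}                    # P(Rain)
p_wet_given_rain = {True: {True: 0.9, False: 0.1},  # P(WetGrass | Rain)
                    False: {True: 0.2, False: 0.8}}

def joint(rain, wet):
    # Chain rule with the network structure: P(rain, wet) = P(rain) * P(wet | rain)
    return p_rain[rain] * p_wet_given_rain[rain][wet]

print(joint(True, True))   # 0.2 * 0.9 = 0.18
# The four joint probabilities form a valid distribution (sum to 1):
print(sum(joint(r, w) for r in (True, False) for w in (True, False)))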
Types of Stochastic Process
• A real stochastic process is a collection of random variables
{Xt, t ∈ T} defined on a probability space.
• Four types of stochastic process:
a) Discrete state and discrete time stochastic process [Markov chain].
S = {0,1,2,...}, T = {0,1,2,3,...}. Ex: discount in motor insurance
depending on road accidents, with S = {0%, 10%, 20%} and T = {0,1,2,...}.
b) Discrete state and continuous time stochastic process [MDP].
S = {0,1,2,...}, T = {t : 0 ≤ t ≤ ∞}. Ex: number of cars parked in the
time interval (0, t), with S = {0,1,2,...}.
c) Continuous state and discrete time stochastic process.
T = {0,1,2,...}, S = {x : 0 ≤ x ≤ ∞}. Ex: share price of an asset at
the close of trading each day.
d) Continuous state and continuous time stochastic process.
T = {t : 0 ≤ t ≤ ∞}, S = {x : 0 ≤ x ≤ ∞}. Ex: value of a stock index
at time t.
Markov Model
• There are four common Markov models used in different
situations, depending on whether every sequential state is
observable or not, and whether the system is to be adjusted on
the basis of observations made.
Properties of Markov Chain
• A Markov chain is a mathematical model that combines
probability and matrices to analyze a stochastic process,
i.e. a sequence of trials satisfying certain conditions.
The chain is named after the Russian mathematician
Andrei Markov (1856-1922).
An Interesting Problem
• An insurance company classifies drivers as low-risk if they are
accident-free for one year. Past records indicate that 98% of
the drivers in the low-risk category (L) will remain in that
category the next year, and 78% of the drivers who are not in
the low-risk category (L') one year will be in the low-risk
category the next year.
Soln: Find the transition matrix P. From the figures above, with rows
giving the current state (L, L') and columns the next state (L, L'):

P = | 0.98  0.02 |
    | 0.78  0.22 |

If 90% of the drivers in the community are in the low-risk category
this year, what is the probability that a driver chosen at random from
the community will be in the low-risk category the next year? The year
after next? (Answers: 0.96 and 0.972, from the matrix products; see
the sketch below.)
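A quick NumPy check of the slide's answers (the matrix layout follows the solution above):

import numpy as np

P = np.array([[0.98, 0.02],    # low-risk drivers: 98% stay low-risk
              [0.78, 0.22]])   # not low-risk: 78% become low-risk
s0 = np.array([0.90, 0.10])    # 90% of drivers start in the low-risk category

s1 = s0 @ P            # distribution next year
s2 = s1 @ P            # distribution the year after next
print(s1[0], s2[0])    # 0.96 and 0.972, as given on the slide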
Stationary Matrix
• When we computed successive state matrices of the previous problem,
we saw that the numbers appeared to approach fixed values.
• If we calculated the 5th, 6th, ..., kth state matrices, we would
find that they approach a limiting matrix of [0.975 0.025]. This
final matrix is called a stationary matrix.
• The stationary matrix S for a Markov chain with transition matrix P
has the property that SP = S. To prove that [0.975 0.025] is the
stationary matrix, we need to show that SP = S (verified in the
sketch below).
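A short NumPy verification of SP = S for the matrix above:

import numpy as np

P = np.array([[0.98, 0.02],
              [0.78, 0.22]])
S = np.array([0.975, 0.025])

print(S @ P)                  # [0.975 0.025] again
print(np.allclose(S @ P, S))  # True: S is the stationary matrix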
Conclusion
• However, a stationary matrix is not always reached. A few
topics could not be covered:
1. ANN: SLP, MLP, etc.
2. SVM, KNN
3. Deep Learning, CNN (research phase)
Machine learning is vast like an ocean, and I know only a little
pond of that ocean. I hope that with my little knowledge I have at
least given some introduction to the topics.
THANK YOU
