
Machine Learning

Introduction:-
Machine learning is a growing technology which enables computers to learn
automatically from past data. Machine learning uses various algorithms
for building mathematical models and making predictions using historical
data or information.
> Humans learn from their past experience using their learning capabilities.
> But a machine works according to the instructions given by us.
> If a human trains the machine, the machine can then do the work much faster
than a human being.

Machine learning is about predicting the future from past experience.


What does it mean to learn?

Eg: Ramu loves listening to music.

Whether Ramu likes or dislikes a song depends on

 Tempo
 Intensity
 Genre
 Etc.



Let us suppose, for instance, that we consider only two features: Tempo and Intensity.

Song A: Fast Tempo, Soaring Intensity


Looking at previous data we can predict likes or dislikes.

Song B: Medium Tempo, Medium Intensity. Machine learning comes into the
picture when the label for such new data is unknown.
We draw a circle around the newly arrived song: suppose it encloses 4 points
from the "like" group and 1 point from the "dislike" group.
Eg: By a majority vote among these nearest neighbours (the k-nearest-neighbour
idea), the song most likely belongs to the "like" group; a short sketch follows.
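The idea above can be sketched in a few lines of Python. This is only an illustration: the tempo/intensity values, the labelled songs, and k = 5 are invented, not data from the text.

# A minimal sketch of predicting like/dislike by voting among the nearest songs.
# The feature values and k below are invented for illustration.
import math

songs = [  # (tempo, intensity, label)
    (9, 8, "like"), (8, 9, "like"), (7, 8, "like"), (8, 7, "like"),
    (2, 3, "dislike"), (3, 2, "dislike"), (2, 2, "dislike"),
]
new_song = (6, 6)   # Song B: medium tempo, medium intensity
k = 5

# Sort known songs by distance to the new song and vote among the k closest.
by_distance = sorted(songs, key=lambda s: math.dist(s[:2], new_song))
votes = [label for _, _, label in by_distance[:k]]
prediction = max(set(votes), key=votes.count)
print(prediction)   # the majority of the nearest songs decides: "like"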
Eg: We want to predict what rating Alice will give to a movie she has not seen,
based on her previous movie ratings.
A movie rating may depend on features such as:
 Drama/story
 Documentary/screenplay
 Language
 Director
 Actors
 Production companies

ML means making an informed guess about an unseen observation of an object,
based on the properties of previously observed objects.
Alice has just begun a course on Machine Learning. At the end of the course, her
teacher Bob wants to know how well she has learned, so he sets an exam.
But what makes a reasonable exam?
 If the exam only repeats exactly what Bob said in the classroom, it tests
memorization (open notes would suffice), not learning.
 If the exam is completely outside the syllabus, it is unfair.
 A reasonable exam tests whether Alice can apply what she learned in the
classroom to new problems.



Generalization is a central concept in Machine Learning.
Machine Learning is a growing technology which enables computers to learn
automatically from past data, improve their performance from experience, and
make predictions without being explicitly programmed.
Machine Learning is a subset of AI, mainly concerned with the development of
algorithms that learn from data and past experience on their own.
The first ML system was developed by Arthur Samuel in 1952: a checkers-playing
program. Samuel played many games against the program and observed that, over
time, the program was able to play better and better.
The term "Machine Learning" was coined by Samuel.

Machine learning enables a machine to automatically learn from data, improve


performance from experiences, and predict things without being explicitly
programmed.
With the help of sample historical data, which is known as training data, machine
learning algorithms build a mathematical model that helps in making predictions
or decisions without being explicitly programmed. Machine learning brings
computer science and statistics together for creating predictive models. Machine
learning constructs or uses algorithms that learn from historical data. The more
information we provide, the higher the performance will be.
A machine has the ability to learn if it can improve its performance by gaining
more data.
How does Machine Learning work?
A Machine Learning system learns from historical data, builds the prediction
models, and whenever it receives new data, predicts the output for it. The
accuracy of predicted output depends upon the amount of data, as the huge amount
of data helps to build a better model which predicts the output more accurately.
Suppose we have a complex problem, where we need to perform some predictions,
so instead of writing a code for it, we just need to feed the data to generic
algorithms, and with the help of these algorithms, the machine builds the logic as
per the data and predicts the output. Machine learning has changed our way of
thinking about the problem. The block diagram below explains the working of a
Machine Learning algorithm:



MACHINE LEARNING DEFINITION:- A computer program is said to learn
from experience E, with respect to some class of tasks T and performance measure
P, if its performance on tasks in T, as measured by P, improves with experience E.

Components of Machine Learning:-

Applications of Machine learning


Machine learning is a buzzword for today's technology, and it is growing very
rapidly day by day. We are using machine learning in our daily life even without
knowing it such as Google Maps, Google assistant, Alexa, etc. Below are some
most trending real-world applications of Machine Learning:



1. Image Recognition:
Image recognition is one of the most common applications of machine learning. It
is used to identify objects, persons, places, digital images, etc. The popular use
case of image recognition and face detection is, Automatic friend tagging
suggestion:
Facebook provides us a feature of auto friend tagging suggestion. Whenever we
upload a photo with our Facebook friends, then we automatically get a tagging
suggestion with name, and the technology behind this is machine learning's face
detection and recognition algorithm.
It is based on the Facebook project named "Deep Face," which is responsible for
face recognition and person identification in the picture.
2. Speech Recognition
While using Google, we get an option of "Search by voice," it comes under speech
recognition, and it's a popular application of machine learning.



Speech recognition is a process of converting voice instructions into text, and it is
also known as "Speech to text", or "Computer speech recognition." At present,
machine learning algorithms are widely used by various applications of speech
recognition. Google assistant, Siri, Cortana, and Alexa are using speech
recognition technology to follow the voice instructions.
3. Traffic prediction:
If we want to visit a new place, we take help of Google Maps, which shows us the
correct path with the shortest route and predicts the traffic conditions.
It predicts the traffic conditions such as whether traffic is cleared, slow-moving, or
heavily congested with the help of two ways:
o Real-time location of the vehicle from the Google Maps app and sensors
o Average time taken on past days at the same time.
Everyone who uses Google Maps helps make the app better. It takes information
from the user and sends it back to its database to improve performance.
4. Product recommendations:
Machine learning is widely used by various e-commerce and entertainment
companies such as Amazon, Netflix, etc., for product recommendation to the user.
Whenever we search for some product on Amazon, we start getting advertisements
for the same product while surfing the internet in the same browser, and this is
because of machine learning.
Google understands the user interest using various machine learning algorithms
and suggests the product as per customer interest.
Similarly, when we use Netflix, we find recommendations for entertainment
series, movies, etc., and this is also done with the help of machine learning.
5. Self-driving cars:
One of the most exciting applications of machine learning is self-driving cars.
Machine learning plays a significant role in self-driving cars. Tesla, a well-known
car manufacturer, is working on self-driving cars. It uses an unsupervised learning
method to train the car models to detect people and objects while driving.
6. Email Spam and Malware Filtering:
Whenever we receive a new email, it is filtered automatically as important, normal,
and spam. We always receive an important mail in our inbox with the important
symbol and spam emails in our spam box, and the technology behind this is
Machine learning. Below are some spam filters used by Gmail:
o Content Filter
o Header filter
o General blacklists filter
o Rules-based filters



o Permission filters
Some machine learning algorithms such as Multi-Layer Perceptron, Decision
tree, and Naïve Bayes classifier are used for email spam filtering and malware
detection.
7. Virtual Personal Assistant:
We have various virtual personal assistants such as Google
assistant, Alexa, Cortana, Siri. As the name suggests, they help us in finding the
information using our voice instruction. These assistants can help us in various
ways just by our voice instructions such as Play music, call someone, Open an
email, Scheduling an appointment, etc.
These virtual assistants use machine learning algorithms as an important part.
These assistants record our voice instructions, send them to a server in the cloud,
decode them using ML algorithms, and act accordingly.
8. Online Fraud Detection:
Machine learning is making our online transactions safe and secure by detecting
fraudulent transactions. Whenever we perform an online transaction, there are
various ways a fraudulent transaction can take place, such as fake accounts, fake
IDs, and money stolen in the middle of a transaction. So to detect
this, Feed Forward Neural network helps us by checking whether it is a genuine
transaction or a fraud transaction.
For each genuine transaction, the output is converted into some hash values, and
these values become the input for the next round. For each genuine transaction,
there is a specific pattern which changes for a fraudulent transaction; hence the
network detects it and makes our online transactions more secure.
9. Stock Market trading:
Machine learning is widely used in stock market trading. In the stock market, there
is always a risk of ups and downs in share prices, so machine learning's long
short-term memory (LSTM) neural network is used for the prediction of stock
market trends.
10. Medical Diagnosis:
In medical science, machine learning is used for disease diagnosis. With this,
medical technology is growing very fast and is able to build 3D models that can
predict the exact position of lesions in the brain.
It helps in finding brain tumors and other brain-related diseases easily.
11. Automatic Language Translation:
Nowadays, if we visit a new place and are not aware of the language, it is not a
problem at all, as machine learning helps us by converting the text into languages
we know. Google's GNMT (Google Neural Machine Translation) provides this
feature: a neural machine translation model that translates text into our familiar
language, and this is called automatic translation.
The technology behind the automatic translation is a sequence to sequence learning
algorithm, which is used with image recognition and translates the text from one
language to another language.

Types of Machine Learning:-


We have the following types of machine learning:
1. Supervised Learning
2. Unsupervised Learning
3. Reinforcement Learning
Supervised Learning:- Supervised learning is a type of machine learning in
which machines are trained using well "labelled" training data, and on the basis of
that data, machines predict the output.
The labelled data means some input data is already tagged with the correct output.
The aim of a supervised learning algorithm is to find a mapping function to map
the input variable(x) with the output variable(y).

Given
 A set of input features x1, x2, x3, …, xn
 A target feature y
 A set of training examples where the value of the target feature is given for
each example
 A new example where only the values of the input features are given
Predict the value of the target feature for the new example.
 Classification: when Y is discrete
Eg:- Given a set of images of animals, we have to predict what kind of animal
each one is.



 Regression: when Y is continuous
Eg:- Based on the features of a car we want to predict the price of the car. (A
minimal code sketch of both tasks follows.)
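A minimal sketch of the two supervised settings, assuming scikit-learn is installed; the tiny feature matrices, labels and prices below are invented purely for illustration, not data from the text.

# Classification: Y is discrete (e.g. the kind of animal).
# Regression: Y is continuous (e.g. the price of a car).
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

# Classification: features -> discrete label
X_cls = [[4, 1], [4, 0], [2, 1], [2, 0]]        # e.g. [legs, has_long_tail]
y_cls = ["dog", "cat", "bird", "bird"]
clf = DecisionTreeClassifier().fit(X_cls, y_cls)
print(clf.predict([[4, 1]]))                    # -> a discrete class label

# Regression: features -> continuous value
X_reg = [[2010, 60000], [2015, 30000], [2020, 10000]]   # e.g. [year, km driven]
y_reg = [3.0, 6.5, 12.0]                                 # price (made-up units)
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict([[2018, 20000]]))                      # -> a continuous value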

Unsupervised Learning:- Unsupervised learning is a type of machine learning in


which models are trained using an unlabeled dataset and are allowed to act on that
data without any supervision.
In unsupervised learning, models are not supervised using a labelled training
dataset. Instead, the models themselves find the hidden patterns and insights in the
given data.
It can be compared to the learning which takes place in the human brain while
learning new things. Clustering and Association are unsupervised learning
mechanisms.



Clustering:- grouping the data points based on their similarities.
Example: Suppose the unsupervised learning algorithm is given an input dataset
containing images of different types of cats and dogs. The algorithm is never
trained upon the given dataset, which means it does not have any idea about the
features of the dataset.
The task of the unsupervised learning algorithm is to identify the image features on
their own.
Unsupervised learning algorithm will perform this task by clustering the image
dataset into the groups according to similarities between images.

Reinforcement Learning:- Reinforcement learning is an area of Machine Learning.
It is employed by various software and machines to find the best possible
behavior or path it should take in a specific situation. Reinforcement learning
differs from supervised learning in a way that in supervised learning the training
data has the answer key with it so the model is trained with the correct answer
itself whereas in reinforcement learning, there is no answer but the reinforcement
agent decides what to do to perform the given task.
In the absence of a training dataset, it is bound to learn from its experience. 

We have an Agent acting in an Environment, and we have to figure out what
action the agent must take, based on the rewards and penalties it receives.
St is the current state
Rt is the current reward
Based on the current state and reward, the agent takes action At in the environment;
the state is then updated to St+1 and the reward becomes Rt+1.

The image above shows a robot, a diamond, and fire. The goal of the robot is to
get the reward, that is the diamond, and avoid the hurdles, that is the fire.
The robot learns by trying all the possible paths and then choosing the path
which gives it the reward with the fewest hurdles.
Each right step gives the robot a reward and each wrong step subtracts from
the robot's reward.
The total reward is calculated when it reaches the final reward, that is the
diamond. (A minimal sketch of this interaction loop follows.)
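A minimal sketch of the state/reward/action loop described above. The one-dimensional "corridor" environment (move left or right, +1 reward for reaching the diamond at cell 4) and the random-acting agent are invented placeholders, not part of the text.

# Agent-environment loop: observe S_t and R_t, choose A_t, get S_{t+1}, R_{t+1}.
import random

def step(state, action):                    # hypothetical toy environment
    next_state = max(0, min(4, state + action))
    reward = 1 if next_state == 4 else 0    # the "diamond" sits at cell 4
    return next_state, reward

state, total_reward = 0, 0
for t in range(20):
    action = random.choice([-1, +1])        # a very naive agent: act randomly
    state, reward = step(state, action)     # environment returns S_{t+1}, R_{t+1}
    total_reward += reward
print("total reward collected:", total_reward)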
Machine learning Life cycle
Machine learning has given the computer systems the abilities to automatically
learn without being explicitly programmed. So, it can be described using the life



cycle of machine learning. Machine learning life cycle is a cyclic process to build
an efficient machine learning project. The main purpose of the life cycle is to find
a solution to the problem
Machine learning life cycle involves seven major steps, which are given below:
o Gathering Data
o Data preparation
o Data Wrangling
o Analyse Data
o Train the model
o Test the model
o Deployment

The most important thing in the complete process is to understand the problem and
to know the purpose of the problem. Therefore, before starting the life cycle, we
need to understand the problem because the good result depends on the better
understanding of the problem.
In the complete life cycle process, to solve a problem, we create a machine
learning system called "model", and this model is created by providing "training".
But to train a model, we need data; hence, the life cycle starts by collecting data.



1. Gathering Data:
Data Gathering is the first step of the machine learning life cycle. The goal of this
step is to identify and obtain all data-related problems.
In this step, we need to identify the different data sources, as data can be collected
from various sources such as files, database, internet, or mobile devices. It is one
of the most important steps of the life cycle. The quantity and quality of the
collected data will determine the efficiency of the output. The more data there is,
the more accurate the prediction will be.
This step includes the below tasks:
o Identify various data sources
o Collect data
o Integrate the data obtained from different sources
By performing the above task, we get a coherent set of data, also called as
a dataset. It will be used in further steps.

2. Data preparation
After collecting the data, we need to prepare it for further steps. Data preparation is
a step where we put our data into a suitable place and prepare it to use in our
machine learning training.
In this step, first, we put all data together, and then randomize the ordering of data.
This step can be further divided into two processes:
o Data exploration:
It is used to understand the nature of data that we have to work with. We
need to understand the characteristics, format, and quality of data.
A better understanding of data leads to an effective outcome. In this, we find
Correlations, general trends, and outliers.
o Data pre-processing:
Now the next step is preprocessing of data for its analysis.

3. Data Wrangling
Data wrangling is the process of cleaning and converting raw data into a usable
format. It is the process of cleaning the data, selecting the variable to use, and
transforming the data in a proper format to make it more suitable for analysis in the
next step. It is one of the most important steps of the complete process. Cleaning of
data is required to address the quality issues.
It is not necessary that data we have collected is always of our use as some of the
data may not be useful. In real-world applications, collected data may have various
issues, including:
o Missing Values
o Duplicate data



o Invalid data
o Noise
So, we use various filtering techniques to clean the data.
It is mandatory to detect and remove the above issues because it can negatively
affect the quality of the outcome.

4. Data Analysis
Now the cleaned and prepared data is passed on to the analysis step. This step
involves:
o Selection of analytical techniques
o Building models
o Review the result
The aim of this step is to build a machine learning model to analyze the data using
various analytical techniques and review the outcome. It starts with the
determination of the type of the problems, where we select the machine learning
techniques such as Classification, Regression, Cluster analysis, Association, etc.
then build the model using prepared data, and evaluate the model.
Hence, in this step, we take the data and use machine learning algorithms to build
the model.

5. Train Model
The next step is to train the model. In this step, we train our model to improve its
performance and obtain a better outcome for the problem.
We use datasets to train the model using various machine learning algorithms.
Training a model is required so that it can understand the various patterns, rules,
and, features.

6. Test Model
Once our machine learning model has been trained on a given dataset, then we test
the model. In this step, we check for the accuracy of our model by providing a test
dataset to it.
Testing the model determines the percentage accuracy of the model as per the
requirement of project or problem.

7. Deployment
The last step of machine learning life cycle is deployment, where we deploy the
model in the real-world system.
If the above-prepared model is producing an accurate result as per our requirement
with acceptable speed, then we deploy the model in the real system. But before
deploying the project, we will check whether it is improving its performance using
available data or not. The deployment phase is similar to making the final report
for a project.

Some Canonical Learning Problems :-


There are a large number of typical inductive learning problems.
The primary difference between them is in what type of thing they’re trying to
predict. Here are some examples:
Regression: trying to predict a real value. For instance, predict the value of a stock
tomorrow given its past performance. Or predict Alice’s score on the machine
learning final exam based on her homework scores.
Classification:- the process of dividing the dataset into different categories.
We have Binary Classification and Multiclass Classification.
Binary Classification: trying to predict a simple yes/no response. For instance,
predict whether Alice will enjoy a course or not. Or predict whether a user review
of the newest Apple product is positive or negative about the product.
Multiclass Classification: trying to put an example into one of a number of
classes. For instance, predict whether a news story is about entertainment, sports,
politics, religion, etc. Or predict whether a CS course is Systems, Theory, AI or
Other.
Ranking: trying to put a set of objects in order of relevance. For instance,
predicting what order to put web pages in, in response to a user query. Or predict
Alice’s ranked preferences over courses she hasn’t taken
Decision Tree:-
The decision tree is a classic and natural model of learning.
It is closely related to the fundamental computer science notion of "divide and
conquer."
Decision trees can be applied to many learning problems; as an illustration,
consider a Loan Sanction System:



Decision Tree Terminologies
Root Node: Root node is from where the decision tree starts. It represents the
entire dataset, which further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be
segregated further after getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into
sub-nodes according to the given conditions.
Decision Node: When a sub-node splits into further sub-nodes, then it is called the
decision node.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other
nodes are called the child nodes.
Training Set: refers to the training data

Testing Set: refers to the testing data



o In a Decision tree, there are two nodes, which are the Decision
Node and Leaf Node. 
o Decision nodes are used to make any decision and have multiple branches,
whereas Leaf nodes are the output of those decisions and do not contain any
further branches.
o Example: Suppose there is a candidate who has a job offer and wants to
decide whether he should accept the offer or Not. So, to solve this problem,
the decision tree starts with the root node (Salary attribute ). The root node
splits further into the next decision node (distance from the office) and one
leaf node based on the corresponding labels.
o The next decision node further gets split into one decision node (Cab
facility) and one leaf node.
o Finally, the decision node splits into two leaf nodes (Accepted offers and
Declined offer). Consider the below diagram:



The Iterative Dichotomiser 3 (ID3) Algorithm
ID3 stands for Iterative Dichotomiser 3 and is named such because the algorithm
iteratively (repeatedly) dichotomizes(divides) features into two or more groups at
each step.
Invented by Ross Quinlan, ID3 uses a top-down greedy approach to build a
decision tree. In simple words, the top-down approach means that we start building
the tree from the top and the greedy approach means that at each iteration we select
the best feature at the present moment to create a node.
The ID3(Examples, Target_Attribute, Attributes) procedure can be written as:
1. Create a Root node for the tree.
2. If all Examples are positive, return the single-node tree Root with label = +.
3. If all Examples are negative, return the single-node tree Root with label = −.
4. If Attributes is empty, return the single-node tree Root with label = the most
common value of Target_Attribute in Examples.
5. Otherwise begin:
6. A ← the attribute in Attributes that best classifies Examples (highest
information gain).
7. The decision attribute for Root = A.
8. For each possible value vi of A:
a. Add a new tree branch below Root, corresponding to the test A = vi.
b. Let Examples(vi) be the subset of Examples that have value vi for A.
c. If Examples(vi) is empty,
d. then below this new branch add a leaf node with label = most
common target value in the examples,
e. else below this new branch add the subtree ID3(Examples(vi),
Target_Attribute, Attributes – {A}).
9. End for
10. End
11. Return Root

This algorithm follows a greedy, top-down, recursive approach.
It uses Information Gain, computed from Entropy (H), as its splitting criterion.
The goal when building a decision tree is to keep the depth small; the attributes
with higher information gain are placed nearer to the root node.


Entropy:-
Entropy is a measure of the amount of uncertainty in the dataset S.
Mathematical Representation of Entropy is shown here -
Entropy=H(S)= -p(y)log2p(y)-p(n)log2p(n)

P(y) =proportion of +ve samples in S


P(n)=proportion of –ve samples in S
Case1: If collection S has equal no of +ve and –ve training examples
i.e. p(y)=p(n) Entropy=1
Eg1:-
Y 2
N 2



P(y)=2/4=0.5 p(n)=2/4=0.5
Entropy= -0.5log20.5-0.5log20.5
= -0.5(log20.5+log20.5)=1

Case2: if all members of S belong to same class


p(y)=1 and p(n)=0 or p(y)=0 and p(n)=1
Entropy=0
Entropy 0 means all are of same class
Eg2:-
Y 0
N 2
P(y)=0/2=0 p(n)=2/2=1
Entropy= -0log20-1log21
= 0
Eg3
Y 2
N 4

P(y)=2/6 p(n)=4/6
Entropy= -2/6log22/6-4/6log24/6
= 0.92

Information Gain IG(S, A) tells us how much uncertainty in S was reduced after
splitting set S on attribute A:
IG(S, A) = H(S) − Σx p(x)·H(x)
where
 H(S) – the entropy of set S.
 p(x) – the proportion of elements of S having value x for attribute A (the size
of that subset divided by the size of S).
 H(x) – the entropy of the subset of S having value x for attribute A.
(A small Python sketch of these calculations is given after the steps below.)
The steps in ID3 algorithm are as follows:
1. Calculate entropy for dataset.



2. For each attribute/feature.
2.1. Calculate entropy for all its categorical values.
2.2. Calculate information gain for the feature.
3. Find the feature with maximum information gain.
4. Repeat it until we get the desired tree.
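The two formulas above are easy to turn into code. Below is a minimal sketch in plain Python; the Outlook counts are taken from the worked example that follows, so the printed gain should match the 0.247 computed there.

# Entropy H(S) and Information Gain IG(S, A) for binary (yes/no) labels.
from math import log2

def entropy(yes, no):
    # H(S) = -p(y)*log2 p(y) - p(n)*log2 p(n); 0*log2(0) is treated as 0.
    total = yes + no
    h = 0.0
    for count in (yes, no):
        p = count / total
        if p > 0:
            h -= p * log2(p)
    return h

def information_gain(parent_counts, child_counts):
    # IG = H(S) - sum over values x of p(x)*H(x); child_counts is a list of (yes, no).
    total = sum(parent_counts)
    gain = entropy(*parent_counts)
    for yes, no in child_counts:
        gain -= (yes + no) / total * entropy(yes, no)
    return gain

# Outlook in the weather example: Sunny (2 yes, 3 no), Rain (3, 2), Overcast (4, 0)
print(round(entropy(9, 5), 3))                                       # 0.94
print(round(information_gain((9, 5), [(2, 3), (3, 2), (4, 0)]), 3))  # ~0.247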
Eg:- Consider the following 14 training examples (the weather / play-tennis
dataset), listed here by their Outlook value and label, where S = Sunny,
O = Overcast, R = Rain, y = yes and n = no:
S1-n1, S2-n2, O1-y1, R1-y2, R2-y3, R3-n3, O2-y4, S3-n4, S4-y5, R4-y6, S5-y7,
O3-y8, O4-y9, R5-n5

Here, the dataset has binary classes (yes and no), where 9 out of 14 are "yes" and
5 out of 14 are "no".
The complete entropy of the dataset is:

Complete dataset: Y = 9, N = 5, Total = 14

H(S) = - p(yes) * log2(p(yes)) - p(no) * log2(p(no))
= - (9/14) * log2(9/14) - (5/14) * log2(5/14)
= 0.41 + 0.53
= 0.94

For each attribute of the dataset, let's follow the step-2 of pseudocode : -
First Attribute – Outlook
Outlook    Y   N   Total
Sunny      2   3   5
Rain       3   2   5
Overcast   4   0   4

Categorical values - sunny, overcast and rain


H(Outlook=sunny) = -(2/5)*log(2/5)-(3/5)*log(3/5) =0.971
H(Outlook=rain) = -(3/5)*log(3/5)-(2/5)*log(2/5) =0.971
H(Outlook=overcast) = -(4/4)*log(4/4)-0 = 0
Average Entropy Information for Outlook -
I(Outlook) = p(sunny) * H(Outlook=sunny) + p(rain) * H(Outlook=rain) +
p(overcast) * H(Outlook=overcast)
= (5/14)*0.971 + (5/14)*0.971 + (4/14)*0
= 0.693
Information Gain = H(S) - I(Outlook)
= 0.94 - 0.693
= 0.247

Second Attribute – Temperature


Temperature   Y   N   Total
Hot           2   2   4
Cool          3   1   4
Mild          4   2   6

Categorical values - hot, mild, cool


H(Temperature=hot) = -(2/4)*log2(2/4)-(2/4)*log2(2/4) = 1
H(Temperature=cool) = -(3/4)*log2(3/4)-(1/4)*log2(1/4) = 0.811
H(Temperature=mild) = -(4/6)*log2(4/6)-(2/6)*log2(2/6) = 0.9179
Average Entropy Information for Temperature -
I(Temperature) = p(hot)*H(Temperature=hot) +
p(mild)*H(Temperature=mild) + p(cool)*H(Temperature=cool)
= (4/14)*1 + (6/14)*0.9179 + (4/14)*0.811
= 0.9108
Information Gain = H(S) - I(Temperature)
= 0.94 - 0.9108 = 0.0292

Third Attribute – Humidity



Categorical values - high, normal

Humidity Y N Total
High 3 4 7
Normal 6 1 7

H(Humidity=high) = -(3/7)*log2(3/7)-(4/7)*log2(4/7) = 0.983


H(Humidity=normal) = -(6/7)*log2(6/7)-(1/7)*log2(1/7) = 0.591

Average Entropy Information for Humidity -


I(Humidity) = p(high)*H(Humidity=high) + p(normal)*H(Humidity=normal)
= (7/14)*0.983 + (7/14)*0.591
= 0.787
Information Gain = H(S) - I(Humidity)
= 0.94 - 0.787
= 0.153

Fourth Attribute – Wind


Categorical values - weak, strong

Wind Y N Total
Weak 6 2 8
Strong 3 3 6

H(Wind=weak) = -(6/8)*log(6/8)-(2/8)*log(2/8) = 0.811


H(Wind=strong) = -(3/6)*log(3/6)-(3/6)*log(3/6) = 1

Average Entropy Information for Wind -


I(Wind) = p(weak)*H(Wind=weak) + p(strong)*H(Wind=strong)
= (8/14)*0.811 + (6/14)*1
= 0.892

Information Gain = H(S) - I(Wind)


= 0.94 - 0.892
= 0.048



Attribute      IG
Outlook        0.247
Temperature    0.0292
Humidity       0.153
Wind           0.048

Here, the attribute with maximum information gain is Outlook. So, the decision
tree built so far -

Here, when Outlook == overcast, it is of pure class(Yes).


Now, we have to repeat the same procedure for the rows with Outlook value
Sunny, and then for the rows with Outlook value Rain.

Entropy(Sunny)=H(S=Sunny)= -p(y)log2p(y)-p(n)log2p(n)
= - (2/5) * log2(2/5) - (3/5) * log2(3/5)



= 0.971
First Attribute - Temperature
Categorical values - hot, mild, cool

Temperature Y N Total

Sunny Hot 0 2 2
Mild 1 1 2
Cool 1 0 1

H(Sunny, Temperature=hot) = -0-(2/2)*log(2/2) = 0


H(Sunny, Temperature=cool) = -(1)*log(1)- 0 = 0
H(Sunny, Temperature=mild) = -(1/2)*log(1/2)-(1/2)*log(1/2) = 1
Average Entropy Information for Temperature -
I(Sunny, Temperature) = p(Sunny, hot)*H(Sunny, Temperature=hot) +
p(Sunny, mild) * H(Sunny, Temperature=mild) + p(Sunny, cool)*H(Sunny,
Temperature=cool)
= (2/5)*0 + (2/5)*1 + (1/5)*0
= 0.4
Information Gain(Sunny,Temparature) = H(Sunny) - I(Sunny, Temperature)
= 0.971 - 0.4
= 0.571

Second Attribute - Humidity


Categorical values - high, normal

Humidity Y N Total

Sunny High 0 3 3
Normal 2 0 2

H(Sunny, Humidity=high) = - 0 - (3/3)*log(3/3) = 0


H(Sunny, Humidity=normal) = -(2/2)*log(2/2)-0 = 0

Average Entropy Information for Humidity -


I(Sunny, Humidity) = p(Sunny, high)*H(Sunny, Humidity=high) + p(Sunny,
normal)*H(Sunny, Humidity=normal)
= (3/5)*0 + (2/5)*0



=0

Information Gain(sunny,humidity) = H(Sunny) - I(Sunny, Humidity)


= 0.971 - 0
= 0.971
Third Attribute - Wind
Categorical values - weak, strong
wind Y N Total

Sunny Weak 1 2 3
Strong 1 1 2

H(Sunny, Wind=weak) = -(1/3)*log2(1/3)-(2/3)*log2(2/3) = 0.918


H(Sunny, Wind=strong) = -(1/2)*log2(1/2)-(1/2)*log2(1/2) = 1

Average Entropy Information for Wind -


I(Sunny, Wind) = p(Sunny, weak)*H(Sunny, Wind=weak) + p(Sunny,
strong)*H(Sunny, Wind=strong)
= (3/5)*0.918 + (2/5)*1
= 0.9508

Information Gain(sunny,wind) = H(Sunny) - I(Sunny, Wind)


= 0.971 - 0.9508

= 0.0202

Sunny,categorical values IG
Sunny, temperature 0.571
Sunny,humidity 0.971
Sunny,wind 0.0202

Here, the attribute with maximum information gain is Humidity. So, the
decision tree built so far –



Here, when Outlook = Sunny and Humidity = High, it is a pure class of
category "no". And When Outlook = Sunny and Humidity = Normal, it is
again a pure class of category "yes". Therefore, we don't need to do further
calculations.

Complete entropy of Rain is –

Rain Y N Total
3 2 5

H(S) = - p(yes) * log2(p(yes)) - p(no) * log2(p(no))


= - (3/5) * log2(3/5) - (2/5) * log2(2/5)
= 0.971



First Attribute – Temperature
Categorical values - mild, cool
Rain Properties Y N Total
Mild 2 1 3
Cool 1 1 2

H(Rain, Temperature=cool) = -(1/2)*log(1/2)- (1/2)*log(1/2) = 1


H(Rain, Temperature=mild) = -(2/3)*log(2/3)-(1/3)*log(1/3) = 0.918
Average Entropy Information for Temperature -
I(Rain, Temperature) = p(Rain, mild)*H(Rain, Temperature=mild) + p(Rain,
cool)*H(Rain, Temperature=cool)
= (3/5)*0.918 +(2/5)*1
= 0.9508

Information Gain = H(Rain) - I(Rain, Temperature)


= 0.971 - 0.9508
= 0.0202
Second Attribute - Wind
Categorical values - weak, strong
Rain Properties Y N Total
Weak 3 0 3
Strong 0 2 2

H(Wind=weak) = -(3/3)*log(3/3)-0 = 0
H(Wind=strong) = 0-(2/2)*log(2/2) = 0

Average Entropy Information for Wind -


I(Wind) = p(Rain, weak)*H(Rain, Wind=weak) + p(Rain, strong)*H(Rain,
Wind=strong)
= (3/5)*0 + (2/5)*0
=0

Information Gain = H(Rain) - I(Rain, Wind)


= 0.971 - 0
= 0.971
Rain, categorical values   IG
Rain, temperature          0.0202
Rain, wind                 0.971



Here, the attribute with maximum information gain is Wind. So, the decision
tree built so far -

Here, when Outlook = Rain and Wind = Strong, it is a pure class of category
"no". And When Outlook = Rain and Wind = Weak, it is again a pure class of
category "yes".
And this is our final desired tree for the given dataset.
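The whole construction above can be reproduced with a short recursive implementation. The sketch below is plain Python, written only to mirror the worked example: the 14 rows are the same weather data used in the calculations, and on this data the sketch builds the same tree (Outlook at the root, Humidity under Sunny, Wind under Rain).

# A compact ID3 sketch on the weather ("play tennis") data used above.
from math import log2
from collections import Counter

# rows: (Outlook, Temperature, Humidity, Wind, Play?)
DATA = [
    ("Sunny","Hot","High","Weak","No"),      ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"),  ("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"),   ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"), ("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"),  ("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"),("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"),("Rain","Mild","High","Strong","No"),
]
ATTRS = {"Outlook": 0, "Temperature": 1, "Humidity": 2, "Wind": 3}

def entropy(rows):
    counts = Counter(r[-1] for r in rows)
    return -sum(c/len(rows) * log2(c/len(rows)) for c in counts.values())

def info_gain(rows, attr):
    col = ATTRS[attr]
    gain = entropy(rows)
    for value in set(r[col] for r in rows):
        subset = [r for r in rows if r[col] == value]
        gain -= len(subset)/len(rows) * entropy(subset)
    return gain

def id3(rows, attrs):
    labels = [r[-1] for r in rows]
    if len(set(labels)) == 1:                  # pure node -> leaf
        return labels[0]
    if not attrs:                              # no attributes left -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, a))
    tree = {best: {}}
    for value in set(r[ATTRS[best]] for r in rows):
        subset = [r for r in rows if r[ATTRS[best]] == value]
        tree[best][value] = id3(subset, [a for a in attrs if a != best])
    return tree

print(id3(DATA, list(ATTRS)))
# Expected structure (key order may differ):
# {'Outlook': {'Overcast': 'Yes',
#              'Sunny': {'Humidity': {'High': 'No', 'Normal': 'Yes'}},
#              'Rain':  {'Wind': {'Weak': 'Yes', 'Strong': 'No'}}}}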

Characteristics of the ID3 algorithm

The characteristics of the ID3 algorithm are as follows:
 ID3 uses a greedy approach, so it does not guarantee an optimal solution; it
can get stuck in local optima.
 ID3 can overfit the training data (to avoid overfitting, smaller decision trees
should be preferred over larger ones).
 The algorithm usually produces small trees, but it does not always produce
the smallest possible tree.
 ID3 is harder to use on continuous data (if the values of a given attribute are
continuous, there are many more places to split the data on that attribute, and
searching for the best value to split on can be time consuming).

Formalizing the Learning Problem:-


There are several issues that we must take into account when formalizing the
notion of learning.
The performance of the learning algorithm should be measured on unseen “test”
data.



The way in which we measure performance should depend on the problem we are
trying to solve.
We have a dataset of observations S = {(x1, y1), …, (xn, yn)} where xi is a feature
vector and yi is a label, and we wish to learn how to infer the value of yi given xi.
For example, xi can be a vector of specific medical measurements and test results
(such as blood glucose level and body mass index) of a patient, and yi is whether
that patient is diabetic; we wish to learn how to diagnose diabetes given the set of
medical test results.
We can formalize this fact by saying that the values of xi and yi are realizations
of two random variables  X and Y with probability distributions PX and PY
respectively.
We know that there are some rules on the values of X and Y that we expect any
realization of them to follow. In the diabetes example, we know that a value of a
blood glucose test (a component of the x vector) cannot be negative, so it belongs
to a space of positive numbers. We also know that value of the label can either be
0 (non-diabetic) or 1 (diabetic), so it belongs to a space containing only 0 and 1.
These kinds of rules define what we formally call a space. We say that X takes
values drawn from the input space X, and Y from the output space Y.
Loss function:- A function which tells us how "bad" a system's prediction is in
comparison to the ground truth. In particular, if y is the ground truth and ŷ is the
system's prediction, then l(y, ŷ) is a measure of error.
For three of the canonical tasks discussed above, we might use the following loss
functions:
Regression: squared loss l(y, ŷ) = (y − ŷ)², or
absolute loss l(y, ŷ) = |y − ŷ|.
Binary Classification: zero/one loss:
l(y, ŷ) = 0 if y = ŷ (the loss is zero when the prediction is correct), otherwise
l(y, ŷ) = 1 (the prediction is wrong).
Multiclass Classification: also zero/one loss.
Whenever we define a loss function, we need to consider where the data comes
from. The probability distribution over input/output pairs is called the Data
Generating Distribution.
If we write x for the input and y for the output, then D is a distribution over (x, y)
pairs.
If D is a finite discrete distribution defined by a finite data set
{(x1, y1), (x2, y2), …, (xn, yn)}, where each pair has equal weight (probability 1/n),
then the average (expected) loss is formally given as

ε = (1/n) · Σ (i = 1 to n) l(yi, ŷi)

(A minimal sketch of these loss functions in Python follows.)
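These losses and the averaged error are one-liners in Python; a minimal sketch, where the example labels and predictions are made up for illustration.

# Loss functions and the averaged error over a finite data set.
def squared_loss(y, y_hat):       # regression
    return (y - y_hat) ** 2

def absolute_loss(y, y_hat):      # regression
    return abs(y - y_hat)

def zero_one_loss(y, y_hat):      # binary / multiclass classification
    return 0 if y == y_hat else 1

def average_loss(ys, y_hats, loss):
    # epsilon = (1/n) * sum_i loss(y_i, yhat_i) over the n pairs
    return sum(loss(y, y_hat) for y, y_hat in zip(ys, y_hats)) / len(ys)

# invented example: three labels and three predictions
print(average_loss([1, 0, 1], [1, 1, 1], zero_one_loss))   # 1/3 of predictions wrong
print(average_loss([2.0, 3.0], [2.5, 3.0], squared_loss))  # 0.125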



Limits of Learning:-
Not everything is learnable.
Data Generating Distributions:-
The probability distribution D over input/output pairs (x, y) ∈ X×Y.
Let D be the distribution. Assume we have a Python function computeD that takes
two inputs x and y, and returns the probability of that (x, y) pair under D.
We can define the Bayes optimal classifier as the classifier that, for any test input
x̂, returns the ŷ that maximizes D(x̂, ŷ), or, more formally:
f_BO(x̂) = arg max ŷ∈Y D(x̂, ŷ)
This classifier is optimal in one specific sense: of all possible classifiers, it
achieves the smallest zero/one error.
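Given such a computeD function, the Bayes optimal classifier is a one-line argmax. A minimal sketch; the two-value toy distribution below is invented purely for illustration.

# f_BO(x) = argmax over y of D(x, y), assuming computeD(x, y) returns the
# probability of the pair (x, y) under the data generating distribution D.
def bayes_optimal(x, labels, computeD):
    return max(labels, key=lambda y: computeD(x, y))

# Invented toy distribution over x in {0, 1} and y in {0, 1}
toy_D = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
computeD = lambda x, y: toy_D[(x, y)]

print(bayes_optimal(0, [0, 1], computeD))   # 0, since D(0,0) > D(0,1)
print(bayes_optimal(1, [0, 1], computeD))   # 1, since D(1,1) > D(1,0)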

Theorem 1 (Bayes Optimal Classifier):-


The Bayes Optimal Classifier f BO achieves minimal zero/one error of any
deterministic classifier.
This theorem assumes that we are comparing against deterministic classifiers.
You can actually prove a stronger result that f BO is optimal for randomized
classifiers as well
For a given x, f_BO chooses the label with the highest probability, thus minimizing
the probability that it makes an error.
Proof of Theorem 1. Consider some other classifier g that claims to be better than
f_BO. Then there must be some x on which g(x) ≠ f_BO(x).
Fix such an x. Now, the probability that f_BO makes an error on this particular x is
1 − D(x, f_BO(x)), and the probability that g makes an error on this x is
1 − D(x, g(x)).
But f_BO was chosen in such a way as to maximize D(x, f_BO(x)), so this must be
greater than D(x, g(x)). Thus, the probability that f_BO errs on this particular x is
smaller than the probability that g errs on it. This applies to any x for which
f_BO(x) ≠ g(x), and therefore f_BO achieves smaller zero/one error than any g.
The Bayes error rate (or Bayes optimal error rate) is the error rate of the Bayes
optimal classifier. It is the best error rate you can ever hope to achieve on this
classification problem. If we had access to the data distribution, forming an
optimal classifier would be trivial. Unfortunately, we do not know the distribution,
so we need to figure out ways of learning the mapping from x to y given only
access to a training set sampled from D, rather than D itself.
Inductive Bias:- Every machine learning algorithm that is able to generalize
beyond the training data it sees has some type of inductive bias.



Inductive bias is the assumption made by the model in order to learn the target
function and generalize beyond the training data.
Inductive bias is the set of assumptions a learner uses to predict results for inputs it
has not encountered.
Given a set of training examples, there are typically many decision trees consistent
with those examples. How does ID3 choose one among these trees?
ID3's search strategy chooses a tree based on the following:
i. Prefer shorter trees over longer ones.
ii. Prefer trees that place attributes with higher information gain closer to
the root.
These two preferences constitute the inductive bias of the ID3 algorithm.
Consider a variant of the decision tree learning algorithm.
In this variant, we will not allow the trees to grow beyond some pre-defined
maximum depth, d.
That is, once we have queried on d-many features,
we cannot query on any more and must just make the best guess
we can at that point. This variant is called a shallow decision tree.

The key question is: what is the inductive bias of shallow decision trees? Roughly,
their bias is that decisions can be made by looking at only a small number of
features; the sorts of things we want to learn to predict can be decided using only a
few of the available features.
Consider the following examples, with Class A and Class B as
training data:

Let us suppose we have the following Test Data



It can be classified as either ABBA or AABB.
Which of these solutions is right?
If we give this same example to 100 people,
60 − 70 of them come up with the ABBA prediction and
30 − 40 come up with the AABB prediction. Why?
Presumably because the first group believes that the relevant distinction is
between “bird” and “non-bird”
while the second group believes that the relevant distinction is between
“fly” and “no-fly.”
This preference for one distinction (bird/non-bird) over another (fly/no-fly)
is a bias that different human learners have.
In the context of machine learning, it is called inductive bias:
in the absence of data that narrow down the relevant concept, what type of
solutions are we more likely to prefer?
Two thirds of people seem to have an inductive bias in favor of bird/non-bird, and
one third seem to have an inductive bias in favor of fly/no-fly.
Not Everything is Learnable :-
There are many reasons why a machine learning algorithm might fail on some
learning task
There could be noise in the training data. Noise can occur both at the feature level
and at the label level
Some examples may not have a single correct answer.
In the inductive bias case, it is the particular learning algorithm that you are using
that cannot cope with the data
Under fitting and Over fitting:-
Overfitting:- If our algorithm works well on the training data but not on the test
data, it is said to overfit.
Overfitting is when an algorithm pays too much attention to idiosyncrasies of the
training data and is not able to generalize well.



Often this means that your model is fitting noise, rather than whatever it is
supposed to fit.
Eg:- A student who memorizes answers to past exam questions without
understanding them has overfit the training data. Like the full tree, this student also
will not do well on the exam.
Underfitting:- If our algorithm does not perform well even on the training data,
it is said to underfit.
Underfitting is when you had the opportunity to learn something but didn't.
Eg:- A student who hasn't studied much for an upcoming exam will be underfit to
the exam, and consequently will not do well, like the empty tree.
A model that is neither overfit nor underfit is the one that is expected to do best in
the future

Techniques to reduce underfitting:
1. Increase the model complexity
2. Increase the number of features
3. Remove noise from the data
4. Increase the number of epochs
Techniques to reduce overfitting:
1. Increase the training data
2. Reduce the complexity of the model
3. Stop early in the training phase
4. Lasso regression / dropout (in neural networks)
Separation of Training and Test Data:-
The easiest approach is to set aside some of your available data as “test data” and
use this to evaluate the performance of our learning algorithm.
For instance, the pottery recommendation service that we want to work for might
have collected 1000 examples of pottery ratings.
We will select 800 of these as training data and set aside the final 200 as test
data.
we will run our learning algorithms only on the 800 training points.
Only once we have done then we will apply our learned model to the 200 test
points, and report our test error on those 200 points to our boss.
The 80/20 split is not magic: it’s simply fairly well established
Occasionally people use a 90/10 split instead, especially if they have a lot of data.
In order to tune hyperparameters, we also set aside development data.
The development data is also called validation data or held-out data.
The general approach is as follows:
1. Split your data into 70% training data, 10% development data and 20% test data.
Take our original 1000 data points and select 700 of them as training data. From
the remainder, take 100 as development data and the remaining 200 as test data.
2. For each possible setting of your hyperparameters:
(a) Train a model using that setting of hyperparameters on the training data.
(b) Compute this model’s error rate on the development data.
3. From the above collection of models, choose the one that achieved the
lowest error rate on the development data.
4. Evaluate that model on the test data to estimate future test performance.
(A minimal sketch of this procedure follows.)
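A minimal sketch of steps 1-4 in plain Python. The train_model and error_rate functions are hypothetical placeholders standing in for whatever learning algorithm and evaluation metric you use; only the split proportions come from the text.

# 70/10/20 split and hyperparameter tuning on the development data.
import random

def split_data(data, seed=0):
    data = data[:]                       # copy, then shuffle before splitting
    random.Random(seed).shuffle(data)
    n = len(data)
    train = data[: int(0.7 * n)]
    dev   = data[int(0.7 * n): int(0.8 * n)]
    test  = data[int(0.8 * n):]
    return train, dev, test

def tune(train, dev, hyperparameter_values, train_model, error_rate):
    best = None
    for h in hyperparameter_values:                  # step 2: one model per setting
        model = train_model(train, h)                # 2(a) train on training data
        err = error_rate(model, dev)                 # 2(b) evaluate on dev data
        if best is None or err < best[0]:
            best = (err, model)
    return best[1]                                   # step 3: lowest dev error

# step 4 would be: error_rate(best_model, test) -- touch the test data only once.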

The cardinal rule of machine learning is: “Never ever touch test data”.

Models, Parameters and Hyperparameters:-


Models

A machine learning model is defined as a mathematical representation of the


output of the training process.
A machine learning model is similar to computer software designed to recognize
patterns or behaviors based on previous experience or data.

The learning algorithm discovers patterns within the training data, and it outputs an
ML model which captures these patterns and makes predictions on new data.



Hyperparameters
Hyperparameters are parameters whose values control the learning process and
determine the values of model parameters that a learning algorithm ends up
learning.
The prefix 'hyper' suggests that they are 'top-level' parameters that control the
learning process and the model parameters that result from it.
As machine learning engineers designing a model, we choose and set the
hyperparameter values that the learning algorithm will use before the training of
the model even begins.

hyperparameters are external to the model because the model cannot change its
values during learning/training.

Hyperparameters are used by the learning algorithm when it is learning but they are
not part of the resulting model.

The hyperparameters that were used during training are not part of this model.
Basically, anything in machine learning and deep learning that you decide their
values or choose their configuration before training begins and whose values or
configuration will remain the same when training ends is a hyperparameter.

Here are some common examples


 Train-test split ratio
 Choice of activation function in a neural network (nn) layer
 The choice of cost or loss function the model will use
 Number of hidden layers in a nn
 Number of activation units in each layer
 The drop-out rate in nn (dropout probability)
 Number of iterations (epochs) in training a nn
 Number of clusters in a clustering task



 Kernel or filter size in convolutional layers
 Pooling size
 Batch size
 K value in KNN

Parameters
Parameters on the other hand are internal to the model.
That is, they are learned or estimated purely from the data during training as the
algorithm used tries to learn the mapping between the input features and the labels
or targets.
Model training typically starts with parameters being initialized to some values
(random values or set to zeros).
As training/learning progresses the initial values are updated using an optimization
algorithm
The learning algorithm is continuously updating the parameter values as learning
progress but hyperparameter values set by the model designer remain unchanged.
At the end of the learning process, model parameters are what constitute the model
itself.
Examples of parameters
 The coefficients (or weights) of linear and logistic regression models.
 Weights and biases of a nn
 The cluster centroids in clustering

Therefore, setting the right hyperparameter values is very important because it


directly impacts the performance of the model that will result from them being used
during model training.
The process of choosing the best hyperparameters for our model is called
hyperparameter tuning

Real World Applications of Machine Learning:-

(2) we have some real world goal like increasing revenue for our search
engine, and decide to try to increase revenue by displaying better ads.
(3) We convert this task into a machine learning problem by deciding to
train a classifier to predict whether a user will click on an ad or not.
(4) In order to apply machine learning, we must collect some training
data;
(5) in this case, we collect data by logging user interactions with the
current system. We must choose what to log;
(6) we choose to log the ad being displayed, the query the user entered
into our search engine, and a binary value showing whether they clicked
or not.
(7) In order to make these logs consumable by a machine learning
algorithm, we convert the data into input/output pairs: in this case,
pairs of words from a bag-of-words representing the query and a bag-
of-words representing the ad as input, and the click as a ± label.
(8) We then select a model family e.g., depth 20 decision trees, and
thereby an inductive bias, for instance depth ≤ 20 decision trees.
(9) We’re now ready to select a specific subset of data to use as training
data: in this case, data from April 2016. We split this into training and
development and
(10) learn a final decision tree, tuning the maximum depth on the
development data. We can then use this decision tree to
(11) make predictions on some held-out test data, in this case from
the following month, May 2016. We can
(12) measure the overall quality of our predictor as zero/one loss
(classification error) on this test data and finally
(13) deploy our system.

Geometry and Nearest Neighbors
Prediction is the task of mapping inputs to outputs.
The inputs are a list of feature values.

Geometric view of data:- It is view where we have one dimension for every
feature. In this view, examples are points in a high dimensional space

Once we think of a data set as a collection of points in high dimensional space, we


can start performing geometric operations on this data.

For instance, suppose you need to predict whether Alice will like Algorithms.
Perhaps we can try to find another student who is most “similar” to Alice, in terms
of favorite courses. Say this student is Jeremy. If Jeremy liked Algorithms, then we
might guess that Alice will as well. This is an example of a nearest neighbor model
of learning.
From Data To Features:-
Generally the data will be in the form of a table, where each row represents an
instance (example) and each column represents a feature.
The entire table is called the DataSet.
For a Machine Learning algorithm, it is the values of the features that matter.
Feature Vector:- The representation of an example as a vector of symbolic or
numeric feature values is called a feature vector.
A feature vector contains one dimension for each feature.
Numerical features can be represented directly in the feature vector.
Age Height Weight
25 5.5 75
20 5.4 65
• Individual representation
<25,5.5,75>
<20,5.4,65>
• Complete feature vector

Binary features such as yes/no are represented using 0 and 1 value respectively



The mapping from feature values to vectors is straight forward in the case of real
valued features (trivial) and binary features (mapped to zero or one).
It is less clear what to do with categorical features. For example, if our goal is to
identify whether an object in an image is a tomato, blueberry, cucumber or
cockroach,
we might want to know its color: is it Red, Blue, Green or Black?
One option would be to map Red to a value of 0, Blue to a value of 1, Green to a
value of 2 and Black to a value of 3. The problem with this mapping is that it turns
an unordered set (the set of colors) into an ordered set (the set {0, 1, 2, 3}). In
itself, this is not necessarily a bad thing.
But when we go to use these features, we will measure examples based on their
distances to each other
By doing this mapping, we are essentially saying that Red and Blue are more
similar (distance of 1) than Red and Black (distance of 3).
This is probably not what we want to say!
A solution is to turn a categorical feature that can take four different values (say:
Red, Blue, Green and Black) into four binary features (say: IsItRed?, IsItBlue?,
IsItGreen? and IsItBlack?).
In general, if we start from a categorical feature that takes V values, we can map it
to V-many binary indicator features.
• Binary features become 0 (for false) or 1 (for true).
• Categorical features with V possible values get mapped to V-many binary
indicator features. (A short sketch of this mapping follows.)
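A minimal sketch of that mapping in plain Python; the colour values reuse the example above.

# Map a categorical feature with V possible values to V binary indicator features.
def one_hot(value, possible_values):
    return [1 if value == v else 0 for v in possible_values]

colors = ["Red", "Blue", "Green", "Black"]
print(one_hot("Red", colors))    # [1, 0, 0, 0]  ->  IsItRed?=1, the rest 0
print(one_hot("Black", colors))  # [0, 0, 0, 1]
# Every pair of different colours is now at the same distance from each other,
# so the encoding no longer pretends that Red is "closer" to Blue than to Black.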
A feature vector will have D-many components.
We will denote feature vectors as x = <x1, x2, . . . , xD>, so that xd denotes the value
of the d-th feature of x.
Since these are vectors with real-valued components in D dimensions, we say that
they belong to the space R^D.

For D = 2, our feature vectors are just points in the plane,


For D = 3 this is three dimensional space. For D > 3 it becomes quite hard to
visualize.

For D = 4 you might try to think of the fourth dimension as "time", but this will
just make things confusing.


Unfortunately, for the sorts of problems you will encounter in machine learning,
D ≈ 20 is considered “low dimensional,” D ≈ 1000 is “medium dimensional” and
D ≈ 100000 is “high dimensional.”



What does high dimensional space mean? Let's say we have n samples (data
points, instances) and p features (attributes, independent variables, explanatory
variables). We may think high dimensional data is simply a data set with a very
large p.

But does p = 1000 mean high dimensional data? In modern machine learning,
1000 features is not a big deal.

In fact, "high dimensional" has a more rigorous meaning: it refers to a data set
whenever p > n, no matter what p or n is, because in statistics you will never get a
deterministic answer when p > n unless you introduce your own assumptions.

For example, if you have 3 data points with 5 features each, it is high dimensional
data. On the other hand, even if you have 500k features, once you have 1M
samples, it is still low dimensional.

K-Nearest Neighbors:-
K-nearest neighbors (KNN) algorithm is a type of supervised ML algorithm which
can be used for both classification as well as regression predictive problems.
However, it is mainly used for classification predictive problems in industry.
In order to measure how close data points are to each other, KNN uses a distance metric.

Distance metrics are a key part of several machine learning algorithms.

These distance metrics are used in both supervised and unsupervised learning,

generally to calculate the similarity between data points.

An effective distance metric improves the performance of our machine learning


model, whether that’s for classification tasks or clustering.



Two examples will be close to each other when plotted if their features are
similar: similar feature values mean a small distance between the points.

We have the following Distance metric

1. Euclidean Distance
2. Manhattan Distance
3. Minkowski Distance
4. Hamming Distance

1. Euclidean Distance

Euclidean Distance represents the shortest distance between two points.


Most machine learning algorithms including K-NN,k-means use this distance
metric to measure the similarity between observations.
Let’s say we have two points as shown below:
So, the Euclidean Distance between these two points A and B will be:

eg:- A(2,4), B(5,4) => d = ((5-2)² + (4-4)²)^(1/2) = (9)^(1/2) = 3



We use this formula when we are dealing with 2 dimensions. We can generalize
this for an n-dimensional space as:
d(p, q) = √( Σ (pi − qi)² ) for i = 1 … n

Where,

 n = number of dimensions
 pi, qi = data points

Let’s code Euclidean Distance in Python. This will give you a better understanding
of how this distance metric works.
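A minimal sketch; the two points reuse the A(2,4), B(5,4) example above.

# Euclidean distance between two n-dimensional points.
from math import sqrt

def euclidean_distance(p, q):
    return sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

print(euclidean_distance((2, 4), (5, 4)))   # 3.0, as in the example above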

2. Manhattan Distance

Manhattan Distance is the sum of absolute differences between points across all
the dimensions.

We can represent Manhattan Distance as:



Since the above representation is 2 dimensional, to calculate Manhattan Distance,
we will take the sum of absolute distances in both the x and y directions. So, the
Manhattan distance in a 2-dimensional space is given as:

Eg:- A(2,4) B(5,4) =>d=|5-2|+|4-4|=3

And the generalized formula for an n-dimensional space is given as:

Where,

 n = number of dimensions
 pi, qi = data points

3. Minkowski Distance

Minkowski Distance is the generalized form of Euclidean and Manhattan Distance.

The formula for Minkowski Distance is given as:
d(p, q) = ( Σ |pi − qi|^r )^(1/r) for i = 1 … n
where r = 1 gives the Manhattan Distance and r = 2 gives the Euclidean Distance.



4. Hamming Distance

Hamming Distance measures the similarity between two strings of the same length.
The Hamming Distance between two strings of the same length is the number of
positions at which the corresponding characters are different.
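The remaining three metrics follow the same pattern as the Euclidean sketch; a minimal sketch in Python, with made-up example points and strings.

# Manhattan, Minkowski and Hamming distances.
def manhattan_distance(p, q):
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def minkowski_distance(p, q, r):
    # r = 1 gives Manhattan distance, r = 2 gives Euclidean distance
    return sum(abs(pi - qi) ** r for pi, qi in zip(p, q)) ** (1 / r)

def hamming_distance(s, t):
    # number of positions at which two equal-length strings differ
    return sum(1 for a, b in zip(s, t) if a != b)

print(manhattan_distance((2, 4), (5, 4)))        # 3
print(minkowski_distance((2, 4), (5, 4), 2))     # 3.0 (same as Euclidean)
print(hamming_distance("karolin", "kathrin"))    # 3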

The following two properties would define KNN well −


 Lazy learning algorithm − KNN is a lazy learning algorithm because it
does not have a specialized training phase; it uses all of the training data at
classification time.
 Non-parametric learning algorithm − KNN is also a non-parametric
learning algorithm because it doesn’t assume anything about the underlying
data.

Work flow of KNN:-

K-nearest neighbors (KNN) algorithm


The KNN algorithm simply stores all the available data and classifies new data based on a "similarity measure".
The idea is that if you are similar to your neighbours, then you are one of them.
Eg:- if an apple looks more similar to a banana or a mango than to a monkey or a cat, then most likely the apple belongs to the fruits group.
KNN is generally used for searching for similar items.



KNN uses 'feature similarity' to predict the value of a new data point: the new point is assigned a value (label) based on how closely it matches the points in the training set.

K in KNN indicates the number of nearest neighbours that vote on the class of the new (test) data point.
If k = 1, the test point is given the same label as the single training point closest to it.
If k = 3, the labels of the 3 closest training instances are checked and the most common label is assigned to the test point.

Example
Recommender systems:- Amazon recommends different products based on your browsing history, pulling up the products you are most likely to buy.
Concept search:- searching for semantically similar documents, i.e. finding documents on a similar topic within an enormous collection of documents.
Image recognition, handwriting recognition, video recognition.

Example
Consider a plot with blue and orange points, where the blue points belong to class A and the orange points belong to class B. A new point (marked with a star) arrives at some position and we want to predict its class.
If k = 3, we consider the 3 points at the least distance from the new point. If 2 of them are orange and 1 is blue, the most frequent class is orange, so we classify the new point as belonging to class B.



KNN - Algorithm
 We consider all the training data and a test point:
training data {x1, x2, x3, …, xn}, and x^ is the test point
 We initialize a k value, e.g. 1, 3 or 5
 We compute the distance from the test point to every training point:
S = {d(x1, x^), d(x2, x^), …, d(xn, x^)}
 We sort the distances in increasing order: sort S
 Among the sorted distances we take the first k entries and observe their labels:
for i ← 1 to k: (dist_i, label_i) ← S_i
 We find the majority class among these k labels
 We take that majority class as the class of our test point (a minimal Python sketch of these steps is given below, before the worked example)
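Below is a minimal Python sketch of these steps, reusing the euclidean_distance function sketched earlier (knn_predict is our own name, not a library function):

from collections import Counter

def knn_predict(training_points, training_labels, test_point, k):
    # distance from the test point to every training point
    distances = [(euclidean_distance(p, test_point), label)
                 for p, label in zip(training_points, training_labels)]
    # sort by distance and keep the k nearest neighbours
    k_nearest = sorted(distances)[:k]
    # majority vote among the k labels
    votes = Counter(label for _, label in k_nearest)
    return votes.most_common(1)[0][0]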

eg:- we are given data about customers: their height in cm, their weight in kg and the T-shirt size they required, plus a test point:
a customer named 'Manoz' has height 161 cm and weight 61 kg.
We now have to predict the T-shirt size required for this customer.

Height (cm)   Weight (kg)   T-shirt size
158 58 M
158 59 M
158 63 M
160 59 M
160 60 M
163 60 M
163 61 M
160 64 L
163 64 L
165 61 L
165 62 L



165 65 L
168 62 L
168 63 L
168 66 L
170 63 L
170 64 L
170 68 L
Let us suppose k = 5.
Now we calculate the Euclidean distance from the test point (161, 61) to every training point.
D((158,58),(161,61)) = √((161−158)² + (61−58)²) = √18 = 4.2
D((158,59),(161,61)) = √((161−158)² + (61−59)²) = √13 = 3.6
D((158,63),(161,61)) = √((161−158)² + (61−63)²) = √13 = 3.6
D((160,59),(161,61)) = √((161−160)² + (61−59)²) = √5 = 2.2
D((160,60),(161,61)) = √((161−160)² + (61−60)²) = √2 = 1.4
D((163,60),(161,61)) = √((161−163)² + (61−60)²) = √5 = 2.2
D((163,61),(161,61)) = √((161−163)² + (61−61)²) = 2.0
D((160,64),(161,61)) = √((161−160)² + (61−64)²) = √10 = 3.2
D((163,64),(161,61)) = √((161−163)² + (61−64)²) = √13 = 3.6
D((165,61),(161,61)) = √((161−165)² + (61−61)²) = 4.0
D((165,62),(161,61)) = √((161−165)² + (61−62)²) = √17 = 4.1
D((165,65),(161,61)) = √((161−165)² + (61−65)²) = √32 = 5.7
D((168,62),(161,61)) = √((161−168)² + (61−62)²) = √50 = 7.1
D((168,63),(161,61)) = √((161−168)² + (61−63)²) = √53 = 7.3
D((168,66),(161,61)) = √((161−168)² + (61−66)²) = √74 = 8.6
D((170,63),(161,61)) = √((161−170)² + (61−63)²) = √85 = 9.2
D((170,64),(161,61)) = √((161−170)² + (61−64)²) = √90 = 9.5
D((170,68),(161,61)) = √((161−170)² + (61−68)²) = √130 = 11.4

In the graph below, the binary dependent variable (T-shirt size) is displayed in blue and orange:
'Medium T-shirt size' is in blue and 'Large T-shirt size' is in orange.
The new customer's information is shown as a yellow circle.
Four blue highlighted data points and one orange highlighted data point are closest to the yellow circle,
so the prediction for the new customer is the majority class of these highlighted points, which is Medium T-shirt size.
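As a check, the hedged knn_predict sketch from above can be run on this data (values copied from the table):

heights = [158, 158, 158, 160, 160, 163, 163, 160, 163, 165, 165, 165, 168, 168, 168, 170, 170, 170]
weights = [ 58,  59,  63,  59,  60,  60,  61,  64,  64,  61,  62,  65,  62,  63,  66,  63,  64,  68]
sizes   = ['M', 'M', 'M', 'M', 'M', 'M', 'M', 'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L']

points = list(zip(heights, weights))
print(knn_predict(points, sizes, (161, 61), k=5))   # 'M' (Medium)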

We use KNN when the decision boundaries between classes are non-linear and when we have a large amount of labelled data.
K is a hyperparameter; how to select the k value is discussed below.



There is a possibility of misclassifying data points in the region where points of both classes are mixed; as we move away from this region, the classification becomes essentially error-free.



Rules for selecting the K value:-
o There is no particular way to determine the best value for "K", so we need to try several values to find the best one. The most preferred value for K is 5.
o We can select a suitable K value by iterating through the possible values and comparing their accuracy (a sketch of this loop is given after this list).
o A very low value for K, such as K = 1 or K = 2, can be noisy and makes the model sensitive to the effects of outliers.
o Larger values for K smooth out noise, but a K that is too large may include points from other classes and blur the decision boundary.
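Continuing with the T-shirt data and the knn_predict sketch from above, an accuracy-based selection of K might look like this (holding out a few points as a tiny validation set is an assumption made here purely for illustration):

train_pts, train_lbls = points[:-4], sizes[:-4]   # hold out the last 4 customers
val_pts, val_lbls = points[-4:], sizes[-4:]       # tiny validation set

def accuracy(k):
    # fraction of held-out points that knn_predict classifies correctly
    correct = sum(knn_predict(train_pts, train_lbls, p, k) == y
                  for p, y in zip(val_pts, val_lbls))
    return correct / len(val_pts)

best_k = max(range(1, 10, 2), key=accuracy)       # try odd k = 1, 3, 5, 7, 9
print(best_k, accuracy(best_k))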

The inductive bias that we have seen for KNN is that it assumes that nearby points should have the same label.
Another aspect, quite different from decision trees, is that all features are treated as equally important.

Issues in KNN

A related issue with KNN is feature scale. Suppose that we are trying to classify whether some object is a ski or a snowboard.
We are given two features about this data: the width and the height. As is standard in skiing, width is measured in millimeters and height is measured in centimeters. Since there are only two features, we can actually plot the entire training set.



Ski is the positive class. Based on this data, we might guess that a KNN classifier would do well. Suppose, however, that the width had been recorded in a unit that makes its values tiny in comparison to the height values. A KNN classifier will then effectively ignore the width values and classify almost purely based on height, and the predicted class for a test point can change purely because of this difference in feature scale.

Feature Scaling is the process of bringing all of the features of a machine learning problem to a similar scale or range. The definition is as follows:
Feature scaling is a method used to normalize the range of independent variables or features of data.
Min-Max Normalization: this technique re-scales a feature or observation value so that it lies between 0 and 1:
X_new = (x_i − x_min) / (x_max − x_min)

After feature scaling

Age Salary
1 1
0 0
0.176 -1
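A minimal sketch of min-max scaling; the raw Age and Salary values used here are made up purely for illustration, since the original values behind the table above are not given:

def min_max_scale(values):
    # rescale a list of numbers into the range [0, 1]
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages = [25, 18, 22]                 # hypothetical raw values
salaries = [50000, 20000, 35000]    # hypothetical raw values
print(min_max_scale(ages))          # [1.0, 0.0, 0.571...]
print(min_max_scale(salaries))      # [1.0, 0.0, 0.5]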



Decision Boundaries:-

A decision boundary is a line (in the case of two features), where all (or
most) samples of one class are on one side of that line, and all samples of the
other class are on the opposite side of the line. The line separates one class
from the other.

The solid line separating the positive regions from the negative regions is called
the decision boundary for this classifier. It is the line with positive land on one
side and negative land on the other side

Decision boundaries are useful ways to visualize the complexity of a learned model. Intuitively, a learned model with a decision boundary that is really jagged is really complex and prone to overfitting, while a learned model with a decision boundary that is really simple potentially underfits.
Consider, for example, a decision tree and the decision boundaries it induces: since each internal node tests a single feature, the regions it carves out are axis-aligned rectangles.



Advantages of KNN Algorithm:
o It is simple to implement.
o It is robust to the noisy training data
o It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:


o We always need to determine the value of K, which may be complex at times.
o The computation cost is high because the distance from the test point to every training sample must be calculated.

k-Means algorithm:-
k-Means is an unsupervised learning algorithm. In unsupervised learning the data contains examples but no corresponding labels.
Clustering is one such unsupervised mechanism, and k-means is a clustering algorithm.
Clustering is the process of grouping examples based on their similarity: the task of dividing the data into groups such that data points with similar features lie in the same group.
A cluster is a collection of data points that are similar to one another within the cluster and dissimilar to the points of other clusters.
A good clustering method produces clusters with high intra-cluster similarity and low inter-cluster similarity.

By using distance metrics we can measure this similarity.



Basic assumptions on the distance measure:-
d(i,i) = 0
d(i,j) = d(j,i)
d(i,j) is non-negative
K-means is a heuristic, partitioning approach that clusters data points in an iterative manner.
Inputs:-
K = number of clusters
a set of data points X = {x1, x2, x3, …, xm}
where each xi = (xi1, xi2, xi3, …, xin)
each cluster starts with a randomly chosen centroid {c1, c2, c3, …, ck}
Output:-
K clusters with new centroids, i.e. a set of k clusters S = {s1, s2, s3, …, sk}

k-means algorithm
step 1:- choose the number of clusters = k
step 2:- randomly choose the initial positions of the k centroids
step 3:- assign each data point to the nearest centroid
step 4:- for each cluster
i. calculate the intra-cluster variance
ii. re-compute the centroid position (the mean of the points assigned to it)
step 5:- if the intra-cluster variance does not change, the algorithm has converged, so stop
step 6:- else repeat step 3 and step 4
step 7:- stop
A minimal Python sketch of this loop is given below, before the worked problems.
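A minimal Python sketch of this loop, reusing the euclidean_distance function from the KNN section (k_means is our own name, not a library function):

def k_means(points, centroids, max_iters=100):
    # points: list of coordinate tuples; centroids: initial centroid positions
    for _ in range(max_iters):
        # step 3: assign each point to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)),
                      key=lambda i: euclidean_distance(p, centroids[i]))
            clusters[idx].append(p)
        # step 4: re-compute each centroid as the mean of its cluster
        new_centroids = [tuple(sum(coord) / len(c) for coord in zip(*c))
                         if c else centroids[i]
                         for i, c in enumerate(clusters)]
        # step 5/6: stop when the centroids no longer move, otherwise repeat
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return clusters, centroids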



Problem:-
Suppose we have 4 types of medicines, each with two attributes, pH and weight index. Cluster the data points with k = 2, taking A and C as the initial centroids.

Medicine   pH   Weight index

A 1 1

B 2 1

C 4 3

D 5 4

Sol: c1, c2 are the centroids of the two clusters; initially c1 = A(1,1) and c2 = C(4,3).

B(2,1), c1(1,1), c2(4,3):
d(B,c1) = √((1−2)² + (1−1)²) = √(1² + 0²) = 1
d(B,c2) = √((4−2)² + (3−1)²) = √(2² + 2²) = 2√2 = 2.8
B belongs to s1, the 1st cluster.
D(5,4), c1(1,1), c2(4,3):
d(D,c1) = √((1−5)² + (1−4)²) = √(4² + 3²) = √25 = 5
d(D,c2) = √((4−5)² + (3−4)²) = √(1² + 1²) = √2 = 1.41
D belongs to s2, the 2nd cluster.
So s1 = {A, B} and s2 = {C, D}. The new centroids are:
c1 = ((1+2)/2, (1+1)/2) = (1.5, 1)
c2 = ((4+5)/2, (3+4)/2) = (4.5, 3.5)



B(2,1), c1(1.5,1), c2(4.5,3.5):
d(B,c1) = √((1.5−2)² + (1−1)²) = √(0.5² + 0²) = 0.5
d(B,c2) = √((4.5−2)² + (3.5−1)²) = √(2.5² + 2.5²) = 2.5√2 = 3.54
B belongs to s1, the 1st cluster.

D(5,4), c1(1.5,1), c2(4.5,3.5):
d(D,c1) = √((1.5−5)² + (1−4)²) = √(3.5² + 3²) = √21.25 = 4.61
d(D,c2) = √((4.5−5)² + (3.5−4)²) = √(0.5² + 0.5²) = 0.5√2 = 0.71
D belongs to s2, the 2nd cluster.
There is no change in the cluster assignments, so the centroids remain the same and we stop the process.
Finally we get two clusters, s1 = {A, B} and s2 = {C, D},
with centroids c1 = (1.5, 1) and c2 = (4.5, 3.5).
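The hedged k_means sketch from above reproduces this result:

clusters, centers = k_means(points=[(1, 1), (2, 1), (4, 3), (5, 4)],
                            centroids=[(1, 1), (4, 3)])   # initial centroids A and C
print(clusters)   # [[(1, 1), (2, 1)], [(4, 3), (5, 4)]]  -> s1 = {A, B}, s2 = {C, D}
print(centers)    # [(1.5, 1.0), (4.5, 3.5)]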
II problem:
Consider another example with six points:

Point:  X1  X2  X3  X4  X5  X6
x:       1   2   2   3   4   5
y:       1   1   3   2   3   5

Number of clusters k = 2, with initial centroids c1 = (2,1) and c2 = (2,3).

Compute the Euclidean distance from each and every point to the centroids c1 and c2 of the two clusters:



Point        Distance from c1(2,1)   Distance from c2(2,3)   Assigned cluster

X1(1,1) 1 2.24 Cluster1-s1

X2(2,1) 0 2 Cluster1-s1

X3(2,3) 2 0 Cluster2-s2

X4(3,2) 1.41 1.41 Cluster1-s1

X5(4,3) 2.82 2 Cluster2-s2

X6(5,5) 5 3.61 Cluster2-s2

S1 = {x1, x2, x4} = {(1,1), (2,1), (3,2)};  S2 = {x3, x5, x6} = {(2,3), (4,3), (5,5)}
New centeroids
C1=((1+2+3)/3,(1+1+2)/3)=(2,4/3)=(2,1.33)
C2=((2+4+5)/3,(3+3+5)/3)=(11/3,11/3)=(3.67,3.67)
Again we have to compute Euclidean distance from new centeroids to each and
every point

Point        Distance from c1(2,1.33)   Distance from c2(3.67,3.67)   Assigned cluster

X1(1,1) 1.05 3.78 Cluster1-s1

X2(2,1) 0.33 3.15 Cluster1-s1

X3(2,3) 1.67 1.8 Cluster1-s1

X4(3,2) 1.204 1.8 Cluster1-s1

X5(4,3) 2.605 0.75 Cluster2-s2

X6(5,5) 4.74 1.88 Cluster2-s2

S1 = {x1, x2, x3, x4} = {(1,1), (2,1), (2,3), (3,2)};  S2 = {x5, x6} = {(4,3), (5,5)}
New centeroids



C1=((1+2+2+3)/4, (1+1+3+2)/4) =(2,7/4)=(2,1.75)
C2=((4+5)/2,(3+5)/2)=(9/2,8/2)=(4.5,4)
Again we have to compute Euclidean distance from new centeroids to each and
every point

Point        Distance from c1(2,1.75)   Distance from c2(4.5,4)   Assigned cluster

X1(1,1) 1.25 4.61 Cluster1-s1

X2(2,1) 0.75 3.9 Cluster1-s1

X3(2,3) 1.25 2.69 Cluster1-s1

X4(3,2) 1.03 2.5 Cluster1-s1

X5(4,3) 2.36 1.12 Cluster2-s2

X6(5,5) 4.42 1.12 Cluster2-s2

S1 = {x1, x2, x3, x4} = {(1,1), (2,1), (2,3), (3,2)}
S2 = {x5, x6} = {(4,3), (5,5)}
New centroids:
C1 = ((1+2+2+3)/4, (1+1+3+2)/4) = (2, 7/4) = (2, 1.75)
C2 = ((4+5)/2, (3+5)/2) = (9/2, 8/2) = (4.5, 4)
Since there is no change in the cluster centroids we can stop the process.
The points (1,1), (2,1), (2,3), (3,2) are in one cluster,
whereas (4,3) and (5,5) are in the other cluster.
Advantages:-
i. Simple, easy to understand and robust
ii. Works well when the clusters are compact and well separated from each other
iii. Scalable and efficient enough for large datasets
Disadvantages:-
i. Requires a prior choice of the number of clusters and the initial cluster centres
ii. Sensitive to noisy data and outliers because it uses the mean (k-medoids is more robust)



iii. Can get stuck in a local optimum
iv. Only applicable when the mean of the data is defined

High Dimensions are scary:-


Curse of Dimensionality:-
As the number of features or dimensions increases, the amount of data we need in order to generalize grows exponentially.
Ex => with n binary features, the number of possible data points is 2^n.

Visualizing 1D, 2D or 3D data is possible for humans, but visualizing data in more than three dimensions is hard for human beings.
In addition to this difficulty of visualization there is computational and mathematical hardness; these aspects together are called the Curse of Dimensionality.
As the number of features increases it also becomes harder to compute the nearest neighbours in KNN, and the speed of prediction becomes slow for large data sets.
In addition to the computational difficulties of working in high dimensions, a large number of strange mathematical occurrences arise there.

Consider two-dimensional space: we start with 4 spheres (in 2D they are circles) of radius 1 unit, each touching its neighbours at exactly one point, and we then insert an orange circle in the middle which touches each of the four circles at exactly one point.
The maximum radius of this inner circle follows from the Pythagorean theorem:
2² + 2² = (2 + 2r)², or equivalently


1² + 1² = (1 + r)²  =>  r = √2 − 1 ≈ 0.41
If we consider this in three dimensions, then the maximum radius will be given by 1² + 1² + 1² = (1 + r)²  =>  r = √3 − 1 ≈ 0.73
If it is 4-D then r = √4 − 1 = 1
If it is 9-D then r = √9 − 1 = 2
In general, in D-dimensional space there will be 2^D hyperspheres of radius one. Each hypersphere will touch exactly D-many other hyperspheres. The orange hypersphere in the middle will touch them all and will have radius r = √D − 1.
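A tiny sketch of how quickly this inner radius grows (at D = 9 the inner hypersphere already touches the enclosing box, and for larger D it pokes outside it):

import math

for d in (2, 3, 4, 9, 100):
    r = math.sqrt(d) - 1   # radius of the inner hypersphere in d dimensions
    print(d, round(r, 2))  # 2 0.41, 3 0.73, 4 1.0, 9 2.0, 100 9.0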
When dimensionality increases, the volume of the space increases so fast that the available data becomes sparse; as a result, to obtain a statistically sound and reliable result, the amount of data needed to support the result grows exponentially with the dimensionality.

Due to the curse of dimensionality all objects appear to be sparse and dissimilar in many ways, which prevents common data organization strategies from being efficient.

The furthest distance between two points in the D-dimensional unit hypercube is √D, yet as D grows the distances between randomly chosen points begin to concentrate around 0.4√D.
As we keep increasing the number of features, accuracy improves up to a particular threshold value; once that threshold is crossed, adding further features makes the accuracy decrease.
That is the point where dimensionality reduction comes into the picture.

