Introduction:-
Machine learning is a growing technology that enables computers to learn
automatically from past data. Machine learning uses various algorithms
to build mathematical models and make predictions using historical
data or information.
> Humans learn from their past experiences using their learning capabilities.
> A machine, by contrast, works according to the instructions given by us.
> If a human trains the machine, the machine can then work much faster than a
human being.
Tempo
Intensity
Genre
Etc.,
Given
A set of input features x1, x2, x3, …, xn
A target feature y
A set of training examples in which the value of the target feature is given for each example
A new example in which only the values of the input features are given
Predict the value of the target feature for the new example
Classification: when Y is discrete.
Eg:- given a set of images of animals, we have to predict what kind of animal
each one is.
We have an Agent acting in an Environment, and we have to figure out what action
the agent must take based on the rewards and penalties it gets from the environment.
St is the current state
Rt is the current reward
Based on the current state and reward, the agent takes action At in the environment;
the state is then updated to St+1 and the reward becomes Rt+1.
The above image shows a robot, a diamond, and fire. The goal of the robot is to
get the reward, the diamond, while avoiding the hurdle, which is the fire.
The robot learns by trying all the possible paths and then choosing the path
which gives it the reward with the fewest hurdles.
Each right step gives the robot a reward and each wrong step subtracts from
the robot's reward.
The total reward is calculated when it reaches the final reward, that is, the
diamond.
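As a minimal sketch of the interaction loop described above, the toy Python snippet below runs an agent in a hypothetical one-dimensional environment. The environment_step function, the reward values, and the random action choice are illustrative assumptions, not part of these notes.

import random

# Minimal agent-environment loop: at each step the agent observes state St and
# reward Rt, picks an action At, and the environment returns St+1 and Rt+1.
def environment_step(state, action):
    # Toy dynamics: moving "right" brings the agent closer to the goal at 5.
    next_state = state + (1 if action == "right" else -1)
    if next_state == 5:
        return next_state, +10   # reached the diamond (goal reward)
    if next_state == -3:
        return next_state, -10   # stepped into the fire (penalty)
    return next_state, -1        # small cost for every other step

state, total_reward = 0, 0
for t in range(20):
    action = random.choice(["left", "right"])   # a real agent would learn this choice
    state, reward = environment_step(state, action)
    total_reward += reward
    if state in (5, -3):                        # episode ends at the goal or the fire
        break

print("Total reward collected:", total_reward)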
Machine learning Life cycle
Machine learning has given computer systems the ability to learn automatically
without being explicitly programmed, so the process can be described using a life cycle.
The most important thing in the complete process is to understand the problem and
to know its purpose. Therefore, before starting the life cycle, we need to
understand the problem, because a good result depends on a better understanding
of the problem.
In the complete life cycle process, to solve a problem, we create a machine
learning system called a "model", and this model is created by providing "training".
But to train a model we need data; hence, the life cycle starts by collecting data.
1. Gathering Data
2. Data preparation
After collecting the data, we need to prepare it for further steps. Data preparation is
a step where we put our data into a suitable place and prepare it to use in our
machine learning training.
In this step, first, we put all data together, and then randomize the ordering of data.
This step can be further divided into two processes:
o Data exploration:
It is used to understand the nature of the data that we have to work with. We
need to understand the characteristics, format, and quality of the data.
A better understanding of the data leads to an effective outcome. In this step, we find
correlations, general trends, and outliers.
o Data pre-processing:
Now the next step is pre-processing of the data for its analysis.
3. Data Wrangling
Data wrangling is the process of cleaning and converting raw data into a usable
format. It is the process of cleaning the data, selecting the variables to use, and
transforming the data into a proper format to make it more suitable for analysis in the
next step. It is one of the most important steps of the complete process. Cleaning of
data is required to address quality issues.
The data we have collected is not always useful, as some of it may be irrelevant.
In real-world applications, collected data may have various
issues (handled in the short sketch after this list), including:
o Missing Values
o Duplicate data
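A small pandas sketch of how these two issues are commonly handled; the column names and values below are made up purely for illustration.

import pandas as pd

# Hypothetical raw data containing a missing value and a duplicate row.
raw = pd.DataFrame({
    "age":    [25, 30, None, 30],
    "salary": [50000, 60000, 55000, 60000],
})

clean = raw.drop_duplicates().copy()                      # remove duplicate rows
clean["age"] = clean["age"].fillna(clean["age"].mean())   # fill missing ages with the column mean

print(clean)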
4. Data Analysis
Now the cleaned and prepared data is passed on to the analysis step. This step
involves:
o Selection of analytical techniques
o Building models
o Review the result
The aim of this step is to build a machine learning model to analyze the data using
various analytical techniques and review the outcome. It starts with determining
the type of problem, where we select machine learning techniques such as
Classification, Regression, Cluster analysis, Association, etc.; we then build the
model using the prepared data and evaluate it.
Hence, in this step, we take the data and use machine learning algorithms to build
the model.
5. Train Model
Now the next step is to train the model. In this step we train our model to improve
its performance and obtain a better outcome for the problem.
We use datasets to train the model using various machine learning algorithms.
Training a model is required so that it can understand the various patterns, rules,
and features.
6. Test Model
Once our machine learning model has been trained on a given dataset, then we test
the model. In this step, we check for the accuracy of our model by providing a test
dataset to it.
Testing the model determines the percentage accuracy of the model as per the
requirements of the project or problem.
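To make the train and test steps concrete, here is a hedged scikit-learn sketch; the iris dataset, the decision tree model, and the 70/30 split are illustrative choices, not something prescribed by these notes.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Train Model: fit the model on the training split only.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)

# Test Model: measure percentage accuracy on the held-out test split.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.2%}")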
7. Deployment
The last step of machine learning life cycle is deployment, where we deploy the
model in the real-world system.
If the above-prepared model is producing an accurate result as per our requirement
with acceptable speed, then we deploy the model in the real system. But before
deploying the project, we will check whether it is improving its performance using
available data or not. The deployment phase is similar to making the final report
for a project.
ID3 Algorithm (Decision Tree Learning):-
d. Then below this new branch add a leaf node with label = most common target value in the examples
e. Else below this new branch add the subtree ID3(Examples(vi), Target_Attribute, Attributes – {A})
10. End
11. Return Root
P(yes) = 2/6, P(no) = 4/6
Entropy = -(2/6) log2(2/6) - (4/6) log2(4/6)
= 0.92
Information Gain IG(A) tells us how much uncertainty in S was reduced after
splitting set S on attribute A
Here, the dataset has binary classes (yes and no), where 9 out of 14 examples are
"yes" and 5 out of 14 are "no".
Complete entropy of the dataset is:
Complete dataset   Y = 9   N = 5   Total = 14
H(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
For each attribute of the dataset, let's follow step 2 of the pseudocode:-
First Attribute – Outlook
Outlook     Y   N   Total
Sunny       2   3   5
Overcast    4   0   4
Rain        3   2   5
Second Attribute – Humidity
Humidity    Y   N   Total
High        3   4   7
Normal      6   1   7
Third Attribute – Wind
Wind        Y   N   Total
Weak        6   2   8
Strong      3   3   6
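Assuming the counts above come from the usual 14-example play-tennis dataset, the following short sketch reproduces the entropy and information-gain numbers used in this walkthrough.

from math import log2

def entropy(counts):
    # Entropy of a class distribution given as a list of class counts.
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts):
    # IG = entropy(parent) - weighted average entropy of the children after the split.
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child) for child in child_counts)
    return entropy(parent_counts) - weighted

print(entropy([9, 5]))                                     # ~0.940 (complete dataset)
print(information_gain([9, 5], [[2, 3], [4, 0], [3, 2]]))  # ~0.247 (Outlook)
print(information_gain([9, 5], [[3, 4], [6, 1]]))          # ~0.152 (Humidity)
print(information_gain([9, 5], [[6, 2], [3, 3]]))          # ~0.048 (Wind)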
Here, the attribute with maximum information gain is Outlook. So, the decision
tree built so far -
Entropy(Sunny) = H(S=Sunny) = -p(y) log2 p(y) - p(n) log2 p(n)
= -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.971
Temperature (Outlook=Sunny)   Y   N   Total
Hot                           0   2   2
Mild                          1   1   2
Cool                          1   0   1
Humidity (Outlook=Sunny)      Y   N   Total
High                          0   3   3
Normal                        2   0   2
Wind (Outlook=Sunny)          Y   N   Total
Weak                          1   2   3
Strong                        1   1   2
IG(Sunny, Wind) = H(Sunny) - [(3/5) H(Wind=Weak) + (2/5) H(Wind=Strong)]
= 0.971 - [(3/5)(0.918) + (2/5)(1.0)] = 0.0202
Attribute (Outlook=Sunny)   IG
Temperature                 0.571
Humidity                    0.971
Wind                        0.0202
Here, the attribute with maximum information gain is Humidity. So, the
decision tree built so far –
Outlook = Rain   Y   N   Total
                 3   2   5
H(Wind=Weak) = -(3/3) log2(3/3) - 0 = 0
H(Wind=Strong) = -0 - (2/2) log2(2/2) = 0
Here, when Outlook = Rain and Wind = Strong, it is a pure class of category
"no". And When Outlook = Rain and Wind = Weak, it is again a pure class of
category "yes".
And this is our final desired tree for the given dataset.
The key question is: What is the inductive bias of shallow decision trees? Roughly,
their bias is that decisions can be made by only looking at a small number of
features.
The inductive bias of a decision tree is that the sorts of things we want to learn to
predict can be separated well by asking about only a small number of informative features.
Consider the following examples with ClassA and ClassB as
Training Data
The cardinal rule of machine learning is: “Never ever touch test data”.
The learning algorithm discovers patterns within the training data, and it outputs an
ML model which captures these patterns and makes predictions on new data.
Hyperparameters are external to the model because the model cannot change their
values during learning/training.
Hyperparameters are used by the learning algorithm while it is learning, but they are
not part of the resulting model.
The hyperparameters that were used during training are not part of this model.
Basically, anything in machine learning and deep learning whose value or
configuration you decide before training begins, and which remains the same when
training ends, is a hyperparameter.
Parameters
Parameters on the other hand are internal to the model.
That is, they are learned or estimated purely from the data during training as the
algorithm used tries to learn the mapping between the input features and the labels
or targets.
Model training typically starts with parameters being initialized to some values
(random values or set to zeros).
As training/learning progresses the initial values are updated using an optimization
algorithm
The learning algorithm continuously updates the parameter values as learning
progresses, but the hyperparameter values set by the model designer remain unchanged.
At the end of the learning process, model parameters are what constitute the model
itself.
Examples of parameters:
The coefficients (or weights) of linear and logistic regression models.
The weights and biases of a neural network.
The cluster centroids in clustering.
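A hedged scikit-learn sketch of this distinction: C and max_iter below are set before training (hyperparameters), while coef_ and intercept_ are estimated from the data (parameters). The breast-cancer dataset and the logistic regression model are illustrative choices only.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Hyperparameters: chosen by the designer before training, unchanged by training.
model = LogisticRegression(C=1.0, max_iter=5000)

# Parameters: the coefficients and the intercept are estimated from the data.
model.fit(X, y)
print("learned coefficients:", model.coef_.shape)   # the weights of the model
print("learned intercept:", model.intercept_)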
(2) we have some real world goal like increasing revenue for our search
engine, and decide to try to increase revenue by displaying better ads.
(3) We convert this task into a machine learning problem by deciding to
train a classifier to predict whether a user will click on an ad or not.
(4) In order to apply machine learning, we must collect some training
data;
(5) in this case, we collect data by logging user interactions with the
current system. We must choose what to log;
(6) we choose to log the ad being displayed, the query the user entered
into our search engine, and a binary value showing whether they clicked or
not.
(7) In order to make these logs consumable by a machine learning
algorithm, we convert the data into input/output pairs: in this case,
pairs of words from a bag-of-words representing the query and a bag-
of-words representing the ad as input, and the click as a ± label.
(8) We then select a model family, e.g., decision trees, and
thereby an inductive bias, for instance depth ≤ 20 decision trees.
(9) We’re now ready to select a specific subset of data to use as training
data: in this case, data from April 2016. We split this into training and
development and
(10) learn a final decision tree, tuning the maximum depth on the
development data. We can then use this decision tree to
(11) make predictions on some held-out test data, in this case from
the following month, May 2016. We can
(12) measure the overall quality of our predictor as zero/one loss
(classification error) on this test data and finally
(13) deploy our system.
Geometric view of data:- It is a view where we have one dimension for every
feature. In this view, examples are points in a high-dimensional space.
For instance, suppose you need to predict whether Alice will like Algorithms.
Perhaps we can try to find another student who is most “similar” to Alice, in terms
of favorite courses. Say this student is Jeremy. If Jeremy liked Algorithms, then we
might guess that Alice will as well. This is an example of a nearest neighbor model
of learning.
From Data To Features:-
Generally the data will be in the form of a table,
where each row represents an instance or example and each column represents a
feature.
The entire table is called a Dataset.
For a machine learning algorithm, the names of the features are not important;
it is the values of the features that matter.
Feature Vector:- The representation of the symbolic or numeric values of an example's
features is called a feature vector.
A feature vector contains one dimension for each feature.
Numerical features can be directly represented in the feature vector.
Age Height Weight
25 5.5 75
20 5.4 65
• Individual representation
<25,5.5,75>
<20,5.4,65>
• Complete feature vector
[<25, 5.5, 75>, <20, 5.4, 65>]
Binary features such as yes/no are represented using 0 and 1 value respectively
In fact, "high dimension" has a rigorous meaning: it refers to any dataset where
p > n (more features than examples), no matter how large p or n is, because in
statistics you will never have a deterministic answer when p > n unless you
introduce your own assumptions.
For example, if you have 3 data points, and 5 features each, it’s a high dimensional
data. On the other hand, even if you have 500k features, once you have 1M
samples, it’s still low dimensional.
K-Nearest Neighbors:-
K-nearest neighbors (KNN) algorithm is a type of supervised ML algorithm which
can be used for both classification as well as regression predictive problems.
However, it is mainly used for classification predictive problems in industry.
In order to find the nearest neighbours, KNN uses a distance metric.
These distance metrics are used in both supervised and unsupervised learning:
1. Euclidean Distance
2. Manhattan Distance
3. Minkowski Distance
4. Hamming Distance
1. Euclidean Distance
Euclidean distance is the straight-line distance between two points:
d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + … + (pn - qn)^2)
Where,
n = number of dimensions
pi, qi = the i-th coordinates of the data points p and q
Let’s code Euclidean Distance in Python. This will give you a better understanding
of how this distance metric works.
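The code referred to above is not present in these notes, so here is a minimal sketch of what such a snippet might look like (the example points are made up):

from math import sqrt

def euclidean_distance(p, q):
    # Straight-line distance between two n-dimensional points p and q.
    return sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

print(euclidean_distance([1, 2, 3], [4, 6, 3]))   # 5.0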
2. Manhattan Distance
Manhattan Distance is the sum of absolute differences between points across all
the dimensions:
d(p, q) = |p1 - q1| + |p2 - q2| + … + |pn - qn|
Where,
n = number of dimensions
pi, qi = the i-th coordinates of the data points p and q
3. Minkowski Distance
Minkowski Distance is a generalisation of the two metrics above, with an order
parameter r: r = 1 gives the Manhattan distance and r = 2 gives the Euclidean distance.
d(p, q) = (|p1 - q1|^r + |p2 - q2|^r + … + |pn - qn|^r)^(1/r)
4. Hamming Distance
Hamming Distance measures the similarity between two strings of the same length:
it is the number of positions at which the corresponding characters are different.
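For completeness, hedged sketches of the other three metrics listed above (the example inputs are made up):

def manhattan_distance(p, q):
    # Sum of absolute coordinate differences.
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def minkowski_distance(p, q, r=2):
    # Generalised metric: r=1 is Manhattan, r=2 is Euclidean.
    return sum(abs(pi - qi) ** r for pi, qi in zip(p, q)) ** (1 / r)

def hamming_distance(s1, s2):
    # Number of positions at which two equal-length strings differ.
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(manhattan_distance([1, 2], [4, 6]))      # 7
print(minkowski_distance([1, 2], [4, 6]))      # 5.0
print(hamming_distance("karolin", "kathrin"))  # 3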
K in KNN indicates the number of nearest neighbours that vote on the class of the
new or test data point.
If k = 1, the test point is given the same label as the single closest training instance.
If k = 3, the labels of the 3 closest instances are checked and the most common label
is assigned to the test point.
Example
Amazon uses KNN in recommender systems:- Amazon recommends different
products based on browsing history, pulling up the products a customer is most likely to buy.
Example
Consider a plot with orange and blue points, where the blue points are class A points and the
orange points belong to class B. If a new point (a star) arrives at some position, now
we want to predict the class of this point.
If k = 3, we consider the 3 closest points, i.e., those at the least distance from the new point.
Since 2 of them are orange and 1 is blue, the most frequent class is orange, so we classify the
new point as belonging to class B.
eg:- we are given data about customers: height in cm, weight in kg, and the
required T-shirt size,
and a test point:
a customer named 'Manoz' has height 161 cm and weight 61 kg.
Now we have to predict the size of T-shirt required for this customer.
Height (cm)   Weight (kg)   T-shirt size
158 58 M
158 59 M
158 63 M
160 59 M
160 60 M
163 60 M
163 61 M
160 64 L
163 64 L
165 61 L
165 62 L
In the graph below, the binary dependent variable (T-shirt size) is displayed in blue
and orange color.
'Medium T-shirt size' is shown in blue and 'Large T-shirt size' in orange.
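A hedged scikit-learn sketch of this prediction; k = 5 is an illustrative choice, since the notes do not fix a value of k for this example.

from sklearn.neighbors import KNeighborsClassifier

# Training data from the table above: [height_cm, weight_kg] -> T-shirt size.
X = [[158, 58], [158, 59], [158, 63], [160, 59], [160, 60], [163, 60],
     [163, 61], [160, 64], [163, 64], [165, 61], [165, 62]]
y = ["M", "M", "M", "M", "M", "M", "M", "L", "L", "L", "L"]

knn = KNeighborsClassifier(n_neighbors=5)   # k = 5 chosen only for illustration
knn.fit(X, y)

# Predict the size for the new customer 'Manoz' (161 cm, 61 kg).
print(knn.predict([[161, 61]]))   # most of the 5 nearest neighbours are 'M'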
We use KNN when there are non-linear decision boundaries between classes
and when we have a large amount of data.
K is a hyperparameter; the k-value is typically selected by trying several values
and keeping the one that performs best on held-out (validation) data.
The inductive bias that we have seen for KNN is that it assumes that nearby points
should have the same label.
Another aspect, which is quite different from decision trees, is that all features
are treated as equally important.
Issues in KNN
A related issue with KNN is feature scale. Suppose that we are trying to classify
whether some object is a ski or a snowboard .
We are given two features about this data: the width and height. As is standard
in skiing, width is measured in millimeters and height is measured in
centimeters. Since there are only two features, we can actually plot the entire
training set
Age Salary
1 1
0 0
0.176 -1
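One common remedy for the feature-scale issue described above is to rescale each feature to a common range before computing distances. Below is a hedged min-max scaling sketch; the width/height measurements are made up for illustration.

from sklearn.preprocessing import MinMaxScaler

# Hypothetical [width_mm, height_cm] measurements on very different scales.
X = [[120.0, 150.0],
     [125.0, 160.0],
     [280.0, 10.0],
     [300.0, 12.0]]

# Rescale each feature to [0, 1] so that neither dominates the distance.
X_scaled = MinMaxScaler().fit_transform(X)
print(X_scaled)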
A decision boundary is a line (in the case of two features), where all (or
most) samples of one class are on one side of that line, and all samples of the
other class are on the opposite side of the line. The line separates one class
from the other.
The solid line separating the positive regions from the negative regions is called
the decision boundary for this classifier. It is the line with positive land on one
side and negative land on the other side
k-Means algorithm:-
It is an unsupervised learning algorithm. In unsupervised learning, the data contains
examples but no corresponding labels.
Clustering is one such unsupervised mechanism used in unsupervised learning
algorithms. K-means is a clustering algorithm.
Clustering is the process of grouping examples based on their similarity.
Clustering is the task of dividing the data into groups such that data points with
similar features lie in one group.
A cluster is a collection of data points similar to one another within the cluster
and dissimilar to the points in other clusters.
A good clustering method yields high intra-cluster similarity and low inter-cluster
similarity.
k-means algorithm
Step 1:- Choose the number of clusters k.
Step 2:- Randomly choose the initial positions of the k centroids.
Step 3:- Assign each of the data points to the nearest centroid.
Step 4:- For each cluster:
i. Calculate the intra-cluster variance.
ii. Re-compute the centroid position.
Step 5:- If the intra-cluster variance does not change, the algorithm has
converged; stop.
Step 6:- Else, repeat Step 3 and Step 4.
Step 7:- Stop.
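A short sketch of these steps using scikit-learn's KMeans; the six data points are the ones used in the second worked example further below, and k = 2 matches that example.

import numpy as np
from sklearn.cluster import KMeans

# The six points used in the worked example further below.
X = np.array([[1, 1], [2, 1], [2, 3], [3, 2], [4, 3], [5, 5]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster labels:   ", kmeans.labels_)           # which cluster each point belongs to
print("cluster centroids:", kmeans.cluster_centers_)  # expected near (2, 1.75) and (4.5, 4)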
Medicine PH weight
A 1 1
B 2 1
C 4 3
D 5 4
C(4,3), D(5,4)
C2 = ((4+5)/2, (3+4)/2) = (9/2, 7/2) = (4.5, 3.5)
Data points: x1=(1,1), x2=(2,1), x3=(2,3), x4=(3,2), x5=(4,3), x6=(5,5)
Number of clusters k = 2
Initial centroids:
c1 = (2,1)
c2 = (2,3)
Compute the Euclidean distance from each and every point to the centroids c1 and c2,
and assign each point to the nearer cluster:
Point      dist to c1(2,1)   dist to c2(2,3)   Cluster
x1 (1,1)   1                 2.24              Cluster 1 - S1
x2 (2,1)   0                 2                 Cluster 1 - S1
x3 (2,3)   2                 0                 Cluster 2 - S2
x4 (3,2)   1.41              1.41              Cluster 1 - S1 (tie, assigned to c1)
x5 (4,3)   2.83              2                 Cluster 2 - S2
x6 (5,5)   5                 3.61              Cluster 2 - S2
S1 = {x1, x2, x4} = {(1,1), (2,1), (3,2)};  S2 = {x3, x5, x6} = {(2,3), (4,3), (5,5)}
New centroids:
C1 = ((1+2+3)/3, (1+1+2)/3) = (2, 4/3) = (2, 1.33)
C2 = ((2+4+5)/3, (3+3+5)/3) = (11/3, 11/3) = (3.67, 3.67)
Again we compute the Euclidean distance from the new centroids to each and
every point and re-assign the points:
S1 = {x1, x2, x3, x4} = {(1,1), (2,1), (2,3), (3,2)}
S2 = {x5, x6} = {(4,3), (5,5)}
New centroids:
C1 = ((1+2+2+3)/4, (1+1+3+2)/4) = (2, 7/4) = (2, 1.75)
C2=((4+5)/2,(3+5)/2)=(9/2,8/2)=(4.5,4)
Re-assigning the points with these centroids gives the same clusters, so the centroids
no longer change and we can stop the process.
The points (1,1), (2,1), (2,3), (3,2) are in one cluster,
whereas (4,3) and (5,5) are in the other cluster.
Advantages:-
i. Simple, easy to understand, and robust
ii. Suitable when the data forms compact clusters that are well separated from each other
iii. Scalable and efficient enough for large datasets
Disadvantages:-
i. Requires a prior assumption about the number of clusters and the initial cluster centres
ii. Sensitive to noisy data because it uses the mean (k-medoids can be used instead)
Consider a two-dimensional space. We start with 4 spheres (in 2D they are circles),
each of unit radius, touching each other at exactly one point. We then insert an
inner (orange) circle which touches each of the four circles at exactly one point.
The maximum radius r of this inner circle, by the Pythagorean theorem, satisfies
2^2 + 2^2 = (2 + 2r)^2
=> 2 + 2r = 2*sqrt(2)
=> r = sqrt(2) - 1 ≈ 0.414
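A quick numerical check of this result, based on the equation above; the placement of the four unit circles at centres (1,1), (1,3), (3,1) and (3,3) is an assumption about the figure.

from math import sqrt

# The inner circle sits at the centre (2,2). Its radius is the distance from
# (2,2) to a circle centre minus that circle's unit radius.
inner_radius = sqrt((2 - 1) ** 2 + (2 - 1) ** 2) - 1
print(inner_radius)                    # 0.414... = sqrt(2) - 1

# Equivalent to solving 2^2 + 2^2 = (2 + 2r)^2 for r:
r = (sqrt(2 ** 2 + 2 ** 2) - 2) / 2
print(r)                               # same value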