

Anuranan Das

Summer of Sciences, 2019.

Understanding and Implementing Machine Learning


Introduction:
“Machine learning will automate jobs that most people thought could only be
done by people.” ~Dave Waters.

Indeed, since the invention of computers and their intervention in daily life,
mankind can be said to have made considerable progress. One of the most
interesting features of machine learning is that it lies on the boundary of several
different academic disciplines, principally computer science, statistics,
mathematics, and engineering. Machine learning is usually studied as part of
artificial intelligence, which puts it firmly into computer science; yet understanding
why these algorithms work requires a certain amount of statistical and
mathematical sophistication. In one line, we can aptly say that

Machine Learning is the training of a model from data that generalizes a
decision against a performance measure.

Classification of Machine learning:

The following picture gives a clear indication of the various parts discussed in
machine learning.
As is clear from the illustration, machine learning can be broadly divided into three
parts:

 Supervised Learning:
1. Prediction
2. Classification

 Unsupervised learning:
1. Clustering
 Reinforcement Learning

Machine learning Algorithms:

 Supervised Learning:
In supervised learning, algorithms learn from labeled data. After understanding
the data, the algorithm determines which label should be given to new data by
associating the patterns it has found with the unlabeled new data.

Supervised learning can be divided into two categories: Classification and
Regression.

Classification predicts the category the data belongs to.

e.g. spam detection, churn prediction, sentiment analysis, dog breed detection.

Regression/Prediction predicts a numerical value based on previously observed
data.

e.g. house price prediction, stock price prediction, height-weight prediction.

Classification:
Classification is a technique for determining which class the dependent variable
belongs to, based on one or more independent variables.

Classification is used for predicting discrete responses.

Certain common algorithms used in Classification problem are:


1. Logistic regression: This is almost like linear regression, with the difference
that the dependent variable is not a number but some binary or discrete value
indicating yes/no or the like. Usually the cross-entropy cost function is used in
logistic regression. For binary classification, y can take only two values, 1 and 0;
for example, if we want to predict whether a person has cancer based on tumor
size, there are only two possibilities. In such cases we use the sigmoid function
as our hypothesis function and set a threshold as per our requirements. So, for
our problem, the hypothesis is represented by the sigmoid function so that it
remains between 0 and 1:

hθ(x) = g(θᵀx), where g(z) = 1 / (1 + e^(−z))

The function g(z), shown here, maps any real number to the (0, 1) interval, making
it useful for transforming an arbitrary-valued function into a function better
suited for classification. Also, as z → ∞, g(z) → 1 and as z → −∞, g(z) → 0, so we
have mapped all the reals into a function with output range (0, 1).

The cost function usually used in this case is given by

Cost(hθ(x), y) = −y log(hθ(x)) − (1 − y) log(1 − hθ(x))

If our correct answer y is 0, then the cost will be 0 if our hypothesis function also
outputs 0; if our hypothesis approaches 1, then the cost will approach infinity.
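As a minimal sketch of the above (not from the original report; the tumor-size data is made up for illustration), the following Python/NumPy snippet implements the sigmoid hypothesis and fits θ by gradient descent on the cross-entropy cost:

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z)) maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.1, iterations=5000):
    """Gradient descent on the cross-entropy cost for logistic regression."""
    m, n = X.shape
    X = np.hstack([np.ones((m, 1)), X])     # prepend x0 = 1 (intercept term)
    theta = np.zeros(n + 1)
    for _ in range(iterations):
        h = sigmoid(X @ theta)              # hypothesis h_theta(x) for all examples
        theta -= alpha * X.T @ (h - y) / m  # gradient of the cross-entropy cost
    return theta

# Hypothetical data: label 1 = cancer, predicted from tumor size
sizes = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
labels = np.array([0, 0, 0, 1, 1])
theta = fit_logistic(sizes, labels)
# Probability for a tumor of size 4.5; set a threshold (say 0.5) as required
print(sigmoid(np.array([1.0, 4.5]) @ theta))
```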

2. K-Nearest Neighbors (K-NN): The K-NN algorithm is one of the simplest
classification algorithms; it is used to identify which of several classes a new
sample point should be assigned to. K-NN is a non-parametric, lazy learning
algorithm. It classifies new cases based on a similarity measure (e.g. distance
functions). These distance functions can be based on various kernels. The two
most commonly used kernels are:

1. Linear kernel: predict "y = 1" if θᵀx ≥ 0.

2. Gaussian kernel: K(x, y) = exp(−‖x − y‖² / (2σ²)), where y is the centroid and x
is the point from which the distance is being measured. Needless to say, x and y
are both vectors.

After applying the K-means algorithm and classification we may get different
groups, as in the illustration. The only difference between clustering in supervised
and unsupervised learning is that in supervised learning we usually know definitely
what the groups mean (they are labeled), whereas in unsupervised learning we just
find clusters in the data and only later work out what the similarities mean.
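A minimal K-NN sketch (again an illustration, not from the report; the 2-D points and labels are hypothetical) that classifies a new point by majority vote among its k nearest neighbours:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distances
    nearest = np.argsort(distances)[:k]                  # indices of the k closest
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Hypothetical 2-D data forming two labeled groups
X_train = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([2, 2])))  # -> 0
print(knn_predict(X_train, y_train, np.array([7, 9])))  # -> 1
```

Note that K-NN is "lazy": there is no training step at all; the whole training set is kept and consulted at prediction time.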

Prediction: In statistics and machine learning terminology, finding the
relationship between two numeric quantities, like area and rent, is called
regression. The difference between binary classification and regression is just
the type of the output: where binary classification produces 0's and 1's,
regression produces decimal numbers. We usually work with many features in
the data, in which case it is called "multivariate linear regression". Here the
hypothesis looks like
hθ(x) = θ0 + θ1x1 + θ2x2 + θ3x3 + ⋯ + θnxn

In order to develop intuition about this function, we can think of θ0 as the basic
price of a house, θ1 as the price per square meter, θ2 as the price per floor, etc.;
x1 will be the number of square meters in the house, x2 the number of floors, etc.
For two-feature data the term reduces to

hθ(x) = θ0 + θ1x1 + θ2x2 ….(1)

Using the definition of matrix multiplication, our multivariable hypothesis
function can be concisely represented as

hθ(x) = θᵀX = θ0x0 + θ1x1 + θ2x2 + θ3x3 + ⋯ + θnxn

where θ is the parameter vector for the n features and X is the training example,
with its initial term x0 set to one and the other entries set to the values of the
corresponding features.

Finding the parameter vector θ: To find the optimal value of the parameters, we
need to minimize the error in the prediction. Intuitively, the closer the
regression output values are to the expected outputs (given in the training
data), the better the regression is. To measure the magnitude of the error, a
number of functions are used, commonly known as cost functions, such as the
L1 norm, L2 norm, etc. We usually use the L2 norm or, more commonly, the
method of least squares:

J(θ) = (1 / 2m) Σᵢ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²

Our target is to minimize J(θ) and find the corresponding values of the parameter θ.
Algorithms for finding parameters: Although there are more advanced
optimization methods, they are mostly rooted in gradient descent, which is also
the foundation for training neural nets.

 Gradient Descent: To describe gradient descent more formally: at each
step, move in the direction of steepest descent (opposite to the gradient),
proportionally to how steep the slope is. When you get to a point where the
slope is 0, you are at the bottom, so stop.

θⱼ := θⱼ − α ∂J(θ)/∂θⱼ

where, at each iteration, one should simultaneously update the parameters
θ0, θ1, ..., θn, with α as the learning rate. For linear regression the LMS (least
mean squares) rule is usually used, and it takes the form

θⱼ := θⱼ − α (1/m) Σᵢ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾) xⱼ⁽ⁱ⁾

For logistic regression, however, the cross-entropy cost function is used; in that
case the update rule turns out to have exactly the same form, with the sigmoid
as hθ.
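A minimal gradient-descent sketch for linear regression (the area/rent numbers below are invented for illustration):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.5, iterations=2000):
    """Minimize the least-squares cost J(theta) with simultaneous updates."""
    m, n = X.shape
    X = np.hstack([np.ones((m, 1)), X])     # x0 = 1 for the intercept theta0
    theta = np.zeros(n + 1)
    for _ in range(iterations):
        error = X @ theta - y               # h_theta(x) - y for every example
        theta -= alpha * (X.T @ error) / m  # update all theta_j at once
    return theta

# Hypothetical data: rent as a roughly linear function of area
area = np.array([[30.0], [45.0], [60.0], [80.0], [100.0]])
rent = np.array([310.0, 452.0, 598.0, 803.0, 1001.0])
theta = gradient_descent(area / 100.0, rent)  # feature scaled for stable steps
print(theta)  # learned [theta0, theta1] on the scaled feature
```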

 Unsupervised Learning and Neural nets:

To apply gradient descent in neural networks, we come across the
backpropagation algorithm. "Backpropagation" is neural-network terminology
for minimizing our cost function, just like what we were doing with gradient
descent in logistic and linear regression. Our goal is to compute

min_Θ J(Θ)

That is, we want to minimize our cost function J using an optimal set of
parameters Θ. In this section we look at the equations used to compute the
partial derivatives of J(Θ).

More details about how this works can be found at
http://neuralnetworksanddeeplearning.com/chap2.html
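For concreteness, here is a tiny backpropagation sketch (an illustration only; the XOR data, layer sizes and learning rate are all made up) for a two-layer network with sigmoid activations and the cross-entropy cost:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy problem: learn XOR with 4 hidden units
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)  # hidden layer parameters
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)  # output layer parameters
alpha = 1.0                                    # learning rate

for _ in range(5000):
    # Forward pass
    a1 = sigmoid(X @ W1 + b1)
    a2 = sigmoid(a1 @ W2 + b2)
    # Backward pass: propagate the error term delta layer by layer
    delta2 = a2 - y                           # dJ/dz2 for cross-entropy + sigmoid
    delta1 = (delta2 @ W2.T) * a1 * (1 - a1)  # chain rule through the hidden layer
    # Gradient-descent step on every parameter
    W2 -= alpha * a1.T @ delta2 / len(X); b2 -= alpha * delta2.mean(axis=0)
    W1 -= alpha * X.T @ delta1 / len(X);  b1 -= alpha * delta1.mean(axis=0)

print(np.round(a2.ravel(), 2))  # should approach [0, 1, 1, 0]
```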

Some optimization algorithms learnt:


 Use Of Support Vector Machines:
"Support Vector Machine" (SVM) is a supervised machine learning
algorithm which can be used for both classification and regression
challenges. However, it is mostly used in classification problems. In this
algorithm, we plot each data item as a point in n-dimensional space (where
n is the number of features you have), with the value of each feature being
the value of a particular coordinate. Then we perform classification by
finding the hyper-plane that differentiates the two classes well (as in the
snapshot).

A support vector machine is thus a frontier (a hyper-plane, or a line in two
dimensions) which best segregates the two classes. In SVM it is easy to have a
linear hyper-plane between two classes. But another burning question arises:
do we need to add such features manually to obtain a hyper-plane? No: SVM
has a technique called the kernel trick. Kernels are functions which take a
low-dimensional input space and transform it into a higher-dimensional space,
i.e. they convert a non-separable problem into a separable one. This is mostly
useful in non-linear separation problems, where the decision boundary is not
linear.

Pros and Cons associated with SVM

 Pros:
o It works really well when there is a clear margin of separation.
o It is effective in high-dimensional spaces.
o It is effective in cases where the number of dimensions is greater than
the number of samples.
o It uses a subset of training points in the decision function (called
support vectors), so it is also memory efficient.
 Cons:
o It does not perform well when we have a large data set, because the
required training time is high.
o It also does not perform very well when the data set has more noise,
i.e. when the target classes overlap.
o SVM does not directly provide probability estimates; these are
calculated using an expensive five-fold cross-validation. This is related
to the SVC method of the Python scikit-learn library.
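To make the kernel trick concrete, here is a brief sketch using scikit-learn's SVC (the ring-shaped data is invented; no linear hyper-plane separates it in the original 2-D space, but the Gaussian/RBF kernel does):

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical data: class 0 is a central blob, class 1 a ring around it
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 100)
inner = rng.normal(scale=0.5, size=(100, 2))
outer = 3 * np.c_[np.cos(angles), np.sin(angles)] + rng.normal(scale=0.2, size=(100, 2))
X = np.vstack([inner, outer])
y = np.array([0] * 100 + [1] * 100)

clf = SVC(kernel="rbf", gamma="scale")  # kernel trick: Gaussian (RBF) feature space
clf.fit(X, y)
print(clf.predict([[0.1, 0.2], [3.0, 0.0]]))  # expected: [0 1]
print(len(clf.support_))  # only a subset of points (support vectors) is stored
```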

 Use of principal component analysis: One of the common problems in the
analysis of complex data comes from the large number of variables, which
requires a large amount of memory and computational power. This is where
Principal Component Analysis (PCA) comes in. It is a technique to reduce
the dimension of the feature space by feature extraction.

Objectives of principal component analysis:
• PCA reduces the attribute space from a larger number of variables to a
smaller number of factors and as such is a "non-dependent" procedure
(that is, it does not assume a dependent variable is specified).
• PCA is a dimensionality reduction or data compression method. The
goal is dimension reduction and there is no guarantee that the
dimensions are interpretable (a fact often not appreciated by (amateur)
statisticians).
• PCA can select a subset of variables from a larger set, based on which
original variables have the highest correlations with the principal
components.

Step-by-step explanation:

Step 1: Standardization

The aim of this step is to standardize the range of the continuous initial variables so
that each one of them contributes equally to the analysis.

Mathematically, this can be done by subtracting the mean and dividing by the
standard deviation for each value of each variable: z = (value − mean) / standard deviation.

Once the standardization is done, all the variables will be transformed to the same
scale.

Step 2: Covariance Matrix computation


The aim of this step is to understand how the variables of the input data set are
varying from the mean with respect to each other, or in other words, to see if there
is any relationship between them.

The covariance matrix is a p × p symmetric matrix (where p is the number of
dimensions) that has as entries the covariances associated with all possible pairs
of the initial variables. For example, for a 3-dimensional data set with three
variables x, y and z, the covariance matrix is a 3 × 3 matrix of this form:

| Cov(x,x)  Cov(x,y)  Cov(x,z) |
| Cov(y,x)  Cov(y,y)  Cov(y,z) |
| Cov(z,x)  Cov(z,y)  Cov(z,z) |

Covariance matrix for 3-dimensional data.

Since the covariance of a variable with itself is its variance (Cov(a,a) = Var(a)),
on the main diagonal (top left to bottom right) we actually have the variances of
each initial variable. And since covariance is commutative (Cov(a,b) = Cov(b,a)),
the entries of the covariance matrix are symmetric with respect to the main
diagonal, which means that the upper and lower triangular portions are equal.

Step 3: Compute the eigenvectors and eigenvalues of the covariance matrix to
identify the principal components

Principal components are new variables that are constructed as linear
combinations or mixtures of the initial variables.
Trying to capture as much variance as possible is common practice in statistics
when replacing original variables with fewer new variables “to account for a high
percentage of the variance in the original dataset.” It makes sense intuitively,
because we want to discard similar features, but only keep features with
maximum dissimilarity when we are considering reducing the dimension of the
dataset.
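The three steps translate almost line for line into code. A minimal NumPy sketch (the correlated data set is made up for illustration):

```python
import numpy as np

def pca(X, k=2):
    """Steps 1-3 above: standardize, covariance matrix, eigendecomposition."""
    # Step 1: standardization - subtract the mean, divide by the std
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: p x p covariance matrix of the standardized data
    C = np.cov(Z, rowvar=False)
    # Step 3: eigenvalues/eigenvectors; eigh is meant for symmetric matrices
    eigenvalues, eigenvectors = np.linalg.eigh(C)
    order = np.argsort(eigenvalues)[::-1]      # largest variance first
    components = eigenvectors[:, order[:k]]
    return Z @ components, eigenvalues[order]  # projected data, sorted variances

# Hypothetical 3-variable data: the second column nearly duplicates the first
rng = np.random.default_rng(1)
x = rng.normal(size=200)
X = np.c_[x, 2 * x + rng.normal(scale=0.1, size=200), rng.normal(size=200)]
scores, variances = pca(X, k=2)
print(variances)  # most of the variance concentrates in the first component
```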

 Reinforcement Learning :

Reinforcement learning is the training of machine learning models to make a
sequence of decisions. The agent learns to achieve a goal in an uncertain,
potentially complex environment. In reinforcement learning, an artificial
potentially complex environment. In reinforcement learning, an artificial
intelligence faces a game-like situation. The computer employs trial and error
to come up with a solution to the problem. To get the machine to do what
the programmer wants, the artificial intelligence gets either rewards
or penalties for the actions it performs. Its goal is to maximize the total
reward. In contrast to human beings, artificial intelligence can gather
experience from thousands of parallel gameplays if a reinforcement learning
algorithm is run on a sufficiently powerful computer infrastructure.

Examples of Reinforcement learning:

Applications of reinforcement learning were in the past limited by weak computer
infrastructure. However, reinforcement learning has found its place in the
modern-day world. Training the models that control autonomous cars is an
excellent example of a potential application of reinforcement learning.

In usual circumstances we would require an autonomous vehicle to put safety
first, minimize ride time, reduce pollution, offer passengers comfort and obey
the rules of law. With an autonomous race car, on the other hand, we would
emphasize speed much more than the driver's comfort. The programmer cannot
predict everything that could happen on the road. Instead of building lengthy
"if-then" instructions, the programmer prepares the reinforcement learning agent
to be capable of learning from the system of rewards and penalties.

Exact relation mapping between Machine learning and Deep Learning :

In fact, there should be no clear divide between machine learning, deep learning
and reinforcement learning. It is like a parallelogram-rectangle-square relation,
where machine learning is the broadest category and deep reinforcement
learning the narrowest one.

Although the ideas seem to differ, there is no sharp divide between these
subtypes. Moreover, they merge within projects, as the models are designed not
to stick to a “pure type” but to perform the task in the most effective way
possible.

Basic definitions widely used in Reinforcement Learning:

Reinforcement learning can be understood using the concepts of agents,
environments, states, actions and rewards.

 Agent: An agent takes actions; for example, a drone making a delivery, or
Super Mario navigating a video game. The algorithm is the agent. In life, the
agent is you.
 Action: Action is the set of all possible moves the agent can make. An
action is almost self-explanatory, but it should be noted that agents choose
among a list of possible actions. In video games, the list might include
running right or left, jumping high or low, crouching or standing still.
 Discount factor: The discount factor is multiplied by future rewards as
discovered by the agent in order to dampen these rewards' effect on the
agent's choice of action. Why? It is designed to make future rewards worth
less than immediate rewards; i.e. it enforces a kind of short-term hedonism
in the agent.
 Environment: The world through which the agent moves. The environment
takes the agent’s current state and action as input, and returns as output
the agent’s reward and its next state.
 State: A state is a concrete and immediate situation in which the agent
finds itself; i.e. a specific place and moment, an instantaneous
configuration that puts the agent in relation to other significant things such
as tools, obstacles, enemies or prizes.
 Reward: A reward is the feedback by which we measure the success or
failure of an agent’s actions. For example, in a video game, when Mario
touches a coin, he wins points. From any given state, an agent sends output
in the form of actions to the environment, and the environment returns the
agent’s new state (which resulted from acting on the previous state) as well
as rewards, if there are any. They effectively evaluate the agent’s action.
 Policy: The policy is the strategy that the agent employs to determine the
next action based on the current state. It maps states to actions, the
actions that promise the highest reward.
 Value: The expected long-term return with discount, as opposed to the
short-term reward R. Vπ(s) is defined as the expected long-term return of
the current state under policy π. We discount rewards, or lower their
estimated value, the further into the future they occur.
 Q-value or action-value: Q-value is similar to Value, except that it takes an
extra parameter, the current action a. Qπ(s, a) refers to the long-term
return of the current state s, taking action a under policy π.
 Trajectory: A sequence of states and actions that influence those states.
From the Latin “to throw across.” The life of an agent is but a ball tossed
high and arching through space-time.
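These concepts fit together in tabular Q-learning. The sketch below (a minimal illustration, not from the report; the one-dimensional grid world and the parameters are invented) updates Q(s, a) from the reward and the discounted value of the next state, and reads the learned policy off the Q-table:

```python
import numpy as np

# Hypothetical 1-D grid world: states 0..4, reward only at the right end
n_states, n_actions = 5, 2             # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))    # Q-value table Q(s, a)
alpha, gamma, epsilon = 0.5, 0.9, 0.1  # learning rate, discount factor, exploration
rng = np.random.default_rng(0)

def step(state, action):
    """Environment: maps (state, action) to (next_state, reward)."""
    next_state = max(0, min(n_states - 1, state + (1 if action == 1 else -1)))
    reward = 1.0 if next_state == n_states - 1 else 0.0
    return next_state, reward

for episode in range(200):
    state = 0
    while state != n_states - 1:
        # Epsilon-greedy policy: mostly exploit Q, sometimes explore
        if rng.random() < epsilon:
            action = int(rng.integers(n_actions))
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward = step(state, action)
        # Q-learning update: reward plus discounted value of the best next action
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state

print(np.argmax(Q, axis=1))  # learned policy: should favour action 1 (right)
```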

The Procedure in brief:

So environments are functions that transform an action taken in the current
state into the next state and a reward; agents are functions that transform the
new state and reward into the next action. In the feedback loop above, the
subscripts denote the time steps t and t+1, each of which refers to a different
state: the state at moment t, and the state at moment t+1. Unlike other forms
of machine learning, such as supervised and unsupervised learning,
reinforcement learning can only be thought about sequentially, in terms of
state-action pairs that occur one after the other.

Here is an example of an objective function for reinforcement learning, i.e. the
way it defines its goal:

G = Σₜ r(x(t), a(t))

We are summing the reward function r over t, which stands for time steps. So
this objective function calculates all the reward we could obtain by running
through, say, a game. Here, x(t) is the state at a given time step, a(t) is the
action taken in that state, and r is the reward function for x and a.
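Combining this sum with the discount factor from the definitions above gives the discounted return; a short sketch with a made-up reward sequence:

```python
# Discounted return: G = sum over t of gamma^t * r_t (rewards are hypothetical)
rewards = [0.0, 0.0, 1.0, 0.0, 5.0]
gamma = 0.9
G = sum(gamma ** t * r for t, r in enumerate(rewards))
print(G)  # 0.9**2 * 1 + 0.9**4 * 5 = 0.81 + 3.2805 = 4.0905
```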

Footnote:

The correct analogy may actually be that a learning algorithm is like a species. Each simulation the
algorithm runs as it learns could be considered an individual of the species. Just as knowledge from the
algorithm’s runs through the game is collected in the algorithm’s model of the world, the individual
humans of any group will report back via language, allowing the collective’s model of the world,
embodied in its texts, records and oral traditions, to become more intelligent (At least in the ideal case.
The subversion and noise introduced into our collective models is a topic for another post, and probably
for another website entirely.). This puts a finer point on why the contest between algorithms and
individual humans, even when the humans are world champions, is unfair. We are pitting a civilization
that has accumulated the wisdom of 10,000 lives against a single sack of flesh.
References:

 Machine Learning course by Andrew Ng on Coursera.
 Stanford lectures by Serena Yeung and co.
 http://www.neuralnetworksanddeeplearning.com/ - book by Michael Nielsen.
 Deep Learning by Mason Simon.
 https://www.freecodecamp.org/news/an-introduction-to-reinforcement-learning-4339519de419/
 https://skymind.ai/wiki/deep-reinforcement-learning
