
Unit-3 Artificial Intelligence


UNIT-3, Introduction to Machine Learning

Machine learning (ML) is the study of computer algorithms that can improve
automatically through experience and the use of data. It is seen as a part of artificial
intelligence. Machine learning algorithms build a model based on sample data,
known as training data, in order to make predictions or decisions without being
explicitly programmed to do so. Machine learning algorithms are used in a wide
variety of applications, such as medicine, email filtering, speech recognition, and
computer vision, where it is difficult or infeasible to develop conventional algorithms
to perform the needed tasks.

A subset of machine learning is closely related to computational statistics, which
focuses on making predictions using computers; but not all machine learning is
statistical learning. The study of mathematical optimization delivers methods, theory,
and application domains to the field of machine learning. Data mining is a related
field of study, focusing on exploratory data analysis through unsupervised learning.
Some implementations of machine learning use data and neural networks in a way
that mimics the working of a biological brain. In its application across business
problems, machine learning is also referred to as predictive analytics.

History:

The term machine learning was coined in 1959 by Arthur Samuel, an American
IBMer and a pioneer in the field of computer gaming and artificial intelligence. The
synonym self-teaching computers was also used at this time. A representative book
of machine learning research during the 1960s was Nilsson's book on Learning
Machines, dealing mostly with machine learning for pattern classification.
Interest related to pattern recognition continued into the 1970s, as described by
Duda and Hart in 1973. In 1981 a report was given on using teaching strategies so
that a neural network learns to recognize 40 characters (26 letters, 10 digits, and 4
special symbols) from a computer terminal.

Tom M. Mitchell provided a widely quoted, more formal definition of the algorithms
studied in the machine learning field: “A computer program is said to learn from
experience E with respect to some class of tasks T and performance measure P if its
performance at tasks in T, as measured by P, improves with experience E.” This
definition of the tasks in which machine learning is concerned offers a fundamentally
operational definition rather than defining the field in cognitive terms. This follows
Alan Turing’s proposal in his paper “Computing Machinery and Intelligence”, in which
the question “Can machines think?” is replaced with the question “Can machines do
what we (as thinking entities) can do?”.

Modern-day machine learning has two objectives: one is to classify data based on
models which have been developed; the other is to make predictions for future
outcomes based on these models. A hypothetical algorithm specific to classifying
data may use computer vision of moles coupled with supervised learning to train it
to classify cancerous moles, whereas a machine learning algorithm for stock trading
may inform the trader of likely future price movements.

Artificial intelligence

As a scientific endeavor, machine learning grew out of the quest for artificial
intelligence. In the early days of AI as an academic discipline, some researchers
were interested in having machines learn from data. They attempted to approach
the problem with various symbolic methods, as well as what was then termed
“neural networks”; these were mostly perceptrons and other models that were later
found to be reinventions of the generalized linear models of statistics. Probabilistic
reasoning was also employed, especially in automated medical diagnosis.

However, an increasing emphasis on the logical, knowledge-based approach caused
a rift between AI and machine learning. Probabilistic systems were plagued by
theoretical and practical problems of data acquisition and representation.  By 1980,
expert systems had come to dominate AI, and statistics were out of favor. Work on
symbolic/knowledge-based learning did continue within AI, leading to inductive logic
programming, but the more statistical line of research was now outside the field of
AI proper, in pattern recognition and information retrieval.  Neural networks research
had been abandoned by AI and computer science around the same time. This line,
too, was continued outside the AI/CS field, as “connectionism”, by researchers from
other disciplines including Hopfield, Rumelhart and Hinton. Their main success came
in the mid-1980s with the reinvention of backpropagation.

Machine learning (ML), reorganized as a separate field, started to flourish in the
1990s. The field changed its goal from achieving artificial intelligence to tackling
solvable problems of a practical nature. It shifted focus away from the symbolic
approaches it had inherited from AI, and toward methods and models borrowed
from statistics and probability theory.

The difference between ML and AI is frequently misunderstood. ML learns and
predicts based on passive observations, whereas AI implies an agent interacting with
the environment to learn and take actions that maximize its chance of successfully
achieving its goals.

As of 2020, many sources continue to assert that ML remains a subfield of AI.


Others have the view that not all ML is part of AI, but only an ‘intelligent subset’ of
ML should be considered AI.

Data mining

Machine learning and data mining often employ the same methods and overlap
significantly, but while machine learning focuses on prediction, based on known
properties learned from the training data, data mining focuses on the discovery of
(previously) unknown properties in the data (this is the analysis step of knowledge
discovery in databases). Data mining uses many machine learning methods, but with
different goals; on the other hand, machine learning also employs data mining
methods such as “Unsupervised Learning” or as a pre-processing step to improve
learner accuracy. Much of the confusion between these two research communities
(which do often have separate conferences and separate journals, ECML PKDD being
a major exception) comes from the basic assumptions they work with: in machine
learning, performance is usually evaluated with respect to the ability to reproduce
known knowledge, while in knowledge discovery and data mining (KDD) the key task
is the discovery of previously unknown knowledge. Evaluated with respect to known
knowledge, an uninformed (unsupervised) method will easily be outperformed by
other supervised methods, while in a typical KDD task, supervised methods cannot
be used due to the unavailability of training data.

Optimization

Machine learning also has intimate ties to optimization: many learning problems are
formulated as minimization of some loss function on a training set of examples. Loss
functions express the discrepancy between the predictions of the model being
trained and the actual problem instances (for example, in classification, one wants to
assign a label to instances, and models are trained to correctly predict the pre-
assigned labels of a set of examples).
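To make the loss-minimization view concrete, here is a minimal sketch (not from the original text) that fits a single-parameter linear model by gradient descent on a mean-squared-error loss; the data, learning rate, and step count are made up purely for illustration.

import numpy as np

# Toy training set: y is roughly 3*x plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
y = 3.0 * x + rng.normal(0, 0.1, 100)

w = 0.0        # model parameter to learn
lr = 0.1       # learning rate
for step in range(500):
    pred = w * x                              # model predictions on the training set
    grad = 2 * np.mean((pred - y) * x)        # gradient of the mean squared error w.r.t. w
    w -= lr * grad                            # gradient descent update
print(round(w, 2))                            # approximately 3.0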

Generalization

The difference between optimization and machine learning arises from the goal of
generalization: while optimization algorithms can minimize the loss on a training set,
machine learning is concerned with minimizing the loss on unseen samples.
Characterizing the generalization of various learning algorithms is an active topic of
current research, especially for deep learning algorithms.

Statistics

Machine learning and statistics are closely related fields in terms of methods, but
distinct in their principal goal: statistics draw population inferences from a sample,
while machine learning finds generalizable predictive patterns. According to Michael
I. Jordan, the ideas of machine learning, from methodological principles to
theoretical tools, have had a long pre-history in statistics. He also suggested the
term data science as a placeholder to call the overall field.

Leo Breiman distinguished two statistical modeling paradigms: data model and
algorithmic model, wherein “algorithmic model” means more or less the machine
learning algorithms like Random Forest.

Some statisticians have adopted methods from machine learning, leading to a
combined field that they call statistical learning.

Approaches
Machine learning approaches are traditionally divided into four broad categories,
depending on the nature of the “Signal” or “Feedback” available to the learning
system:

“Machine learning is a subset of AI, which enables the machine to automatically
learn from data, improve performance from past experiences, and make
predictions. Machine learning contains a set of algorithms that work on a huge
amount of data. Data is fed to these algorithms to train them, and based on training,
they build the model & perform a specific task.”

Based on the methods and way of learning, machine learning is divided
into mainly four types, which are:

1. Supervised Machine Learning
2. Unsupervised Machine Learning
3. Semi-Supervised Machine Learning
4. Reinforcement Learning

1. Supervised Machine Learning


As its name suggests, supervised machine learning is based on supervision. In the
supervised learning technique, we train the machines using a "labelled" dataset, and
based on that training, the machine predicts the output. Here, the labelled data
specifies that some of the inputs are already mapped to the output. More precisely,
we first train the machine with the input and corresponding output, and then we ask
the machine to predict the output using the test dataset.

Let's understand supervised learning with an example. Suppose we have an input
dataset of cat and dog images. First, we train the machine to recognise the images
using features such as the shape and size of the tail, the shape of the eyes, colour,
and height (dogs are taller, cats are smaller). After completion of training, we input
the picture of a cat and ask the machine to identify the object and predict the
output. Since the machine is now well trained, it will check all the features of the
object, such as height, shape, colour, eyes, ears, and tail, and find that it is a cat, so
it will put it in the Cat category. This is how the machine identifies objects in
Supervised Learning.

The main goal of the supervised learning technique is to map the input
variable(x) with the output variable(y). Some real-world applications of supervised
learning are Risk Assessment, Fraud Detection, Spam filtering, etc.

The computer is presented with example inputs and their desired outputs,
given by a “teacher”, and the goal is to learn a general rule that maps
inputs to outputs.
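As an illustration only, the cat/dog example above can be sketched with scikit-learn (assuming it is available); the two numeric features (height and tail length) and the labels below are hypothetical, chosen just to show the train-then-predict workflow described above.

from sklearn.tree import DecisionTreeClassifier

# Hypothetical labelled training data: [height_cm, tail_length_cm]
X_train = [[55, 30], [60, 35], [50, 28],   # dogs
           [25, 22], [23, 20], [27, 25]]   # cats
y_train = ["dog", "dog", "dog", "cat", "cat", "cat"]

clf = DecisionTreeClassifier()        # the "teacher" is the labelled data
clf.fit(X_train, y_train)             # training phase

# Test phase: predict the class of an unseen animal.
print(clf.predict([[24, 21]]))        # expected to print ['cat']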

Categories of Supervised Machine Learning


Supervised machine learning can be classified into two types of problems, which are
given below:

o Classification
o Regression

a) Classification

Classification algorithms are used to solve the classification problems in which the
output variable is categorical, such as "Yes" or No, Male or Female, Red or Blue,
etc. The classification algorithms predict the categories present in the dataset. Some
real-world examples of classification algorithms are Spam Detection, Email
filtering, etc.

Some popular classification algorithms are given below:

o Random Forest Algorithm


o Decision Tree Algorithm
o Logistic Regression Algorithm
o Support Vector Machine Algorithm

b) Regression

Regression algorithms are used to solve regression problems, in which there is a
relationship (often modelled as linear) between the input and output variables. They
are used to predict continuous output variables, such as market trends, weather
prediction, etc.

Some popular Regression algorithms are given below:

o Simple Linear Regression Algorithm


o Multivariate Regression Algorithm
o Decision Tree Algorithm
o Lasso Regression

Advantages and Disadvantages of Supervised Learning


Advantages:

o Since supervised learning works with a labelled dataset, we can have an exact idea
about the classes of objects.
o These algorithms are helpful in predicting the output based on prior experience.

Disadvantages:

o These algorithms are not able to solve complex tasks.


o It may predict the wrong output if the test data is different from the training data.
o It requires lots of computational time to train the algorithm.

Applications of Supervised Learning


Some common applications of Supervised Learning are given below:

o Image Segmentation:
Supervised Learning algorithms are used in image segmentation. In this process,
image classification is performed on different image data with pre-defined labels.
o Medical Diagnosis:
Supervised algorithms are also used in the medical field for diagnosis purposes. This is
done by using medical images and past data labelled with disease conditions. With
such a process, the machine can identify a disease for new patients.
o Fraud Detection - Supervised Learning classification algorithms are used for
identifying fraud transactions, fraud customers, etc. It is done by using historic data
to identify the patterns that can lead to possible fraud.
o Spam detection - In spam detection & filtering, classification algorithms are used.
These algorithms classify an email as spam or not spam. The spam emails are sent to
the spam folder.
o Speech Recognition - Supervised learning algorithms are also used in speech
recognition. The algorithm is trained with voice data, and various identifications can
be done using the same, such as voice-activated passwords, voice commands, etc.

2. Unsupervised Machine Learning


Unsupervised learning is different from the Supervised learning technique; as its
name suggests, there is no need for supervision. It means, in unsupervised machine
learning, the machine is trained using the unlabelled dataset, and the machine
predicts the output without any supervision.

In unsupervised learning, the models are trained with the data that is neither
classified nor labelled, and the model acts on that data without any supervision.

The main aim of the unsupervised learning algorithm is to group or categorize
the unsorted dataset according to the similarities, patterns, and
differences. Machines are instructed to find the hidden patterns in the input
dataset.

Let's take an example to understand it more precisely. Suppose there is a basket of
fruit images, and we input it into the machine learning model. The images are totally
unknown to the model, and the task of the machine is to find the patterns and
categories of the objects.

So the machine will discover its own patterns and differences, such as colour
differences and shape differences, and predict the output when it is tested with the
test dataset.

No labels are given to the learning algorithm, leaving it on its own to find structure
in its input. Unsupervised learning can be a goal in itself (discovering hidden
patterns in data) or a means towards an end (feature learning).
Categories of Unsupervised Machine Learning
Unsupervised Learning can be further classified into two types, which are given
below:

o Clustering
o Association

1) Clustering

The clustering technique is used when we want to find the inherent groups from the
data. It is a way to group the objects into a cluster such that the objects with the
most similarities remain in one group and have fewer or no similarities with the
objects of other groups. An example of the clustering algorithm is grouping the
customers by their purchasing behaviour.

Some of the popular clustering algorithms are given below:

o K-Means Clustering algorithm


o Mean-shift algorithm
o DBSCAN Algorithm
o Principal Component Analysis
o Independent Component Analysis

2) Association

Association rule learning is an unsupervised learning technique, which finds
interesting relations among variables within a large dataset. The main aim of this
learning algorithm is to find the dependency of one data item on another data item
and map those variables accordingly so that it can generate maximum profit. This
algorithm is mainly applied in Market Basket analysis, Web usage mining,
continuous production, etc.

Some popular algorithms of Association rule learning are the Apriori algorithm, Eclat,
and the FP-growth algorithm.
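As a toy illustration of what association rule learning computes (a hand-rolled sketch, not the Apriori algorithm itself), the snippet below measures the support and confidence of one hypothetical rule over made-up transactions; Apriori and FP-growth additionally search and prune the space of itemsets efficiently.

# Toy transactions for market-basket analysis.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"milk"},
]
n = len(transactions)

def support(itemset):
    # Fraction of transactions containing every item in the itemset.
    return sum(itemset <= t for t in transactions) / n

# Rule {bread} -> {butter}: confidence = support(both) / support(antecedent)
antecedent, consequent = {"bread"}, {"butter"}
conf = support(antecedent | consequent) / support(antecedent)
print(support(antecedent | consequent), conf)   # support 0.5, confidence ~0.67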

Advantages and Disadvantages of Unsupervised Learning Algorithm

Advantages:
o These algorithms can be used for complicated tasks compared to the supervised
ones because these algorithms work on the unlabelled dataset.
o Unsupervised algorithms are preferable for various tasks as getting the unlabelled
dataset is easier as compared to the labelled dataset.

Disadvantages:

o The output of an unsupervised algorithm can be less accurate, as the dataset is not
labelled and the algorithms are not trained with the exact output in advance.
o Working with Unsupervised learning is more difficult as it works with the unlabelled
dataset that does not map with the output.

Applications of Unsupervised Learning


o Network Analysis: Unsupervised learning is used to identify plagiarism and
copyright issues in document network analysis of text data from scholarly articles.
o Recommendation Systems: Recommendation systems widely use unsupervised
learning techniques for building recommendation applications for different web
applications and e-commerce websites.
o Anomaly Detection: Anomaly detection is a popular application of unsupervised
learning, which can identify unusual data points within the dataset. It is used to
discover fraudulent transactions.
o Singular Value Decomposition: Singular Value Decomposition or SVD is used to
extract information from the database. For example, extracting information of each
user located at a particular location.

3. Semi-Supervised Learning
Semi-Supervised learning is a type of Machine Learning algorithm that lies
between Supervised and Unsupervised machine learning. It represents the
intermediate ground between Supervised (With Labelled training data) and
Unsupervised learning (with no labelled training data) algorithms and uses the
combination of labelled and unlabelled datasets during the training period.

Although semi-supervised learning is the middle ground between supervised and
unsupervised learning and operates on data that contains a few labels, it mostly
consists of unlabelled data. Labels are costly to obtain, but for corporate purposes a
few labels may be available. Semi-supervised learning is thus different from
supervised and unsupervised learning, which are defined by the presence and
absence of labels.
To overcome the drawbacks of supervised learning and unsupervised learning
algorithms, the concept of Semi-supervised learning is introduced. The main aim
of semi-supervised learning is to effectively use all the available data, rather than
only labelled data as in supervised learning. Initially, similar data is clustered using
an unsupervised learning algorithm, and this clustering then helps to label the
unlabelled data. This is done because labelled data is comparatively more expensive
to acquire than unlabelled data.

We can imagine these algorithms with an example. Supervised learning is where a
student is under the supervision of an instructor at home and at college. Further, if
that student analyses the same concept on their own without any help from the
instructor, it comes under unsupervised learning. Under semi-supervised learning,
the student has to revise on their own after analysing the same concept under the
guidance of an instructor at college.
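A minimal sketch of this idea, assuming scikit-learn is available: a label-spreading model is trained on a mostly unlabelled, made-up dataset, where the value -1 marks the unlabelled samples.

import numpy as np
from sklearn.semi_supervised import LabelSpreading

# Hypothetical 1-D feature; -1 marks the unlabelled samples.
X = np.array([[1.0], [1.2], [0.9], [5.0], [5.3], [4.8], [1.1], [5.1]])
y = np.array([0,      0,    -1,    1,    -1,    -1,    -1,    -1])

model = LabelSpreading()
model.fit(X, y)                      # uses both labelled and unlabelled points
print(model.transduction_)           # inferred labels for every sample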

Advantages and Disadvantages of Semi-supervised Learning

Advantages:

o It is simple and easy to understand the algorithm.


o It is highly efficient.
o It is used to solve drawbacks of Supervised and Unsupervised Learning algorithms.

Disadvantages:

o Iterations results may not be stable.


o We cannot apply these algorithms to network-level data.
o Accuracy is low.

4. Reinforcement Learning
Reinforcement learning works on a feedback-based process, in which an AI
agent (a software component) automatically explores its surroundings by hit
and trial, acting, learning from experience, and improving its
performance. The agent gets rewarded for each good action and punished for each
bad action; hence the goal of a reinforcement learning agent is to maximize the
rewards.

In reinforcement learning, there is no labelled data as in supervised learning;
agents learn from their experiences only. The reinforcement learning process is
similar to how a human being learns; for example, a child learns various things by
experience in day-to-day life. An example of reinforcement learning is playing a
game, where the game is the environment, the moves of the agent at each step
define states, and the goal of the agent is to get a high score. The agent receives
feedback in terms of punishments and rewards.

Due to its way of working, reinforcement learning is employed in different fields such
as game theory, operations research, information theory, and multi-agent systems.

A reinforcement learning problem can be formalized using a Markov Decision
Process (MDP). In an MDP, the agent constantly interacts with the environment and
performs actions; after each action, the environment responds and generates a new
state.

A computer program interacts with a dynamic environment in which it must achieve
a certain goal (such as driving a vehicle or playing a game against an opponent). As
it navigates its problem space, the program is provided feedback that's analogous to
rewards, which it tries to maximize.
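The sketch below is a toy illustration (not from the original text) of tabular Q-learning on a hypothetical five-state corridor MDP: the agent is rewarded only for reaching the last state, and the Q-table is updated from the feedback it receives.

import numpy as np

# Toy corridor MDP: states 0..4, actions 0=left, 1=right; reward 1 for reaching state 4.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.5, 0.9, 0.2
rng = np.random.default_rng(0)

for episode in range(200):
    s = 0
    while s != 4:                                   # state 4 is terminal
        # epsilon-greedy action selection
        a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
        s_next = max(s - 1, 0) if a == 0 else s + 1
        r = 1.0 if s_next == 4 else 0.0
        # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(np.round(Q, 2))    # "right" actions end up with the highest values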

Categories of Reinforcement Learning


Reinforcement learning is categorized mainly into two types of methods/algorithms:

o Positive Reinforcement Learning: Positive reinforcement learning specifies
increasing the tendency that the required behaviour would occur again by adding
something. It enhances the strength of the behaviour of the agent and positively
impacts it.
o Negative Reinforcement Learning: Negative reinforcement learning works exactly
opposite to the positive RL. It increases the tendency that the specific behaviour
would occur again by avoiding the negative condition.

Real-world Use cases of Reinforcement Learning


o Video Games:
RL algorithms are very popular in gaming applications, where they are used to
achieve super-human performance. Some popular game-playing systems that use RL
algorithms are AlphaGo and AlphaGo Zero.
o Resource Management:
The "Resource Management with Deep Reinforcement Learning" paper showed how
to use RL in computer systems to automatically learn to schedule resources among
waiting jobs so as to minimize average job slowdown.
o Robotics:
RL is widely being used in Robotics applications. Robots are used in the industrial and
manufacturing area, and these robots are made more powerful with reinforcement
learning. There are different industries that have their vision of building intelligent
robots using AI and Machine learning technology.
o Text Mining:
Text mining, one of the great applications of NLP, is now being implemented with
the help of reinforcement learning by Salesforce.

Advantages and Disadvantages of Reinforcement Learning


Advantages

o It helps in solving complex real-world problems that are difficult to solve with
general techniques.
o The learning model of RL is similar to human learning; hence highly accurate
results can be obtained.
o It helps in achieving long-term results.

Disadvantages

o RL algorithms are not preferred for simple problems.


o RL algorithms require huge data and computations.
o Too much reinforcement learning can lead to an overload of states which can weaken
the results.

The curse of dimensionality limits reinforcement learning for real physical systems.

Framework for Building ML Systems, KDD Process Model
A machine learning framework is an interface that allows developers to build and
deploy machine learning models faster and easier. A tool like this allows enterprises
to scale their machine learning efforts securely while maintaining a healthy ML
lifecycle.

Features of machine learning framework

A machine learning framework allows enterprises to deploy, manage, and scale their
machine learning portfolio. Algorithmia, for example, offers a fast route to deployment
and makes it easy to securely govern machine learning operations with a healthy ML
lifecycle.

With Algorithmia, you can connect your data and pre-trained models, deploy and
serve as APIs, manage your models and monitor performance, and secure your
machine learning portfolio as it scales.

Connectivity

A flexible machine learning framework connects to all necessary data sources in one
secure, central location for reusable, repeatable, and collaborative model
management.

 Manage source code by pushing models into production directly from the code
repository
 Control data access by running models close to connectors and data sources for
optimal security
 Deploy models from wherever they are with seamless infrastructure management

Deployment

Machine learning models only achieve value once they reach production. Efficient
deployment capabilities reduce the time it takes your organization to get a return on
your ML investment.

 Deploy in any language and any format with flexible tooling capabilities.
 Serve models with a git push to a highly scalable API in seconds.
 Version models automatically with a framework that compares and updates models
while maintaining a dependable version for calls.

Management

Manage MLOps using access controls and governance features that secure and audit
the machine learning models you have in production.

 Split machine learning workflows into reusable, independent parts and pipeline them
together with a microservices architecture.
 Operate your ML portfolio from one, secure location to prevent work silos with a
robust ML management system.
 Protect your models with access control.
 Usage reporting allows you to gain full visibility into server use, model consumption,
and call details to control costs.

Scaling

A properly scaled machine learning lifecycle scales on demand, operates at peak
performance, and continuously delivers value from one MLOps center.

 Serverless scaling allows you to scale models on demand without latency concerns,
providing CPU and GPU support.
 Reduce data security vulnerabilities by access controlling your model management
system.
 Govern models and test model performance for speed, accuracy, and drift
 Multi-cloud flexibility provides the options to deploy on Algorithmia, the cloud, or on-
prem to keep models near data sources.

Popular machine learning frameworks

Arguably, TensorFlow, PyTorch, and scikit-learn are the most popular ML
frameworks. Still, choosing which framework to use will depend on the work you're
trying to perform. TensorFlow and PyTorch are oriented towards neural network
training (deep learning), whereas scikit-learn is oriented towards mathematics and
statistical modelling (classical machine learning).

 TensorFlow and PyTorch are direct competitors because of their similarity. They both
provide a rich set of linear algebra tools, and they can run regression analysis.
 Scikit-learn has been around a long time and would be most familiar to R
programmers, but it comes with a big caveat: it is not built to run across a cluster.
 Spark ML is built for running on a cluster, since that is what Apache Spark is all
about.

KDD Process Model

Data Cleaning: Data cleaning is defined as the removal of noisy and irrelevant data
from the collection.

 Cleaning in case of Missing values.


 Cleaning noisy data, where noise is a random or variance error.
 Cleaning with Data discrepancy detection and Data transformation tools.
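As a small illustration of the data cleaning step, assuming the pandas library is available, the hypothetical table below has missing values and a duplicate record that are cleaned before mining.

import pandas as pd
import numpy as np

raw = pd.DataFrame({"age": [25, np.nan, 40, 40],
                    "income": [50000, 60000, None, None]})

clean = raw.drop_duplicates()                          # remove duplicate records
clean = clean.fillna(clean.mean(numeric_only=True))    # fill missing values with column means
print(clean)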

Data Integration: Data integration is defined as heterogeneous data from multiple
sources combined into a common source.

 Data integration using Data Migration tools.


 Data integration using Data Synchronization tools.
 Data integration using the ETL (Extract, Transform, Load) process.

Data Selection: Data selection is defined as the process where data relevant to the
analysis is decided and retrieved from the data collection.
 Data selection using Neural network.
 Data selection using Decision Trees.
 Data selection using Naive bayes.
 Data selection using Clustering, Regression, etc.

Data Transformation: Data transformation is defined as the process of
transforming data into the appropriate form required by the mining procedure.

Data Transformation is a two-step process:

 Data Mapping: Assigning elements from source base to destination to capture
transformations.
 Code generation: Creation of the actual transformation program.

Data Mining: Data mining is defined as clever techniques that are applied to
extract potentially useful patterns.

 Transforms task-relevant data into patterns.
 Decides the purpose of the model using classification or characterization.

Pattern Evaluation: Pattern evaluation is defined as identifying strictly increasing
patterns representing knowledge based on given measures.

 Find the interestingness score of each pattern.
 Use summarization and visualization to make data understandable by the user.

Knowledge representation: Knowledge representation is defined as the technique
which utilizes visualization tools to represent data mining results.

 Generate reports.
 Generate tables.
 Generate discriminant rules, classification rules, characterization rules, etc.

Supervised Learning: Introduction to Classification
Supervised learning (SL) is the machine learning task of learning a function that
maps an input to an output based on example input-output pairs. It infers a function
from labeled training data consisting of a set of training examples. In supervised
learning, each example is a pair consisting of an input object (typically a vector) and
a desired output value (also called the supervisory signal). A supervised learning
algorithm analyzes the training data and produces an inferred function, which can be
used for mapping new examples. An optimal scenario will allow for the algorithm to
correctly determine the class labels for unseen instances. This requires the learning
algorithm to generalize from the training data to unseen situations in a “reasonable”
way (see inductive bias). This statistical quality of an algorithm is measured through
the so-called generalization error.
The parallel task in human and animal psychology is often referred to as concept
learning.

Steps to follow

To solve a given problem of supervised learning, one has to perform the following
steps:

 Determine the type of training examples. Before doing anything else, the user
should decide what kind of data is to be used as a training set. In the case of
handwriting analysis, for example, this might be a single handwritten character, an
entire handwritten word, an entire sentence of handwriting or perhaps a full
paragraph of handwriting.
 Gather a training set. The training set needs to be representative of the real-world
use of the function. Thus, a set of input objects is gathered and corresponding
outputs are also gathered, either from human experts or from measurements.
 Determine the input feature representation of the learned function. The
accuracy of the learned function depends strongly on how the input object is
represented. Typically, the input object is transformed into a feature vector, which
contains a number of features that are descriptive of the object. The number of
features should not be too large, because of the curse of dimensionality; but should
contain enough information to accurately predict the output.
 Determine the structure of the learned function and corresponding
learning algorithm. For example, the engineer may choose to use support-vector
machines or decision trees.
 Complete the design. Run the learning algorithm on the gathered training set.
Some supervised learning algorithms require the user to determine certain control
parameters. These parameters may be adjusted by optimizing performance on a
subset (called a validation set) of the training set, or via cross-validation.
 Evaluate the accuracy of the learned function. After parameter adjustment and
learning, the performance of the resulting function should be measured on a test set
that is separate from the training set.
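A minimal end-to-end sketch of these steps using scikit-learn (an assumption, not part of the original text): the built-in iris dataset stands in for the gathered training set, a support-vector machine is the chosen structure, cross-validation is used for tuning, and a held-out test set gives the final accuracy.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)            # gather a labelled training set

# Hold out a separate test set for the final evaluation step.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = SVC(kernel="rbf", C=1.0)             # chosen structure: support-vector machine
print(cross_val_score(model, X_train, y_train, cv=5).mean())   # tuning via cross-validation

model.fit(X_train, y_train)
print(model.score(X_test, y_test))           # accuracy on the unseen test set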

Factors to consider

Other factors to consider when choosing and applying a learning algorithm include
the following:

 Heterogeneity of the data. If the feature vectors include features of many


different kinds (discrete, discrete ordered, counts, continuous values), some
algorithms are easier to apply than others. Many algorithms, including support-vector
machines, linear regression, logistic regression, neural networks, and nearest
neighbor methods, require that the input features be numerical and scaled to similar
ranges (e.g., to the [-1,1] interval). Methods that employ a distance function, such
as nearest neighbor methods and support-vector machines with Gaussian kernels,
are particularly sensitive to this. An advantage of decision trees is that they easily
handle heterogeneous data.
 Redundancy in the data. If the input features contain redundant information
(e.g., highly correlated features), some learning algorithms (e.g., linear regression,
logistic regression, and distance-based methods) will perform poorly because of
numerical instabilities. These problems can often be solved by imposing some form
of regularization.
 Presence of interactions and non-linearities. If each of the features makes an
independent contribution to the output, then algorithms based on linear functions
(e.g., linear regression, logistic regression, support-vector machines, naive Bayes)
and distance functions (e.g., nearest neighbor methods, support-vector machines
with Gaussian kernels) generally perform well. However, if there are complex
interactions among features, then algorithms such as decision trees and neural
networks work better, because they are specifically designed to discover these
interactions. Linear methods can also be applied, but the engineer must manually
specify the interactions when using them.

When considering a new application, the engineer can compare multiple learning
algorithms and experimentally determine which one works best on the problem at
hand (see cross validation). Tuning the performance of a learning algorithm can be
very time-consuming. Given fixed resources, it is often better to spend more time
collecting additional training data and more informative features than it is to spend
extra time tuning the learning algorithms.

Regression: Meaning, Assumptions, Regression Line
REGRESSION

Regression is a statistical measurement used in finance, investing and other disciplines that
attempts to determine the strength of the relationship between one dependent variable
(usually denoted by Y) and a series of other changing variables (known as independent
variables).

Regression helps investment and financial managers to value assets and understand the
relationships between variables, such as commodity prices and the stocks of businesses
dealing in those commodities.

Regression Explained

The two basic types of regression are linear regression and multiple linear regression,
although there are non-linear regression methods for more complicated data and analysis.
Linear regression uses one independent variable to explain or predict the outcome of the
dependent variable Y, while multiple regression uses two or more independent variables to
predict the outcome.

Regression can help finance and investment professionals as well as professionals in other
businesses. Regression can also help predict sales for a company based on weather, previous
sales, GDP growth or other types of conditions. The capital asset pricing model (CAPM) is
an often-used regression model in finance for pricing assets and discovering costs of capital.

The general form of each type of regression is:


 Linear regression: Y = a + bX + u
 Multiple regression: Y = a + b1X1 + b2X2 + b3X3 + … + btXt + u

Where:

Y = the variable that you are trying to predict (dependent variable).

X = the variable that you are using to predict Y (independent variable).

a = the intercept.

b = the slope.

u = the regression residual.

Regression takes a group of random variables, thought to be predicting Y, and tries to find a
mathematical relationship between them. This relationship is typically in the form of a
straight line (linear regression) that best approximates all the individual data points. In
multiple regression, the separate variables are differentiated by using numbers with
subscripts.
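As an illustration, assuming scikit-learn is available, the snippet below estimates the intercept a and the slope b of Y = a + bX from a small made-up sample.

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])      # independent variable
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])     # dependent variable, roughly 1 + 2X

reg = LinearRegression().fit(X, y)
print(reg.intercept_)    # estimate of a (the intercept)
print(reg.coef_[0])      # estimate of b (the slope)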

ASSUMPTIONS IN REGRESSION

 Independence: The residuals are serially independent (no autocorrelation).


 The residuals are not correlated with any of the independent (predictor) variables.
 Linearity: The relationship between the dependent variable and each of the
independent variables is linear.
 Mean of Residuals: The mean of the residuals is zero.
 Homogeneity of Variance: The variance of the residuals at all levels of the
independent variables is constant.
 Errors in Variables: The independent (predictor) variables are measured without
error.
 Model Specification: All relevant variables are included in the model. No irrelevant
variables are included in the model.
 Normality: The residuals are normally distributed. This assumption is needed for
valid tests of significance but not for estimation of the regression coefficients.

REGRESSION LINE

Definition: The Regression Line is the line that best fits the data, such that the overall
distance from the line to the points (variable values) plotted on a graph is the smallest. In
other words, a line used to minimize the squared deviations of predictions is called as the
regression line.

There are as many numbers of regression lines as variables. Suppose we take two variables,
say X and Y, then there will be two regression lines:

 Regression line of Y on X: This gives the most probable values of Y from the given
values of X.
 Regression line of X on Y: This gives the most probable values of X from the given
values of Y.

The algebraic expression of these regression lines is called as Regression Equations. There
will be two regression equations for the two regression lines.

The correlation between the variables depends on the distance between these two regression
lines, such as the nearer the regression lines to each other the higher is the degree of
correlation, and the farther the regression lines to each other the lesser is the degree of
correlation.

The correlation is said to be either perfect positive or perfect negative when the two
regression lines coincide, i.e. only one line exists. In case, the variables are independent; then
the correlation will be zero, and the lines of regression will be at right angles, i.e. parallel to
the X axis and Y axis.

Note: The regression lines cut each other at the point of the averages of X and Y. This
means that if a perpendicular is drawn from the point where the lines intersect down to the
X axis, we get the mean value of X. Similarly, if a horizontal line is drawn from that point to
the Y axis, we get the mean value of Y.

Metrics for Evaluating a Linear Model, Multivariate Regression, Non-Linear Regression
Machine Learning is a branch of Artificial Intelligence. It contains many algorithms to
solve various real-world problems. Building a machine learning model is not the only
goal of a data scientist; deploying a well-generalized model is the target of every
machine learning engineer.

Regression is one type of supervised machine learning.

Regression

Regression is a type of machine learning which helps in finding the relationship
between independent and dependent variables.

In simple words, regression can be defined as a machine learning problem where
we have to predict continuous values like price, rating, fees, etc.

R Square/Adjusted R Square

R Square measures how much of the variability in the dependent variable can be
explained by the model. It is the square of the correlation coefficient (R), which is
why it is called R Square.

R Square is a good measure to determine how well the model fits the dependent
variable. However, it does not take the overfitting problem into consideration. If your
regression model has many independent variables, the model may be too
complicated: it may fit the training data very well but perform badly on testing data.
That is why Adjusted R Square is introduced; it penalizes additional independent
variables added to the model and adjusts the metric to prevent overfitting issues.

Mean Square Error (MSE)/Root Mean Square Error (RMSE)

While R Square is a relative measure of how well the model fits the dependent variables,
Mean Square Error is an absolute measure of the goodness of fit.

MSE is calculated as the sum of the squares of the prediction errors, where the prediction
error is the real output minus the predicted output, divided by the number of data points.
It gives you an absolute number indicating how much your predicted results deviate from
the actual values. You cannot interpret many insights from one single result, but it gives
you a real number to compare against other model results and helps you select the best
regression model.

Root Mean Square Error (RMSE) is the square root of MSE. It is used more commonly
than MSE because, firstly, the MSE value can sometimes be too big to compare easily.
Secondly, MSE is calculated from the square of the error, and taking the square root
brings it back to the same scale as the prediction error and makes it easier to interpret.

Advantages of MSE

 The graph of MSE is differentiable, so you can easily use it as a loss function.

Disadvantages of MSE

 The value you get after calculating MSE is in squared units of the output. For
example, if the output variable is in metres (m), then after calculating MSE the
output we get is in metres squared.
 If you have outliers in the dataset, MSE penalizes the outliers the most and the
calculated MSE becomes bigger. In short, it is not robust to outliers, which is an
advantage of MAE (discussed below).

Advantages of RMSE

 The output value you get is in the same unit as the required output variable which
makes interpretation of loss easy.

Disadvantages of RMSE

 It is not as robust to outliers as MAE.

Mean Absolute Error (MAE)

Mean Absolute Error (MAE) is similar to Mean Square Error (MSE). However, instead
of the sum of the squares of the errors as in MSE, MAE takes the sum of the absolute
values of the errors.

Compared to MSE or RMSE, MAE is a more direct representation of the sum of the
error terms. MSE gives larger penalization to big prediction errors by squaring them,
while MAE treats all errors the same.

Advantages of MAE

 The MAE you get is in the same unit as the output variable.
 It is most Robust to outliers.

Disadvantages of MAE

 The graph of MAE is not differentiable at zero, so optimizers such as gradient
descent have to be applied using subgradients.
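These metrics can be computed with scikit-learn, as in the hedged sketch below; the true and predicted values are made up for illustration.

import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

y_true = np.array([3.0, 5.0, 7.5, 9.0])
y_pred = np.array([2.8, 5.3, 7.0, 9.4])

print(r2_score(y_true, y_pred))                  # R Square
mse = mean_squared_error(y_true, y_pred)         # MSE
print(mse, np.sqrt(mse))                         # MSE and RMSE
print(mean_absolute_error(y_true, y_pred))       # MAE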

Multivariate regression

Non-Linear Regression

In statistics, nonlinear regression is a form of regression analysis in which
observational data are modelled by a function which is a nonlinear combination of
the model parameters and depends on one or more independent variables. The data
are fitted by a method of successive approximations.

In nonlinear regression, a statistical model of the form

y ≈ f(x, β)

relates a vector of independent variables x to its associated observed dependent
variable y. The function f is nonlinear in the components of the vector of
parameters β, but otherwise arbitrary. For example, the Michaelis–Menten model for
enzyme kinetics has two parameters and one independent variable, related by f as:

f(x, β) = (β1 · x) / (β2 + x)

This function is nonlinear because it cannot be expressed as a linear combination of
the two βs.
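As an illustrative sketch, assuming SciPy is available, the Michaelis–Menten parameters β1 and β2 can be fitted by successive approximations with curve_fit; the measurements and the initial guess below are made up.

import numpy as np
from scipy.optimize import curve_fit

def michaelis_menten(x, b1, b2):
    return (b1 * x) / (b2 + x)

x = np.array([0.5, 1.0, 2.0, 4.0, 8.0])
y = np.array([0.9, 1.4, 1.9, 2.3, 2.5])          # made-up reaction-rate measurements

# Successive approximations starting from an initial guess p0.
params, _ = curve_fit(michaelis_menten, x, y, p0=[3.0, 1.0])
print(params)    # fitted values of beta1 and beta2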

Systematic error may be present in the independent variables but its treatment is
outside the scope of regression analysis. If the independent variables are not error-
free, this is an errors-in-variables model, also outside this scope.

Other examples of nonlinear functions include exponential functions, logarithmic
functions, trigonometric functions, power functions, Gaussian functions, and Lorentz
distributions. Some functions, such as the exponential or logarithmic functions, can
be transformed so that they are linear. When so transformed, standard linear
regression can be performed but must be applied with caution.

The goal of the model is to make the sum of the squares as small as possible.   The
sum of squares is a measure that tracks how far the Y observations vary from the
nonlinear (curved) function that is used to predict Y.

It is computed by first finding the difference between the fitted nonlinear function
and every Y point of data in the set. Then, each of those differences is squared.
Lastly, all of the squared figures are added together. The smaller the sum of these
squared figures, the better the function fits the data points in the set. Nonlinear
regression uses logarithmic functions, trigonometric functions, exponential functions,
power functions, Lorenz curves, Gaussian functions, and other fitting methods.

Nonlinear regression modeling is like linear regression modeling in that both seek to
track a particular response from a set of variables graphically. Nonlinear models are
more complicated than linear models to develop because the function is created
through a series of approximations (iterations) that may stem from trial-and-error.
Mathematicians use several established methods, such as the Gauss-Newton method
and the Levenberg-Marquardt method.

Often, regression models that appear nonlinear upon first glance are actually linear.
The curve estimation procedure can be used to identify the nature of the functional
relationships at play in your data, so you can choose the correct regression model,
whether linear or nonlinear. Linear regression models, while they typically form a
straight line, can also form curves, depending on the form of the linear regression
equation. Likewise, it’s possible to use algebra to transform a nonlinear equation so
that it mimics a linear equation; such a nonlinear equation is referred to as “intrinsically
linear.”

Artificial Neural Network


Artificial neural networks (ANNs), usually simply called neural networks (NNs), are
computing systems inspired by the biological neural networks that constitute animal
brains.

An ANN is based on a collection of connected units or nodes called artificial neurons,
which loosely model the neurons in a biological brain. Each connection, like the
synapses in a biological brain, can transmit a signal to other neurons. An artificial
neuron receives a signal, then processes it and can signal neurons connected to it.
The signal at a connection is a real number, and the output of each neuron is
computed by some non-linear function of the sum of its inputs. The connections are
called edges. Neurons and edges typically have a weight that adjusts as learning
proceeds. The weight increases or decreases the strength of the signal at a
connection. Neurons may have a threshold such that a signal is sent only if the
aggregate signal crosses that threshold. Typically, neurons are aggregated into
layers. Different layers may perform different transformations on their inputs.
Signals travel from the first layer (the input layer) to the last layer (the output layer),
possibly after traversing the layers multiple times.

Training

Neural networks learn (or are trained) by processing examples, each of which
contains a known “input” and “result,” forming probability-weighted associations
between the two, which are stored within the data structure of the net itself. The
training of a neural network from a given example is usually conducted by
determining the difference between the processed output of the network (often a
prediction) and a target output. This difference is the error. The network then
adjusts its weighted associations according to a learning rule and using this error
value. Successive adjustments will cause the neural network to produce output
which is increasingly similar to the target output. After a sufficient number of these
adjustments the training can be terminated based upon certain criteria. This is
known as supervised learning.

Such systems “learn” to perform tasks by considering examples, generally without
being programmed with task-specific rules. For example, in image recognition, they
might learn to identify images that contain cats by analyzing example images that
have been manually labelled as “cat” or “no cat” and using the results to identify
cats in other images. They do this without any prior knowledge of cats, for example,
that they have fur, tails, whiskers, and cat-like faces. Instead, they automatically
generate identifying characteristics from the examples that they process.

The architecture of an artificial neural network:

To understand the concept of the architecture of an artificial neural network, we
have to understand what a neural network consists of. A neural network consists of
a large number of artificial neurons, termed units, arranged in a sequence of layers.
Let us look at the various types of layers available in an artificial neural network.

An artificial neural network primarily consists of three layers:

Input Layer:

As the name suggests, it accepts inputs in several different formats provided by the
programmer.

Hidden Layer:

The hidden layer lies between the input and output layers. It performs all the
calculations to find hidden features and patterns.

Output Layer:
The input goes through a series of transformations using the hidden layer, which
finally results in output that is conveyed using this layer.

The artificial neural network takes input and computes the weighted sum of the
inputs and includes a bias. This computation is represented in the form of a transfer
function.
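A minimal NumPy sketch of this computation (the weights, biases, and inputs are made-up values): each layer applies a weighted sum plus a bias, followed by a non-linear transfer function.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))           # non-linear transfer (activation) function

x = np.array([0.5, -1.2, 3.0])                # input layer: one example with 3 features

W_hidden = np.array([[0.2, -0.5, 0.1],        # weights of 2 hidden neurons
                     [0.7,  0.3, -0.4]])
b_hidden = np.array([0.1, -0.2])              # biases of the hidden layer

W_out = np.array([[0.6, -0.9]])               # weights of 1 output neuron
b_out = np.array([0.05])

h = sigmoid(W_hidden @ x + b_hidden)          # weighted sum + bias, then transfer function
y = sigmoid(W_out @ h + b_out)                # output layer
print(y)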

Advantages of Artificial Neural Network (ANN)

Storing data on the entire network:

Unlike traditional programming, where data is stored in a database, the information in an
ANN is stored across the whole network. The disappearance of a few pieces of data in one
place doesn't prevent the network from working.

Parallel processing capability:

Artificial neural networks have numerical strength (parallelism) that allows them to
perform more than one task simultaneously.

Capability to work with incomplete knowledge:

After ANN training, the information may produce output even with inadequate data.
The loss of performance here relies upon the significance of missing data.

Having fault tolerance:

Corruption of one or more cells of an ANN does not prohibit it from generating output,
and this feature makes the network fault-tolerant.

Having a memory distribution:

For an ANN to be able to adapt, it is important to determine the examples and to
encourage the network according to the desired output by demonstrating these
examples to the network. The success of the network is directly proportional to
the chosen instances, and if the event can't be shown to the network in all its
aspects, the network can produce false output.

Disadvantages of Artificial Neural Network:

Unrecognized behavior of the network:

It is the most significant issue of ANNs. When an ANN produces a solution, it does
not provide insight concerning why and how it was reached. This decreases trust in
the network.

Assurance of proper network structure:


There is no particular guideline for determining the structure of artificial neural
networks. The appropriate network structure is accomplished through experience,
trial, and error.

Hardware dependence:

Artificial neural networks need processors with parallel processing power, as per
their structure. Therefore, the realization of the network depends on suitable hardware.

The duration of the network is unknown:

The network is reduced to a specific value of the error, and this value does not give
us optimum results.

Difficulty of showing the issue to the network:

ANNs can work with numerical data. Problems must be converted into numerical
values before being introduced to ANN. The presentation mechanism to be resolved
here will directly impact the performance of the network. It relies on the user’s
abilities.

Network design

Neural architecture search (NAS) uses machine learning to automate ANN design.
Various approaches to NAS have designed networks that compare well with hand-
designed systems. The basic search algorithm is to propose a candidate model,
evaluate it against a dataset and use the results as feedback to teach the NAS
network. Available systems include AutoML and AutoKeras.

Design issues include deciding the number, type and connectedness of network
layers, as well as the size of each and the connection type.

Hyperparameters must also be defined as part of the design (they are not learned),
governing matters such as how many neurons are in each layer, learning rate, step,
stride, depth, receptive field and padding (for CNNs), etc.

Use

Using Artificial neural networks requires an understanding of their characteristics.

 Choice of model: This depends on the data representation and the
application. Overly complex models learn slowly.
 Learning algorithm: Numerous trade-offs exist between learning
algorithms. Almost any algorithm will work well with the correct
hyperparameters for training on a particular data set. However, selecting and
tuning an algorithm for training on unseen data requires significant
experimentation.
 Robustness: If the model, cost function and learning algorithm are selected
appropriately, the resulting ANN can become robust.

ANN capabilities fall within the following broad categories:

 Function approximation, or regression analysis, including time series


prediction, fitness approximation and modelling.
 Classification, including pattern and sequence recognition, novelty detection
and sequential decision making.
 Data processing, including filtering, clustering, blind source separation and
compression.
 Robotics, including directing manipulators and prostheses.

Clustering Reinforcement Learning


Clustering of data into groups is an important task to perform dimensionality
reduction and to identify important properties of a data set. A wide range of
clustering algorithms have been devised, all of which use some similarity/distance
measure built into the algorithm to establish groupings of the data with a range of
properties. These clustering algorithms group the data, attempting to maximize the
distance between the clusters while minimizing the variance within the individual
clusters. However, traditional clustering algorithms do
not provide an efficient mechanism to fine tune the final clustering solution, given
some sparse information about the desired properties of the grouping of the data.

The need of semi-unsupervised clustering arises, for example, in data sets with large
numbers of attributes where most of the attributes are not semantically relevant but
will dominate any distance metric (due to their number), used by traditional
clustering algorithms. In these cases, sparse information regarding the quality of
clusters or regarding relations between a small number of data points might be
available which could be used to guide the cluster formation.

Semi-unsupervised clustering defines pairwise constraints on the input data in order


to direct the clustering algorithm towards an answer which satisfies given
constraints. We can have two possible types of constraints, same cluster or must link
constraints which indicate that points should be in the same cluster, and different
cluster or must not link constraints indicating that points should be in different
clusters.

Given the input samples, it is often not possible to cluster the data according to the
constraints in their original feature space using unmodified distance measures as
indications of similarity. Thus, we have to modify the feature space, usually by
scaling the dimensions, so that an unmodified clustering algorithm is able to cluster
based on its own distance and variance criteria. To solve this problem, one approach
first learns a policy to compute the scaling factors using reinforcement learning from
a set of training problems and subsequently applies the learned policy to compute
the scaling factors for new problems. The goal is that, by working on the scaled
dimensions, the traditional clustering algorithm can yield results that satisfy the
constraints.

Clustering Methods:

 Density-Based Methods: These methods consider the clusters as the dense region
having some similarities and differences from the lower dense region of the space.
These methods have good accuracy and the ability to merge two clusters. Example
DBSCAN (Density-Based Spatial Clustering of Applications with Noise), OPTICS
(Ordering Points to Identify Clustering Structure), etc.
 Hierarchical Based Methods: The clusters formed in this method form a tree-type
structure based on the hierarchy. New clusters are formed using the previously
formed one. It is divided into two categories.
o Agglomerative (bottom-up approach)
o Divisive (top-down approach)

Applications of Clustering in different fields

Marketing: It can be used to characterize and discover customer segments for marketing purposes.

Biology: It can be used for classification among different species of plants and animals.

Libraries: It is used for clustering different books on the basis of topics and information.

Insurance: It is used to understand customers and their policies and to identify fraud.

City Planning: It is used to group houses and to study their values based on their geographical locations and other factors.

Earthquake studies: By learning which areas are earthquake-affected, we can determine the dangerous zones.

Clustering Algorithms:

K-means clustering algorithm: It is one of the simplest unsupervised learning algorithms for solving the clustering problem. The K-means algorithm partitions n observations into k clusters, where each observation belongs to the cluster with the nearest mean, which serves as the prototype of the cluster.
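
As a minimal sketch (assuming scikit-learn and NumPy are available; the toy data points are invented for illustration), K-means can be run as follows:

import numpy as np
from sklearn.cluster import KMeans

# Toy two-dimensional data; the values are illustrative only.
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.0, 0.6],
              [5.0, 8.0], [8.0, 8.0], [9.0, 11.0]])

# Partition the 6 observations into k = 2 clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster index assigned to each observation
print(kmeans.cluster_centers_)  # the cluster means used as prototypes
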
Decision Tree Learning
Decision tree learning or induction of decision trees is one of the predictive
modelling approaches used in statistics, data mining and machine learning. It uses a
decision tree (as a predictive model) to go from observations about an item
(represented in the branches) to conclusions about the item’s target value
(represented in the leaves). Tree models where the target variable can take a
discrete set of values are called classification trees; in these tree structures, leaves
represent class labels and branches represent conjunctions of features that lead to
those class labels. Decision trees where the target variable can take continuous
values (typically real numbers) are called regression trees. Decision trees are among
the most popular machine learning algorithms given their intelligibility and simplicity.

In decision analysis, a decision tree can be used to represent decisions and decision
making visually and explicitly. In data mining, a decision tree describes data (but the
resulting classification tree can be an input for decision making).

Decision tree learning is a method commonly used in data mining. The goal is to
create a model that predicts the value of a target variable based on several input
variables.

A decision tree is a simple representation for classifying examples. For this section,
assume that all of the input features have finite discrete domains, and there is a
single target feature called the “Classification”. Each element of the domain of the
classification is called a class. A decision tree or a classification tree is a tree in which
each internal (non-leaf) node is labelled with an input feature. The arcs coming from
a node labeled with an input feature are labeled with each of the possible values of that input feature, or the arc leads to a subordinate decision node on a different input feature. Each leaf of the tree is labeled with a class or a probability distribution
over the classes, signifying that the data set has been classified by the tree into
either a specific class, or into a particular probability distribution (which, if the
decision tree is well-constructed, is skewed towards certain subsets of classes).
A tree is built by splitting the source set, constituting the root node of the tree, into
subsets which constitute the successor children. The splitting is based on a set of
splitting rules based on classification features. This process is repeated on each
derived subset in a recursive manner called recursive partitioning. The recursion is
completed when the subset at a node has all the same values of the target variable,
or when splitting no longer adds value to the predictions. This process of top-down
induction of decision trees (TDIDT) is an example of a greedy algorithm, and it is by
far the most common strategy for learning decision trees from data.

In data mining, decision trees can also be described as the combination of mathematical and computational techniques to aid the description, categorization and generalization of a given set of data.

Decision Tree Terminologies

 Root Node: Root node is from where the decision tree starts. It represents
the entire dataset, which further gets divided into two or more homogeneous
sets.
 Leaf Node: Leaf nodes are the final output node, and the tree cannot be
segregated further after getting a leaf node.
 Splitting: Splitting is the process of dividing the decision node/root node into
sub-nodes according to the given conditions.
 Branch/Sub Tree: A subtree formed by splitting a node of the tree.
 Pruning: Pruning is the process of removing the unwanted branches from
the tree.
 Parent/Child node: The root node of the tree is called the parent node, and
other nodes are called the child nodes.

Decision Tree algorithm Working

In a decision tree, for predicting the class of the given dataset, the algorithm starts
from the root node of the tree. This algorithm compares the values of root attribute
with the record (real dataset) attribute and, based on the comparison, follows the
branch and jumps to the next node.

For the next node, the algorithm again compares the attribute value with the other
sub-nodes and moves further. It continues the process until it reaches a leaf node
of the tree. The complete process can be better understood using the below
algorithm:

Step 1: Begin the tree with the root node, say S, which contains the complete
dataset.

Step 2: Find the best attribute in the dataset using Attribute Selection Measure
(ASM).

Step 3: Divide S into subsets that contain the possible values of the best attribute.

Step 4: Generate the decision tree node, which contains the best attribute.

Step 5: Recursively make new decision trees using the subsets of the dataset created in Step 3. Continue this process until a stage is reached where the nodes cannot be classified further; each such final node is called a leaf node.
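
To make Step 2 concrete, here is a minimal sketch of one common Attribute Selection Measure, information gain based on entropy. The tiny weather-style dataset is invented purely for illustration:

from collections import Counter
from math import log2

def entropy(labels):
    # Shannon entropy of a list of class labels.
    total = len(labels)
    return -sum((count / total) * log2(count / total) for count in Counter(labels).values())

def information_gain(attribute_values, labels):
    # Entropy of the labels minus the weighted entropy after splitting on the attribute.
    total = len(labels)
    remainder = 0.0
    for value in set(attribute_values):
        subset = [lbl for av, lbl in zip(attribute_values, labels) if av == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder

outlook = ["sunny", "sunny", "overcast", "rain", "rain", "overcast"]
play    = ["no",    "no",    "yes",      "yes",  "no",   "yes"]
print(information_gain(outlook, play))  # the attribute with the highest gain is chosen in Step 2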

Bayesian networks
A Bayesian network (also known as a Bayes network, Bayes net, belief network, or
decision network) is a probabilistic graphical model that represents a set of variables
and their conditional dependencies via a directed acyclic graph (DAG). Bayesian
networks are ideal for taking an event that occurred and predicting the likelihood
that any one of several possible known causes was the contributing factor. For
example, a Bayesian network could represent the probabilistic relationships between
diseases and symptoms. Given symptoms, the network can be used to compute the
probabilities of the presence of various diseases.

Efficient algorithms can perform inference and learning in Bayesian networks.


Bayesian networks that model sequences of variables (e.g. speech signals or protein
sequences) are called dynamic Bayesian networks. Generalizations of Bayesian
networks that can represent and solve decision problems under uncertainty are
called influence diagrams.

Designing a Bayesian Network requires defining at least three things:

 Random Variables. What are the random variables in the problem?


 Conditional Relationships. What are the conditional relationships between
the variables?
 Probability Distributions. What are the probability distributions for each
variable?
Graphical model

Formally, Bayesian networks are directed acyclic graphs (DAGs) whose nodes
represent variables in the Bayesian sense: they may be observable quantities, latent
variables, unknown parameters or hypotheses. Edges represent conditional
dependencies; nodes that are not connected (no path connects one node to
another) represent variables that are conditionally independent of each other. Each
node is associated with a probability function that takes, as input, a particular set of
values for the node’s parent variables and gives (as output) the probability (or
probability distribution, if applicable) of the variable represented by the node.

A Bayesian network is a directed acyclic graph in which each edge corresponds to a conditional dependency, and each node corresponds to a unique random variable.
Formally, if an edge (A, B) exists in the graph connecting random variables A and B,
it means that P(B|A) is a factor in the joint probability distribution, so we must know
P(B|A) for all values of B and A to conduct inference. In the classic sprinkler example, a network with nodes Cloudy, Sprinkler, Rain and WetGrass, since Rain has an edge going into WetGrass, P(WetGrass|Rain) will be a factor, whose probability values are specified next to the WetGrass node in a conditional probability table.

Bayesian networks satisfy the local Markov property, which states that a node is
conditionally independent of its non-descendants given its parents. In the sprinkler example, this means that P(Sprinkler|Cloudy, Rain) = P(Sprinkler|Cloudy), since Sprinkler is conditionally independent of its non-descendant, Rain, given Cloudy. This property allows us to simplify the joint distribution obtained from the chain rule into a smaller, factored form.
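
A minimal sketch of how this factorization works for the sprinkler network; the conditional probability values below are illustrative assumptions, not taken from any particular source:

# Joint probability of the sprinkler network factored by the local Markov property:
# P(C, S, R, W) = P(C) * P(S|C) * P(R|C) * P(W|S, R). All numbers are illustrative.

P_C = {True: 0.5, False: 0.5}                      # P(Cloudy)
P_S_given_C = {True: 0.1, False: 0.5}              # P(Sprinkler=True | Cloudy)
P_R_given_C = {True: 0.8, False: 0.2}              # P(Rain=True | Cloudy)
P_W_given_SR = {(True, True): 0.99, (True, False): 0.9,
                (False, True): 0.9, (False, False): 0.0}   # P(WetGrass=True | Sprinkler, Rain)

def joint(cloudy, sprinkler, rain, wet):
    p = P_C[cloudy]
    p *= P_S_given_C[cloudy] if sprinkler else 1 - P_S_given_C[cloudy]
    p *= P_R_given_C[cloudy] if rain else 1 - P_R_given_C[cloudy]
    p *= P_W_given_SR[(sprinkler, rain)] if wet else 1 - P_W_given_SR[(sprinkler, rain)]
    return p

print(joint(cloudy=True, sprinkler=False, rain=True, wet=True))  # 0.5 * 0.9 * 0.8 * 0.9 = 0.324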

Support Vector Machine


In machine learning, support-vector machines (SVMs, also support-vector networks)
are supervised learning models with associated learning algorithms that analyze
data for classification and regression analysis. They were developed at AT&T Bell Laboratories by Vladimir Vapnik and colleagues. SVMs are one of the most robust prediction methods, being based on the statistical learning framework or VC theory proposed by
Vapnik and Chervonenkis. Given a set of training examples, each marked as
belonging to one of two categories, an SVM training algorithm builds a model that
assigns new examples to one category or the other, making it a non-probabilistic
binary linear classifier (although methods such as Platt scaling exist to use SVM in a
probabilistic classification setting). SVM maps training examples to points in space
so as to maximize the width of the gap between the two categories. New examples
are then mapped into that same space and predicted to belong to a category based
on which side of the gap they fall.

In addition to performing linear classification, SVMs can efficiently perform a non-linear classification using what is called the kernel trick, implicitly mapping their
inputs into high-dimensional feature spaces.

When data are unlabeled, supervised learning is not possible, and an unsupervised
learning approach is required, which attempts to find natural clustering of the data to
groups, and then map new data to these formed groups. The support-vector
clustering algorithm, created by Hava Siegelmann and Vladimir Vapnik, applies the
statistics of support vectors, developed in the support vector machines algorithm, to
categorize unlabeled data, and is one of the most widely used clustering algorithms
in industrial applications.

Motivation

Classifying data is a common task in machine learning. Suppose some given data
points each belong to one of two classes, and the goal is to decide which class a
new data point will be in. In the case of support-vector machines, a data point is
viewed as p-dimensional vector (a list of p numbers), and we want to know whether
we can separate such points with (p-1)-dimensional hyperplane. This is called a
linear classifier. There are many hyperplanes that might classify the data. One
reasonable choice as the best hyperplane is the one that represents the largest
separation, or margin, between the two classes. So, we choose the hyperplane so
that the distance from it to the nearest data point on each side is maximized. If such
a hyperplane exists, it is known as the maximum-margin hyperplane and the linear
classifier it defines is known as a maximum-margin classifier; or equivalently, the
perceptron of optimal stability.
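
A minimal sketch of fitting a maximum-margin linear classifier with scikit-learn; the toy points and labels are invented for illustration:

import numpy as np
from sklearn.svm import SVC

# Two well-separated toy classes; the coordinates are illustrative only.
X = np.array([[1, 1], [2, 1], [1, 2],
              [5, 5], [6, 5], [5, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)    # linear kernel -> maximum-margin hyperplane
print(clf.support_vectors_)                    # the training points that define the margin
print(clf.predict([[2, 2], [6, 6]]))           # new points fall on either side of the gap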

Applications

SVMs can be used to solve various real-world problems:

 SVMs are helpful in text and hypertext categorization, as their application can
significantly reduce the need for labeled training instances in both the
standard inductive and transductive settings. Some methods for shallow
semantic parsing are based on support vector machines.
 Classification of images can also be performed using SVMs. Experimental
results show that SVMs achieve significantly higher search accuracy than
traditional query refinement schemes after just three to four rounds of
relevance feedback. This is also true for image segmentation systems,
including those using a modified version of SVM that uses the privileged
approach as suggested by Vapnik.
 Classification of satellite data like SAR data using supervised SVM.
 Hand-written characters can be recognized using SVM.
 The SVM algorithm has been widely applied in the biological and other
sciences. They have been used to classify proteins with up to 90% of the
compounds classified correctly. Permutation tests based on SVM weights
have been suggested as a mechanism for interpretation of SVM models.
Support-vector machine weights have also been used to interpret SVM
models in the past. Posthoc interpretation of support-vector machine models
in order to identify features used by the model to make predictions is a
relatively new area of research with special significance in the biological
sciences.

Genetic Algorithm
In computer science and operations research, a genetic algorithm (GA) is a
metaheuristic inspired by the process of natural selection that belongs to the larger
class of evolutionary algorithms (EA). Genetic algorithms are commonly used to
generate high-quality solutions to optimization and search problems by relying on
biologically inspired operators such as mutation, crossover and selection. Some
examples of GA applications include optimizing decision trees for better
performance, solving sudoku puzzles automatically, hyperparameter optimization, etc.

Optimization problems

In a genetic algorithm, a population of candidate solutions (called individuals, creatures, or phenotypes) to an optimization problem is evolved toward better
solutions. Each candidate solution has a set of properties (its chromosomes or
genotype) which can be mutated and altered; traditionally, solutions are represented
in binary as strings of 0s and 1s, but other encodings are also possible.

The evolution usually starts from a population of randomly generated individuals, and is an iterative process, with the population in each iteration called a generation.
In each generation, the fitness of every individual in the population is evaluated; the
fitness is usually the value of the objective function in the optimization problem
being solved. The more fit individuals are stochastically selected from the current
population, and each individual’s genome is modified (recombined and possibly
randomly mutated) to form a new generation. The new generation of candidate
solutions is then used in the next iteration of the algorithm. Commonly, the
algorithm terminates when either a maximum number of generations has been
produced, or a satisfactory fitness level has been reached for the population.

A typical genetic algorithm requires:

 A genetic representation of the solution domain.
 A fitness function to evaluate the solution domain.

A standard representation of each candidate solution is as an array of bits (also
called bit set or bit string). Arrays of other types and structures can be used in
essentially the same way. The main property that makes these genetic
representations convenient is that their parts are easily aligned due to their fixed
size, which facilitates simple crossover operations. Variable length representations
may also be used, but crossover implementation is more complex in this case. Tree-
like representations are explored in genetic programming and graph-form
representations are explored in evolutionary programming; a mix of both linear
chromosomes and trees is explored in gene expression programming.

Once the genetic representation and the fitness function are defined, a GA proceeds
to initialize a population of solutions and then to improve it through repetitive
application of the mutation, crossover, inversion and selection operators.
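
A minimal genetic algorithm sketch for the classic OneMax task (maximize the number of 1s in a bit string); the population size, mutation rate and generation count are illustrative assumptions:

import random

GENOME_LEN, POP_SIZE, GENERATIONS, MUTATION_RATE = 20, 30, 50, 0.02

def fitness(genome):
    return sum(genome)                          # objective: count of 1 bits

def select(population):
    # Tournament selection: the fittest of three random individuals.
    return max(random.sample(population, 3), key=fitness)

def crossover(a, b):
    point = random.randint(1, GENOME_LEN - 1)   # single-point crossover
    return a[:point] + b[point:]

def mutate(genome):
    return [1 - bit if random.random() < MUTATION_RATE else bit for bit in genome]

population = [[random.randint(0, 1) for _ in range(GENOME_LEN)] for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    population = [mutate(crossover(select(population), select(population)))
                  for _ in range(POP_SIZE)]

print(max(fitness(g) for g in population))      # best fitness found (at most GENOME_LEN)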

Limitations

There are limitations to the use of a genetic algorithm compared to alternative optimization algorithms:

 Repeated fitness function evaluation for complex problems is often the most
prohibitive and limiting segment of artificial evolutionary algorithms. Finding
the optimal solution to complex high-dimensional, multimodal problems often
requires very expensive fitness function evaluations. In real world problems
such as structural optimization problems, a single function evaluation may
require several hours to several days of complete simulation. Typical
optimization methods cannot deal with such types of problem. In this case, it
may be necessary to forgo an exact evaluation and use an approximated
fitness that is computationally efficient. Combining GA with such approximate models may be one of the most promising approaches for applying GA to complex real-life problems.
 Genetic algorithms do not scale well with complexity. That is, where the
number of elements which are exposed to mutation is large there is often an
exponential increase in search space size. This makes it extremely difficult to
use the technique on problems such as designing an engine, a house or a
plane. In order to make such problems tractable to evolutionary search, they
must be broken down into the simplest representation possible. Hence we
typically see evolutionary algorithms encoding designs for fan blades instead
of engines, building shapes instead of detailed construction plans, and airfoils
instead of whole aircraft designs. The second problem of complexity is the
issue of how to protect parts that have evolved to represent good solutions
from further destructive mutation, particularly when their fitness assessment
requires them to combine well with other parts.
 The “better” solution is only in comparison to other solutions. As a result, the
stop criterion is not clear in every problem.
 In many problems, GAs have a tendency to converge towards local optima or
even arbitrary points rather than the global optimum of the problem. This
means that it does not “know how” to sacrifice short-term fitness to gain
longer-term fitness. The likelihood of this occurring depends on the shape of
the fitness landscape: certain problems may provide an easy ascent towards a
global optimum; others may make it easier for the function to find the local
optima. This problem may be alleviated by using a different fitness function,
increasing the rate of mutation, or by using selection techniques that maintain
a diverse population of solutions, although the No Free Lunch theorem proves
that there is no general solution to this problem. A common technique to
maintain diversity is to impose a “niche penalty”, wherein any group of
individuals of sufficient similarity (niche radius) have a penalty added, which
will reduce the representation of that group in subsequent generations,
permitting other (less similar) individuals to be maintained in the population.
This trick, however, may not be effective, depending on the landscape of the
problem. Another possible technique would be to simply replace part of the
population with randomly generated individuals when most of the population
is too similar to each other. Diversity is important in genetic algorithms (and
genetic programming) because crossing over a homogeneous population does
not yield new solutions. In evolution strategies and evolutionary
programming, diversity is not essential because of a greater reliance on
mutation.
 Operating on dynamic data sets is difficult, as genomes begin to converge
early on towards solutions which may no longer be valid for later data.
Several methods have been proposed to remedy this by increasing genetic
diversity somehow and preventing early convergence, either by increasing the
probability of mutation when the solution quality drops (called triggered
hypermutation), or by occasionally introducing entirely new, randomly
generated elements into the gene pool (called random immigrants). Again,
evolution strategies and evolutionary programming can be implemented with
a so-called “comma strategy” in which parents are not maintained and new
parents are selected only from offspring. This can be more effective on
dynamic problems.
 GAs cannot effectively solve problems in which the only fitness measure is a
single right/wrong measure (like decision problems), as there is no way to
converge on the solution (no hill to climb). In these cases, a random search
may find a solution as quickly as a GA. However, if the situation allows the
success/failure trial to be repeated giving (possibly) different results, then the
ratio of successes to failures provides a suitable fitness measure.
 For specific optimization problems and problem instances, other optimization
algorithms may be more efficient than genetic algorithms in terms of speed of
convergence. Alternative and complementary algorithms include evolution
strategies, evolutionary programming, simulated annealing, Gaussian
adaptation, hill climbing, and swarm intelligence (e.g.: ant colony
optimization, particle swarm optimization) and methods based on integer
linear programming. The suitability of genetic algorithms is dependent on the
amount of knowledge of the problem; well-known problems often have
better, more specialized approaches.

Issues in Machine Learning


The complexity and quality trade-off

Building robust machine learning models requires substantial computational resources to process the features and labels. Coding a complex model requires
significant effort from data scientists and software engineers. Complex models can
require substantial computing power to execute and can take longer to derive a
usable result.

This represents a trade-off for businesses. They can choose a faster response but a
potentially less accurate outcome. Or they can accept a slower response but receive
a more accurate result from the model. But these compromises aren’t all bad news.
The decision of whether to go for a higher cost and more accurate model over a
faster response comes down to the use case.

For example, making recommendations to shoppers on a retail shopping site requires real-time responses, but can accept some unpredictability in the result. On
the other hand, a stock trading system requires a more robust result. So, a model
that uses more data and performs more computations is likely to deliver a better
outcome when a real-time result is not needed.

As Machine Learning as a Service (MLaaS) offerings enter the market, the complexity and quality trade-offs will get greater attention. Researchers from the
University of Chicago looked at the effectiveness of MLaaS and found that “they can
achieve results comparable to standalone classifiers if they have sufficient insight
into key decisions like classifiers and feature selection”.

Poor Quality of Data

Data plays a significant role in the machine learning process. One of the significant
issues that machine learning professionals face is the absence of good quality data.
Unclean and noisy data can make the whole process extremely exhausting. We
don’t want our algorithm to make inaccurate or faulty predictions. Hence the quality
of data is essential to enhance the output. Therefore, we need to ensure that the
process of data preprocessing which includes removing outliers, filtering missing
values, and removing unwanted features, is done with the utmost level of perfection.

Underfitting of Training Data

To teach a child what an apple is, all it takes is to point to an apple and say “apple” repeatedly. The child can then recognize all sorts of apples.

Well, machine learning is still not up to that level yet; it takes a lot of data for most of
the algorithms to function properly. For a simple task, it needs thousands of
examples to make something out of it, and for advanced tasks like image or speech
recognition, it may need lakhs (hundreds of thousands) or even millions of examples.

Overfitting of Training Data

Overfitting occurs when a machine learning model fits its training data too closely, capturing the noise and bias in that data, which negatively affects its performance on new data. It is like trying to fit into oversized jeans. Unfortunately, this is one of the significant issues faced by machine learning professionals: when an algorithm is trained on noisy or biased data, its overall performance suffers. Let’s understand this with the help of an
example. Let’s consider a model trained to differentiate between a cat, a rabbit, a
dog, and a tiger. The training data contains 1000 cats, 1000 dogs, 1000 tigers, and
4000 Rabbits. Then there is a considerable probability that it will identify the cat as a
rabbit. In this example, we had a vast amount of data, but it was biased; hence the
prediction was negatively affected.

Changing expectations and concept drift

Machine learning models operate within specific contexts. For example, ML models
that power recommendation engines for retailers operate at a specific time when
customers are looking at certain products. However, customer needs change over
time, and that means the ML model can drift away from what it was designed to
deliver.

Models can decay for several reasons. Drift can occur when new data is introduced
to the model. This is called data drift. It can also occur when our interpretation of the
data changes. This is concept drift.

To accommodate this drift, you need a model that continuously updates and
improves itself using data that comes in. That means you need to keep checking the
model.

That requires collecting features and labels and reacting to changes so the
model can be updated and retrained. While some aspects of the retraining can be
conducted automatically, some human intervention is needed. It’s critical to
recognise that the deployment of a machine learning tool is not a one-off activity.

Machine learning tools require regular review and update to remain relevant and
continue to deliver value.

Monitoring and maintenance

Creating a model is easy. Building a model can be automatic. However, maintaining and updating the models requires a plan and resources.

Machine learning models are part of a longer pipeline that starts with the features
that are used to train the model. Then there is the model itself, which is a piece of
software that can require modification and updates. That model requires labels so
that the results of an input can be recognized and used by the model. And there may
be a disconnect between the model and the final signal in a system.

In many cases when an unexpected outcome is delivered, it’s not the machine
learning that has broken down but some other part of the chain. For example, a
recommendation engine may have offered a product to a customer, but sometimes
the connection between the sales system and the recommendation could be broken,
and it takes time to find the bug. In this case, it would be hard to tell the model if the
recommendation was successful. Troubleshooting issues like this can be quite labor
intensive.

Machine learning offers significant benefits to businesses. The ability to predict future outcomes, to anticipate and influence customer behavior, and to support business operations is substantial. However, ML also brings challenges to
businesses. By recognizing these challenges and developing strategies to address
them, companies can ensure they are prepared and equipped to handle them and
get the most out of machine learning technology.

Slow Implementation

This is one of the common issues faced by machine learning professionals. The
machine learning models are highly efficient in providing accurate results, but they take a tremendous amount of time. Slow programs, data overload, and excessive requirements usually take a lot of time to provide accurate results. Further, the models require
constant monitoring and maintenance to deliver the best output.

Data Science Vs Machine Learning


“Data Science” and “Machine Learning” are some of the most searched terms in
the technology world. Everyone from first-year Computer Science students to big organizations like Netflix and Amazon is chasing these two techniques, and with good reason. In the data space, the era of Big Data emerged as organizations began dealing with petabytes and exabytes of data. Until around 2010, simply storing this data was very tough for industry. Now that popular frameworks like Hadoop have solved the storage problem, the focus has shifted to processing the data, and this is where Data Science and Machine Learning play a big role.

Data Science and Machine Learning are closely related to each other but have
different functionalities and different goals. Briefly, Data Science is a field to study
the approaches to find insights from the raw data. Whereas Machine Learning is a
technique used by the group of data scientists to enable the machines to learn
automatically from the past data. To understand the difference in-depth, let’s first
have a brief introduction to these two technologies.

Machine Learning used in Data Science

Data Acquisition: In this step, the data is acquired to solve the given problem. For
the recommendation system, we can get the ratings provided by the user for
different products, comments, purchase history, etc.

Business Requirements: In this step, we try to understand the requirements of the business problem we want to solve. Suppose we want to create a
recommendation system, and the business requirement is to increase sales.

Data Processing: In this step, the raw data acquired from the previous step is
transformed into a suitable format, so that it can be easily used by the further steps.
Modeling: The data modeling is a step where machine learning algorithms are
used. So, this step includes the whole machine learning process. The machine
learning process involves importing the data, data cleaning, building a model,
training the model, testing the model, and improving the model’s efficiency.

Data Exploration: It is a step where we understand the patterns of the data, and
try to find out the useful insights from the data.

Deployment & Optimization: This is the last step where the model is deployed on
an actual project, and the performance of the model is checked.

Data Science vs Machine Learning

Data Science: It is used for discovering insights from the data.
Machine Learning: It is used for making predictions and classifying the results for new data points.

Data Science: It deals with understanding and finding hidden patterns or useful insights from the data, which helps in taking smarter business decisions.
Machine Learning: It is a subfield of data science that enables the machine to learn automatically from past data and experiences.

Data Science: It is a broad term that includes various steps to create a model for a given problem and deploy the model.
Machine Learning: It is used in the data modelling step of the data science process.

Data Science: A data scientist needs skills in big data tools like Hadoop, Hive and Pig, statistics, and programming in Python, R, or Scala.
Machine Learning: A machine learning engineer needs skills such as computer science fundamentals, programming in Python or R, and statistics and probability concepts.

Data Science: It can work with raw, structured, and unstructured data.
Machine Learning: It mostly requires structured data to work on.

Data Science: Data scientists spend a lot of time handling, cleansing, and understanding the patterns in the data.
Machine Learning: ML engineers spend a lot of time managing the complexities that occur during the implementation of algorithms and the mathematical concepts behind them.

K-Nearest Neighbor
In statistics, the k-nearest neighbors’ algorithm (k-NN) is a non-parametric
classification method first developed by Evelyn Fix and Joseph Hodges in 1951, and
later expanded by Thomas Cover. It is used for classification and regression. In both
cases, the input consists of the k closest training examples in a data set. The output
depends on whether k-NN is used for classification or regression:

 In k-NN classification, the output is a class membership. An object is classified by a plurality vote of its neighbors, with the object being assigned to the class
most common among its k nearest neighbors (k is a positive integer, typically
small). If k = 1, then the object is simply assigned to the class of that single
nearest neighbor.
 In k-NN regression, the output is the property value for the object. This value
is the average of the values of k nearest neighbors.

k-NN is a type of classification where the function is only approximated locally, and
all computation is deferred until function evaluation. Since this algorithm relies on
distance for classification, if the features represent different physical units or come in
vastly different scales then normalizing the training data can improve its accuracy
dramatically.

Both for classification and regression, a useful technique can be to assign weights to
the contributions of the neighbors, so that the nearer neighbors contribute more to
the average than the more distant ones. For example, a common weighting scheme
consists in giving each neighbor a weight of 1/d, where d is the distance to the
neighbor.

The neighbors are taken from a set of objects for which the class (for k-NN
classification) or the object property value (for k-NN regression) is known. This can
be thought of as the training set for the algorithm, though no explicit training step is
required.

A peculiarity of the k-NN algorithm is that it is sensitive to the local structure of the
data.

Algorithm

As an illustration of k-NN classification, consider a test sample (drawn as a green dot in the usual figure) that should be classified as either a blue square or a red triangle. If k = 3 (the solid-line circle), it is assigned to the red triangles because there are 2 triangles and only 1 square inside the inner circle. If k = 5 (the dashed-line circle), it is assigned to the blue squares (3 squares vs. 2 triangles inside the outer circle).
The training examples are vectors in a multidimensional feature space, each with a
class label. The training phase of the algorithm consists only of storing the feature
vectors and class labels of the training samples.

In the classification phase, k is a user-defined constant, and an unlabeled vector (a query or test point) is classified by assigning the label which is most frequent among
the k training samples nearest to that query point.

A commonly used distance metric for continuous variables is Euclidean distance. For
discrete variables, such as for text classification, another metric can be used, such as
the overlap metric (or Hamming distance). In the context of gene expression
microarray data, for example, k-NN has been employed with correlation coefficients,
such as Pearson and Spearman, as a metric.[6] Often, the classification accuracy of
k-NN can be improved significantly if the distance metric is learned with specialized
algorithms such as Large Margin Nearest Neighbor or Neighborhood components
analysis.

A drawback of the basic “majority voting” classification occurs when the class
distribution is skewed. That is, examples of a more frequent class tend to dominate
the prediction of the new example, because they tend to be common among the k
nearest neighbors due to their large number. One way to overcome this problem is
to weight the classification, considering the distance from the test point to each of
its k nearest neighbors. The class (or value, in regression problems) of each of the k
nearest points is multiplied by a weight proportional to the inverse of the distance
from that point to the test point. Another way to overcome skew is by abstraction in
data representation. For example, in a self-organizing map (SOM), each node is a
representative (a center) of a cluster of similar points, regardless of their density in
the original training data. K-NN can then be applied to the SOM.

Parameter Selection

The best choice of k depends upon the data; generally, larger values of k reduce
the effect of noise on the classification, but make boundaries between classes
less distinct. A good k can be selected by various heuristic techniques (see
hyperparameter optimization). The special case where the class is predicted to be
the class of the closest training sample (i.e. when k = 1) is called the nearest
neighbor algorithm.

The accuracy of the k-NN algorithm can be severely degraded by the presence of
noisy or irrelevant features, or if the feature scales are not consistent with their
importance. Much research effort has been put into selecting or scaling features to
improve classification. A particularly popular approach is the use of evolutionary
algorithms to optimize feature scaling. Another popular approach is to scale features
by the mutual information of the training data with the training classes.

In binary (two-class) classification problems, it is helpful to choose k to be an odd number as this avoids tied votes. One popular way of choosing the empirically optimal k in this setting is via the bootstrap method.
The KNN Algorithm

1. Load the data.
2. Initialize K to your chosen number of neighbors.
3. For each example in the data:

3.1 Calculate the distance between the query example and the current example from
the data.

3.2 Add the distance and the index of the example to an ordered collection.

4. Sort the ordered collection of distances and indices from smallest to largest
(in ascending order) by the distances.
5. Pick the first K entries from the sorted collection.
6. Get the labels of the selected K entries.
7. If regression, return the mean of the K labels.
8. If classification, return the mode of the K labels.
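
A minimal plain-Python implementation of these steps, using Euclidean distance and the mode of the labels; the toy training data and query point are invented for illustration:

from collections import Counter
from math import dist

def knn_classify(training_data, query, k):
    # Steps 3-4: compute the distance to every training example and sort ascending.
    neighbours = sorted(training_data, key=lambda example: dist(example[0], query))
    # Steps 5-6: take the labels of the first K entries.
    k_labels = [label for _, label in neighbours[:k]]
    # Step 8: return the mode of the K labels (for regression, return their mean instead).
    return Counter(k_labels).most_common(1)[0][0]

training_data = [((1.0, 1.0), "red"), ((1.2, 0.8), "red"),
                 ((4.0, 4.0), "blue"), ((4.2, 4.1), "blue"), ((3.8, 4.3), "blue")]
print(knn_classify(training_data, query=(3.9, 4.0), k=3))   # -> "blue"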

Choosing the right value for K

To select the K that’s right for your data, we run the KNN algorithm several times
with different values of K and choose the K that reduces the number of errors we
encounter while maintaining the algorithm’s ability to accurately make predictions
when it’s given data it hasn’t seen before.

Here are some things to keep in mind:

 As we decrease the value of K to 1, our predictions become less stable. Imagine K=1 and a query point surrounded by several red points and one green point, where the green point happens to be the single nearest neighbor. Reasonably, we would think the query point is most likely red, but because K=1, KNN incorrectly predicts that the query point is green.
 Inversely, as we increase the value of K, our predictions become more stable
due to majority voting / averaging, and thus, more likely to make more
accurate predictions (up to a certain point). Eventually, we begin to witness
an increasing number of errors. It is at this point we know we have pushed
the value of K too far.
 In cases where we are taking a majority vote (e.g. picking the mode in a
classification problem) among labels, we usually make K an odd number to
have a tiebreaker.

Advantages

 The algorithm is simple and easy to implement.


 There’s no need to build a model, tune several parameters, or make
additional assumptions.
 The algorithm is versatile. It can be used for classification, regression, and
search (as we will see in the next section).

Disadvantages

The algorithm gets significantly slower as the number of examples and/or predictors/independent variables increases.

KNN in practice

KNN’s main disadvantage of becoming significantly slower as the volume of data increases makes it an impractical choice in environments where predictions need to
be made rapidly. Moreover, there are faster algorithms that can produce more
accurate classification and regression results.

However, provided you have sufficient computing resources to speedily handle the
data you are using to make predictions, KNN can still be useful in solving problems
that have solutions that depend on identifying similar objects. An example of this is
using the KNN algorithm in recommender systems, an application of KNN-search.

Decision Trees
Decision trees are a method for defining complex relationships by describing decisions and
avoiding problems in communication. A decision tree is a diagram that shows alternative actions and conditions within a horizontal tree framework. Thus, it depicts which conditions to
consider first, second, and so on.
Decision trees depict the relationship of each condition and their permissible actions. A
square node indicates an action, and a circle indicates a condition. It forces analysts to
consider the sequence of decisions and identifies the actual decision that must be made.

The major limitation of a decision tree is that it lacks information in its format to describe
what other combinations of conditions you can take for testing. It is a single representation of
the relationships between conditions and actions.


Logistic Regression
In statistics, the logistic model (or logit model) is used to model the probability of a
certain class or event existing such as pass/fail, win/lose, alive/dead or healthy/sick.
This can be extended to model several classes of events such as determining
whether an image contains a cat, dog, lion, etc. Each object being detected in the
image would be assigned a probability between 0 and 1, with a sum of one.

Logistic regression is a statistical model that in its basic form uses a logistic function
to model a binary dependent variable, although many more complex extensions
exist. In regression analysis, logistic regression (or logit regression) is estimating the
parameters of a logistic model (a form of binary regression). Mathematically, a
binary logistic model has a dependent variable with two possible values, such as
pass/fail which is represented by an indicator variable, where the two values are
labeled “0” and “1”. In the logistic model, the log-odds (the logarithm of the odds)
for the value labeled “1” is a linear combination of one or more independent
variables (“predictors”); the independent variables can each be a binary variable
(two classes, coded by an indicator variable) or a continuous variable (any real
value). The corresponding probability of the value labeled “1” can vary between 0
(certainly the value “0”) and 1 (certainly the value “1”), hence the labeling; the
function that converts log-odds to probability is the logistic function, hence the
name. The unit of measurement for the log-odds scale is called a logit, from logistic
unit, hence the alternative names. Analogous models with a different sigmoid
function instead of the logistic function can also be used, such as the probit model;
the defining characteristic of the logistic model is that increasing one of the
independent variables multiplicatively scales the odds of the given outcome at a
constant rate, with each independent variable having its own parameter; for a binary
dependent variable this generalizes the odds ratio.
In a binary logistic regression model, the dependent variable has two levels
(categorical). Outputs with more than two values are modeled by multinomial
logistic regression and, if the multiple categories are ordered, by ordinal logistic
regression (for example the proportional odds ordinal logistic model). The logistic
regression model itself simply models probability of output in terms of input and
does not perform statistical classification (it is not a classifier), though it can be used
to make a classifier, for instance by choosing a cutoff value and classifying inputs
with probability greater than the cutoff as one class, below the cutoff as the other;
this is a common way to make a binary classifier. The coefficients are generally not
computed by a closed-form expression, unlike linear least squares. Logistic regression as a general statistical model was originally developed
and popularized primarily by Joseph Berkson, beginning in Berkson (1944), where he
coined “logit”.

Applications

Logistic regression is used in various fields, including machine learning, most medical
fields, and social sciences. For example, the Trauma and Injury Severity Score
(TRISS), which is widely used to predict mortality in injured patients, was originally
developed by Boyd using logistic regression. Many other medical scales used to
assess severity of a patient have been developed using logistic regression. Logistic
regression may be used to predict the risk of developing a given disease (e.g.
diabetes; coronary heart disease), based on observed characteristics of the patient
(age, sex, body mass index, results of various blood tests, etc.). Another example
might be to predict whether a Nepalese voter will vote Nepali Congress or
Communist Party of Nepal or Any Other Party, based on age, income, sex, race,
state of residence, votes in previous elections, etc. The technique can also be used
in engineering, especially for predicting the probability of failure of a given process,
system or product. It is also used in marketing applications such as prediction of a
customer’s propensity to purchase a product or halt a subscription, etc. In
economics it can be used to predict the likelihood of a person ending up in the labor
force, and a business application would be to predict the likelihood of a homeowner
defaulting on a mortgage. Conditional random fields, an extension of logistic
regression to sequential data, are used in natural language processing.

Representation Used for Logistic Regression

Logistic regression uses an equation as the representation, very much like linear
regression.

Input values (x) are combined linearly using weights or coefficient values (referred
to as the Greek capital letter Beta) to predict an output value (y). A key difference
from linear regression is that the output value being modeled is a binary value (0 or
1) rather than a numeric value.

Below is an example logistic regression equation:

y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))


Where y is the predicted output, b0 is the bias or intercept term and b1 is the
coefficient for the single input value (x). Each column in your input data has an
associated b coefficient (a constant real value) that must be learned from your
training data.

The actual representation of the model that you would store in memory or in a file
are the coefficients in the equation (the beta value or b’s).
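
A minimal sketch of making a prediction with this representation; the coefficient values b0 and b1 below are invented for illustration and would normally be learned from training data:

from math import exp

b0, b1 = -6.0, 0.05            # illustrative intercept and coefficient

def predict_probability(x):
    # The logistic function applied to the linear combination b0 + b1*x.
    return exp(b0 + b1 * x) / (1 + exp(b0 + b1 * x))

def predict_class(x, cutoff=0.5):
    # Snap the probability into a 0/1 prediction using a cutoff.
    return 1 if predict_probability(x) >= cutoff else 0

print(round(predict_probability(150), 3))   # probability that input x = 150 belongs to class "1"
print(predict_class(150))                   # -> 1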

Logistic Regression Predicts Probabilities (Technical Interlude)

Logistic regression models the probability of the default class (e.g. the first class).

For example, if we are modeling people’s sex as male or female from their height,
then the first class could be male and the logistic regression model could be written
as the probability of male given a person’s height, or more formally:

P(sex=male|height)

Written another way, we are modeling the probability that an input (X) belongs to
the default class (Y=1), we can write this formally as:

P(X) = P(Y=1|X)

We’re predicting probabilities, even though logistic regression is often described as a classification algorithm.

Note that the probability prediction must be transformed into a binary value (0 or 1) in order to actually make a crisp class prediction. More on this below when we talk
about making predictions.

Logistic regression is a linear method, but the predictions are transformed using the
logistic function. The impact of this is that we can no longer understand the
predictions as a linear combination of the inputs as we can with linear regression, for
example, continuing on from above, the model can be stated as:

p(X) = e^(b0 + b1*X) / (1 + e^(b0 + b1*X))

I don’t want to dive into the math too much, but we can turn around the above
equation as follows (remember we can remove the e from one side by adding a
natural logarithm (ln) to the other):

ln(p(X) / (1 – p(X))) = b0 + b1 * X

This is useful because we can see that the calculation of the output on the right is
linear again (just like linear regression), and the input on the left is a log of the
probability of the default class.
This ratio on the left is called the odds of the default class (it’s historical that we use
odds, for example, odds are used in horse racing rather than probabilities). Odds are
calculated as a ratio of the probability of the event divided by the probability of not
the event, e.g. 0.8/(1-0.8) which has the odds of 4. So, we could instead write:

ln(odds) = b0 + b1 * X

Because the odds are log transformed, we call this left-hand side the log-odds or the logit. It is possible to use other types of functions for the transform (which is out of scope here), but it is common to refer to the transform that relates the linear regression equation to the probabilities as the link function, e.g. the logit link function.

We can move the exponent back to the right and write it as:

odds = e^(b0 + b1 * X)

All of this helps us understand that indeed the model is still a linear combination of
the inputs, but that this linear combination relates to the log-odds of the default
class.
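
A tiny numeric check of this probability, odds and log-odds relationship, reusing the p = 0.8 example from above:

from math import exp, log

p = 0.8
odds = p / (1 - p)                             # 0.8 / 0.2 = 4
log_odds = log(odds)                           # ~1.386; the quantity modeled as b0 + b1*X
p_back = exp(log_odds) / (1 + exp(log_odds))   # the logistic function recovers the probability
print(round(odds, 3), round(log_odds, 3), round(p_back, 3))   # 4.0 1.386 0.8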

Prepare Data for Logistic Regression

The assumptions made by logistic regression about the distribution and relationships
in your data are much the same as the assumptions made in linear regression.

Much study has gone into defining these assumptions and precise probabilistic and
statistical language is used. My advice is to use these as guidelines or rules of thumb
and experiment with different data preparation schemes.

Ultimately in predictive modelling machine learning projects you are laser focused on
making accurate predictions rather than interpreting the results. As such, you can
break some assumptions as long as the model is robust and performs well.

 Binary Output Variable: This might be obvious as we have already mentioned it, but logistic regression is intended for binary (two-class)
classification problems. It will predict the probability of an instance belonging
to the default class, which can be snapped into a 0 or 1 classification.
 Remove Noise: Logistic regression assumes no error in the output variable
(y), so consider removing outliers and possibly misclassified instances from your
training data.
 Gaussian Distribution: Logistic regression is a linear algorithm (with a non-
linear transform on output). It does assume a linear relationship between the
input variables and the log-odds of the output. Data transforms of your input variables that
better expose this linear relationship can result in a more accurate model. For
example, you can use log, root, Box-Cox and other univariate transforms to
better expose this relationship.
 Remove Correlated Inputs: Like linear regression, the model can overfit if
you have multiple highly correlated inputs. Consider calculating the pairwise
correlations between all inputs and removing highly correlated inputs (a sketch of this step follows the list below).
 Fail to Converge: It is possible for the maximum likelihood estimation
process that learns the coefficients to fail to converge. This can happen if
there are many highly correlated inputs in your data or the data is very sparse
(e.g. lots of zeros in your input data).
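
As a minimal sketch of one of these preparation steps, removing highly correlated inputs, assuming pandas and NumPy are available (the column names, synthetic data and the 0.9 threshold are illustrative assumptions):

import numpy as np
import pandas as pd

# Synthetic inputs: x2 is deliberately built to be almost a copy of x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
df = pd.DataFrame({"x1": x1,
                   "x2": 0.98 * x1 + rng.normal(scale=0.05, size=200),
                   "x3": rng.normal(size=200)})

# Keep only the upper triangle of the absolute correlation matrix,
# then drop any column correlated above 0.9 with an earlier column.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
df_reduced = df.drop(columns=to_drop)
print(to_drop)   # ['x2']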


Applications of Supervised learning in multiple domains

In supervised learning, the computer is taught by example. It learns from past data
and applies the learning to present data to predict future events. In this case, both
input and desired output data provide help to the prediction of future events.

1) Classification Models: Classification models are used for problems where the
output variable can be categorized, such as “Yes” or “No”, or “Pass” or “Fail.”
Classification Models are used to predict the category of the data. Real-life examples
include spam detection, sentiment analysis, scorecard prediction of exams, etc.

2) Regression Models: Regression models are used for problems where the
output variable is a real value, such as a price, salary, weight, or pressure. They are
most often used to predict numerical values based on previous data observations.
Some of the more familiar regression algorithms include linear regression, polynomial
regression, and ridge regression (logistic regression, despite its name, is a
classification algorithm).
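
Similarly, a minimal regression sketch (again with scikit-learn; the salary figures
are invented for illustration):

# Predict a numeric value (a salary) from years of experience.
from sklearn.linear_model import LinearRegression

X = [[1], [2], [3], [5], [8]]            # years of experience
y = [30000, 35000, 41000, 52000, 70000]  # observed salaries

reg = LinearRegression().fit(X, y)
print(reg.predict([[4]]))                # predicted salary for 4 years of experience
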
Speech Recognition

When using Google, we get the option to “Search by Voice.” This comes under speech
recognition, and it is a popular application of machine learning.

Speech recognition is the process of converting voice instructions into text; it is
also known as “speech to text” or “computer speech recognition.” At present,
machine learning algorithms are widely used in speech recognition applications.
Google Assistant, Siri, Cortana, and Alexa use speech recognition technology to
follow voice instructions.

Image Recognition

Image recognition is one of the most common applications of machine learning. It is
used to identify objects, persons, places, etc. in digital images. A popular use case of
image recognition and face detection is automatic friend tagging suggestion:

Facebook provides a feature of automatic friend tagging suggestions. Whenever we
upload a photo with our Facebook friends, we automatically get a tagging suggestion
with their names, and the technology behind this is machine learning’s face detection
and recognition algorithm. It is based on Facebook’s project named “DeepFace,”
which is responsible for face recognition and person identification in the picture.

Traffic prediction

If we want to visit a new place, we take the help of Google Maps, which shows us
the correct path with the shortest route and predicts the traffic conditions. It predicts
whether traffic is clear, slow-moving, or heavily congested in two ways:

 Real-time location of the vehicle from the Google Maps app and sensors
 Average time taken on past days at the same time

Everyone who uses Google Maps is helping to make the app better: it takes
information from the user and sends it back to its database to improve performance.

Self-driving cars

One of the most exciting applications of machine learning is self-driving cars.
Machine learning plays a significant role in self-driving cars. Tesla, a well-known car
manufacturer, is working on self-driving cars and uses machine learning methods to
train its models to detect people and objects while driving.

Email Spam and Malware Filtering


Whenever we receive a new email, it is filtered automatically as important, normal,
or spam. Important mail arrives in our inbox marked with the important symbol,
while spam emails go to our spam box; the technology behind this is machine
learning. Below are some spam filters used by Gmail:

 Content Filter
 Header filter
 General blacklists filter
 Rules-based filters
 Permission filters

Some machine learning algorithms such as Multi-Layer Perceptron, Decision Tree,
and Naïve Bayes classifier are used for email spam filtering and malware detection.
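
As a minimal sketch of the idea (using scikit-learn, an assumed dependency, with a
Naïve Bayes classifier; the example messages and labels are invented):

# A tiny Naive Bayes spam filter over a bag-of-words representation.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

messages = ["win a free prize now", "lowest price guaranteed, click here",
            "meeting rescheduled to monday", "please review the attached report"]
labels = ["spam", "spam", "normal", "normal"]

spam_filter = make_pipeline(CountVectorizer(), MultinomialNB())
spam_filter.fit(messages, labels)
print(spam_filter.predict(["click here to win a prize"]))   # expected: ['spam']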

Product recommendations

Machine learning is widely used by various e-commerce and entertainment
companies, such as Amazon and Netflix, for product recommendations to the user.
Whenever we search for a product on Amazon, we start getting advertisements for
the same product while surfing the internet on the same browser, and this is because
of machine learning.

Google understands the user’s interests using various machine learning algorithms
and suggests products accordingly. Similarly, when we use Netflix, we find
recommendations for entertainment series, movies, etc., and this is also done with
the help of machine learning.

Virtual Personal Assistant

We have various virtual personal assistants such as Google Assistant, Alexa, Cortana,
and Siri. As the name suggests, they help us find information using our voice
instructions. These assistants can help us in various ways just by our voice
instructions, such as playing music, calling someone, opening an email, scheduling an
appointment, etc. Machine learning algorithms are an important part of these virtual
assistants: they record our voice instructions, send them to a server in the cloud,
decode them using ML algorithms, and act accordingly.

Online Fraud Detection

Machine learning is making our online transactions safe and secure by detecting
fraudulent transactions. Whenever we perform an online transaction, there are
various ways a fraudulent transaction can take place, such as fake accounts, fake
IDs, and money being stolen in the middle of a transaction. To detect this, a feed-
forward neural network helps us check whether a transaction is genuine or
fraudulent.

For each genuine transaction, the output is converted into hash values, and these
values become the input for the next round. Each genuine transaction follows a
specific pattern, which changes for a fraudulent transaction; hence the system
detects it and makes our online transactions more secure.
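
A minimal sketch of such a feed-forward classifier, using scikit-learn's MLPClassifier
(an assumed library; the transaction features and values are invented):

# Flag transactions as genuine (0) or fraudulent (1) with a small feed-forward network.
from sklearn.neural_network import MLPClassifier

# [amount, seconds_since_last_transaction, is_new_device]
X = [[25, 86400, 0], [40, 72000, 0], [900, 30, 1], [1200, 20, 1],
     [15, 90000, 0], [750, 60, 1]]
y = [0, 0, 1, 1, 0, 1]

clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
clf.fit(X, y)
print(clf.predict([[1000, 45, 1]]))   # flag a suspicious-looking transaction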

Stock Market trading

Machine learning is widely used in stock market trading. In the stock market, there
is always a risk of ups and downs in share prices, so machine learning’s long short-
term memory (LSTM) neural network is used for the prediction of stock market trends.
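
A rough sketch of such an LSTM, assuming TensorFlow/Keras is installed; the price
series is synthetic and purely illustrative, not real market data:

# Map a short window of past prices to the next price with an LSTM.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

prices = np.sin(np.linspace(0, 20, 200))             # stand-in for a real price series
window = 10
X = np.array([prices[i:i + window] for i in range(len(prices) - window)])
y = prices[window:]
X = X.reshape((-1, window, 1))                        # (samples, time steps, features)

model = Sequential([LSTM(16, input_shape=(window, 1)), Dense(1)])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, verbose=0)
print(model.predict(X[-1:]))                          # rough forecast of the next value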

Medical Diagnosis

In medical science, machine learning is used for disease diagnosis. With this,
medical technology is growing very fast and is able to build 3D models that can
predict the exact position of lesions in the brain. It helps in finding brain tumors and
other brain-related diseases easily.

Automatic Language Translation

Nowadays, if we visit a new place and are not aware of the language, it is not a
problem at all, as machine learning helps us by converting the text into languages we
know. Google’s GNMT (Google Neural Machine Translation) provides this feature; it is
a neural machine translation system that translates text into our familiar language,
and this is called automatic translation.

The technology behind automatic translation is a sequence-to-sequence learning
algorithm, which is used with image recognition and translates the text from one
language to another.

Application of Supervised Learning in Solving Business Problems such as Pricing,
Customer Relationship Management, Sales, and Marketing
Pricing

Companies can mine their historical pricing data along with data sets on a host of
other variables to understand how certain dynamics, from time of day to weather to
the seasons, impact demand for goods and services. Machine learning algorithms can
learn from that information and combine that insight with additional market and
consumer data to help companies dynamically price their goods based on those vast
and numerous variables, a strategy that ultimately helps companies maximize
revenue.
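
As a minimal sketch of the idea, a regression model (here scikit-learn's
GradientBoostingRegressor, an assumed choice) can learn how demand responds to price
and context, and candidate prices can then be scored for expected revenue; all column
names and numbers below are hypothetical:

# Learn demand as a function of price, hour of day, and temperature, then score prices.
from sklearn.ensemble import GradientBoostingRegressor

# [price, hour_of_day, temperature_C] -> units sold
X = [[10, 9, 20], [10, 18, 25], [12, 9, 20], [12, 18, 25],
     [15, 9, 30], [15, 18, 30], [8, 12, 15], [8, 20, 15]]
y = [120, 150, 100, 130, 80, 95, 160, 170]

demand_model = GradientBoostingRegressor().fit(X, y)

# Estimate expected revenue for candidate prices in a given context (6 pm, 22 degrees).
for price in (8, 10, 12, 15):
    units = demand_model.predict([[price, 18, 22]])[0]
    print(price, round(price * units, 1))
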
The most visible example of dynamic pricing (which is sometimes called demand
pricing) happens in the transportation industry:

Think surge pricing at Uber when conditions push up the number of people seeking
rides all at once or sky-high prices for airline tickets during school vacation weeks.

Machine learning applications don’t just help companies set prices; they also help
companies deliver the right products and services to the right areas at the right time
through predictive inventory planning and customer segmentation. Retailers, for
example, use machine learning to predict what inventory will sell best in which of
their stores, based on the seasonal factors impacting a particular store, the
demographics of that region, and other data points such as what’s trending on social
media, said Adnan Masood, who as chief architect at UST Global specializes in AI and
machine learning.

Customer Relationship management

Sales performance. Is there a way to understand why one middle-level sales
executive brings twice as much lead conversion as another middle-level exec sitting
in the same office? Technically, they both send emails, set up calls, and participate in
conferences, which somehow result in conversions or the lack thereof. Any time we
talk about what drives salespeople’s performance, we make assumptions prone to
bias. A good example of ML use here is People.ai, a startup which tries to address
the problem by tracking all the sales data, including emails, calls, and CRM
interactions, to use this data as a supervised learning set and predict which kinds of
actions bring better results. Basically, the algorithm aids in developing a playbook for
sales reps based on successful cases.

Retention. Similar tracking techniques, such as text sentiment and other metadata
analysis (from emails and social media posts), can be applied to detect possible
job-hopping behavior among employees.

Human resource allocation. You can use historic data from HR software (sick
days, vacations, holidays, etc.) to make broader predictions about your workforce.
Deloitte disclosed that several automotive companies are learning from the patterns
of unscheduled absences to forecast the periods when people are likely to take a day
off, and reserve more workforce accordingly.

Customer recommendation engines

Machine learning powers the customer recommendation engines designed to
enhance the customer experience and provide personalized experiences. In this use
case, algorithms process data points about an individual customer, such as the
customer’s past purchases, as well as other data sets such as a company’s current
inventory, demographic trends, and other customers’ buying histories to determine
what products and services to recommend to each individual customer.
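
A minimal item-based sketch of this idea (cosine similarity over a tiny invented
purchase matrix, using NumPy and scikit-learn as assumed dependencies):

# Recommend products similar to what a customer already bought.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Rows = customers, columns = products (1 = purchased).
purchases = np.array([[1, 1, 0, 0],
                      [1, 1, 1, 0],
                      [0, 0, 1, 1],
                      [0, 1, 1, 1]])
item_similarity = cosine_similarity(purchases.T)      # product-to-product similarity

customer = purchases[0]                               # this customer bought products 0 and 1
scores = item_similarity @ customer                   # score every product for this customer
scores[customer == 1] = -1                            # do not re-recommend owned products
print("recommend product", int(np.argmax(scores)))
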
Here are a few examples of companies whose business models rely on
recommendation engines:

 Big e-commerce companies like Amazon and Walmart use recommendation engines
to personalize and expedite the shopping experience.
 Another well-known deployer of this machine learning application is Netflix, the
streaming entertainment service, which uses a customer’s viewing history, the
viewing history of customers with similar entertainment interests, information about
individual shows and other data points to deliver personalized recommendations to
its customers.
 Online video platform YouTube uses recommendation engine technology to help
users quickly find videos that fit their tastes.

Sales and Marketing

Digital marketing and online-driven sales are the first application fields that you may
think of for machine learning adoption. People interact with the web and leave a
detailed footprint to be analyzed. While there are tangible results from unsupervised
learning techniques for marketing and sales, the largest value impact is in the
supervised learning field. Let’s have a look.

Lifetime Value. The customer lifetime value that we mentioned before is usually
measured as the net profit a customer brings to a company in the long run. If you’ve
been tracking most of your customers and accurately documenting their in-funnel
and further purchase behavior, you have enough data to make predictions about
most budding customers early and target sales efforts toward them.

Churn. The churn rate defines the number of customers who cease to complete
target actions (e.g., add to cart, leave a comment, check out) during a given period.
As with lifetime value predictions, sorting “likely-to-churn-soon” customers from
engaged ones (see the sketch after this list) will allow you to:

1) Analyze the reasons for such behavior.

2) Refocus and personalize offerings for different groups of churning customers.
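
A minimal churn-scoring sketch (scikit-learn assumed; the features, namely logins in
the last month, days since the last order, and open tickets, are invented along with
the labels):

# Sort "likely-to-churn-soon" customers from engaged ones by predicted churn probability.
from sklearn.ensemble import RandomForestClassifier

X = [[12, 3, 0], [10, 5, 1], [1, 60, 2], [0, 90, 3], [8, 7, 0], [2, 45, 1]]
y = [0, 0, 1, 1, 0, 1]   # 1 = churned

model = RandomForestClassifier(random_state=0).fit(X, y)
churn_risk = model.predict_proba([[3, 30, 1]])[0][1]   # probability of churning
print(round(churn_risk, 2))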

Sentiment analysis. Skimming through thousands of feedback posts on social
media and in comments sections is painstaking work, especially in B2C after a new
product or feature rollout. Sentiment analysis backed by natural language processing
allows for aggregating and yielding analytics on customer feedback. You may
experiment with sentiment analysis using the Google Cloud Natural Language API to
understand how this works and what kinds of analytics may be available.
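
A rough sketch of calling that API from Python, assuming the google-cloud-language
client library is installed and credentials are configured; the exact call signature
can differ between client-library versions, and the sample text is invented:

# Request a sentiment score and magnitude for a piece of customer feedback.
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()
document = language_v1.Document(
    content="The new feature is fantastic, but setup was confusing.",
    type_=language_v1.Document.Type.PLAIN_TEXT,
)
response = client.analyze_sentiment(request={"document": document})
sentiment = response.document_sentiment
print(sentiment.score, sentiment.magnitude)   # score in [-1, 1], magnitude >= 0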

Recommendations. Recommendation sections are something we can’t imagine
modern eCommerce or media without. The common practice is to recommend other
popular products or the ones you want to sell most, which doesn’t require machine
learning algorithms at all. But if you want to engage customers with deep
personalization, you can apply machine learning techniques to identify the products
that a customer is most likely to buy next and put them at the top of the
recommendation list. Netflix, YouTube, and other video streaming services operate in
a similar way, tailoring their recommendations to a viewer’s lifetime behavior.

Unsupervised Learning
Unsupervised learning is a type of machine learning in which the algorithm is not
provided with any pre-assigned labels or scores for the training data. As a result,
unsupervised learning algorithms must first self-discover any naturally occurring
patterns in that training data set. Common examples include clustering, where the
algorithm automatically groups its training examples into categories with similar
features, and principal component analysis, where the algorithm finds ways to
compress the training data set by identifying which features are most useful for
discriminating between different training examples and discarding the rest. This
contrasts with supervised learning, in which the training data include pre-assigned
category labels (often assigned by a human, or taken from the output of a
non-learning classification algorithm). Other intermediate levels in the supervision
spectrum include
reinforcement learning, where only numerical scores are available for each training
example instead of detailed tags, and semi-supervised learning where only a portion
of the training data have been tagged.
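
A minimal sketch of these two tasks on unlabeled data, using scikit-learn (an assumed
library) and randomly generated points:

# Self-discover groups with k-means and compress features with PCA, without any labels.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                 # 100 unlabeled examples with 5 features

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
X_compressed = PCA(n_components=2).fit_transform(X)

print(labels[:10])            # self-discovered group for the first 10 examples
print(X_compressed.shape)     # (100, 2): same examples, fewer features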

Advantages of unsupervised learning include a minimal workload to prepare and
audit the training set, in contrast to supervised learning techniques where a
considerable amount of expert human labor is required to assign and verify the
initial tags, and greater freedom to identify and exploit previously undetected
patterns that may not have been noticed by the “experts”. This often comes at the
cost of unsupervised techniques requiring a greater amount of training data and
converging more slowly to acceptable performance, increased computational and
storage requirements during the exploratory process, and potentially greater
susceptibility to artifacts or anomalies in the training data that might be obviously
irrelevant or recognized as erroneous by a human, but are assigned undue
importance by the unsupervised learning algorithm.

Approaches

Common families of algorithms used in unsupervised learning include:

(1) Clustering

(2) Anomaly detection

(3) Neural networks (note that not all neural networks are unsupervised; they can
be trained by supervised, unsupervised, semi-supervised, or reinforcement methods)

(4) Latent variable models.


 Clustering methods include hierarchical clustering, k-means, mixture models,
DBSCAN, and OPTICS algorithm
 Anomaly detection methods include Local Outlier Factor, and Isolation Forest
 Neural network methods include autoencoders, deep belief networks, Hebbian
learning, generative adversarial networks (GANs), and self-organizing maps
 Approaches for learning latent variable models include expectation maximization
algorithm, the method of moments, and blind signal separation techniques (principal
component analysis, independent component analysis, non-negative matrix
factorization, singular value decomposition)

Method of moments

One statistical approach for unsupervised learning is the method of moments. In the
method of moments, the unknown parameters of interest in the model are related to
the moments of one or more random variables. These moments are empirically
estimated from the available data samples and used to calculate the most likely
value distributions for each parameter. The method of moments is shown to be
effective in learning the parameters of latent variable models, where in addition to
the observed variables available in the training and input data sets, several
unobserved latent variables are also assumed to exist and to determine the
categorization of each sample. One practical example of latent variable models in
machine learning is topic modeling, which is a statistical model for predicting the
words (observed variables) in a document based on the topic (latent variable) of the
document. The method of moments (tensor decomposition techniques) has been
shown to consistently recover the parameters of a large class of latent variable
models under certain assumptions.
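
As a small worked example of the general idea (not tied to topic modeling), the two
parameters of a gamma distribution can be recovered from the first two empirical
moments, since its mean is k * theta and its variance is k * theta**2:

# Method-of-moments estimates for a gamma distribution (shape k, scale theta).
import numpy as np

rng = np.random.default_rng(0)
samples = rng.gamma(shape=2.0, scale=3.0, size=10_000)   # "observed" data

mean = samples.mean()
variance = samples.var()

theta_hat = variance / mean       # from variance / mean = theta
k_hat = mean / theta_hat          # from mean = k * theta
print(round(k_hat, 2), round(theta_hat, 2))   # close to the true values 2.0 and 3.0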

The expectation–maximization algorithm is another practical method for learning
latent variable models. However, it can get stuck in local optima, and it is not
guaranteed to converge to the true unknown parameters of the model. In contrast,
using the method of moments, global convergence is guaranteed under some
conditions.

Types of Unsupervised Learning Algorithm:

Clustering: Clustering is a method of grouping objects into clusters such that
objects with the most similarities remain in one group and have few or no similarities
with the objects of another group. Cluster analysis finds the commonalities between
the data objects and categorizes them according to the presence and absence of
those commonalities.

Association: An association rule is an unsupervised learning method which is used
for finding relationships between variables in a large database. It determines the set
of items that occur together in the dataset. Association rules make marketing
strategy more effective; for example, people who buy item X (say, bread) also tend
to purchase item Y (butter or jam). A typical example of association rules is Market
Basket Analysis.
Advantages

 Unsupervised learning is used for more complex tasks as compared to supervised
learning because, in unsupervised learning, we don’t have labelled input data.
 Unsupervised learning is preferable as it is easy to get unlabeled data in comparison
to labelled data.

Disadvantages

 Unsupervised learning is intrinsically more difficult than supervised learning as it
does not have corresponding output.
 The result of the unsupervised learning algorithm might be less accurate as input
data is not labelled, and algorithms do not know the exact output in advance.

Unsupervised Learning algorithms:

List of some popular unsupervised learning algorithms:

 K-means clustering
 KNN (k-nearest neighbors)
 Hierarchical clustering
 Anomaly detection
 Neural Networks
 Principal Component Analysis
 Independent Component Analysis
 Apriori algorithm
 Singular value decomposition

Clustering, Hierarchical clustering


Cluster analysis or clustering is the task of grouping a set of objects in such a way
that objects in the same group (called a cluster) are more similar (in some sense) to
each other than to those in other groups (clusters). It is a main task of exploratory
data analysis, and a common technique for statistical data analysis, used in many
fields, including pattern recognition, image analysis, information retrieval,
bioinformatics, data compression, computer graphics and machine learning.

Cluster analysis itself is not one specific algorithm, but the general task to be solved.
It can be achieved by various algorithms that differ significantly in their
understanding of what constitutes a cluster and how to efficiently find them. Popular
notions of clusters include groups with small distances between cluster members,
dense areas of the data space, intervals, or statistical distributions. Clustering can
therefore be formulated as a multi-objective optimization problem. The appropriate
clustering algorithm and parameter settings (including parameters such as the
distance function to use, a density threshold or the number of expected clusters)
depend on the individual data set and intended use of the results. Cluster analysis as
such is not an automatic task, but an iterative process of knowledge discovery or
interactive multi-objective optimization that involves trial and failure. It is often
necessary to modify data pre-processing and model parameters until the result
achieves the desired properties.

Besides the term clustering, there are several terms with similar meanings, including
automatic classification, numerical taxonomy, botryology (from Greek βότρυς
“grape”), typological analysis, and community detection. The subtle differences are
often in the use of the results: while in data mining, the resulting groups are the
matter of interest, in automatic classification the resulting discriminative power is of
interest.

Cluster analysis originated in anthropology with Driver and Kroeber in 1932, was
introduced to psychology by Joseph Zubin in 1938 and Robert Tryon in 1939, and was
famously used by Cattell beginning in 1943 for trait theory classification in
personality psychology.

The notion of a “cluster” cannot be precisely defined, which is one of the reasons
why there are so many clustering algorithms. There is a common denominator: a
group of data objects. However, different researchers employ different cluster
models, and for each of these cluster models again different algorithms can be
given. The notion of a cluster, as found by different algorithms, varies significantly in
its properties. Understanding these “cluster models” is key to understanding the
differences between the various algorithms. Typical cluster models include:

 Connectivity models: for example, hierarchical clustering builds models based on
distance connectivity.
 Centroid models: for example, the k-means algorithm represents each cluster by a
single mean vector.
 Distribution models: clusters are modeled using statistical distributions, such as the
multivariate normal distributions used by the expectation-maximization algorithm.
 Density models: for example, DBSCAN and OPTICS define clusters as connected
dense regions in the data space.
 Subspace models: in biclustering (also known as co-clustering or two-mode-
clustering), clusters are modeled with both cluster members and relevant attributes.
 Group models: some algorithms do not provide a refined model for their results
and just provide the grouping information.
 Graph-based models: a clique, that is, a subset of nodes in a graph such that
every two nodes in the subset are connected by an edge can be considered as a
prototypical form of cluster. Relaxations of the complete connectivity requirement (a
fraction of the edges can be missing) are known as quasi-cliques, as in the HCS
clustering algorithm.
 Signed graph models: Every path in a signed graph has a sign from the product of
the signs on the edges. Under the assumptions of balance theory, edges may change
sign and result in a bifurcated graph. The weaker “clusterability axiom” (no cycle has
exactly one negative edge) yields results with more than two clusters, or subgraphs
with only positive edges.
 Neural models: the most well-known unsupervised neural network is the self-
organizing map, and these models can usually be characterized as similar to one or
more of the above models and including subspace models when neural networks
implement a form of Principal Component Analysis or Independent Component
Analysis.

A “clustering” is essentially a set of such clusters, usually containing all objects in
the data set. Additionally, it may specify the relationship of the clusters to each
other, for example, a hierarchy of clusters embedded in each other. Clusterings can
be roughly distinguished as:

 Hard clustering: Each object belongs to a cluster or not
 Soft clustering (also: fuzzy clustering): Each object belongs to each cluster to a
certain degree (for example, a likelihood of belonging to the cluster)

There are also finer distinctions possible, for example:

 Strict partitioning clustering: Each object belongs to exactly one cluster
 Strict partitioning clustering with outliers: Objects can also belong to no
cluster, and are considered outliers
 Overlapping clustering (also: alternative clustering, multi-view clustering):
Objects may belong to more than one cluster; usually involving hard clusters
 Hierarchical clustering: Objects that belong to a child cluster also belong to the
parent cluster
 Subspace clustering: While an overlapping clustering, within a uniquely defined
subspace, clusters are not expected to overlap

Applications of Clustering in Different Fields

 Marketing: It can be used to characterize & discover customer segments for
marketing purposes.
 Biology: It can be used for classification among different species of plants and
animals.
 Libraries: It is used in clustering different books on the basis of topics and
information.
 Insurance: It is used to acknowledge the customers, their policies and identifying
the frauds.

Hierarchical Clustering

In data mining and statistics, hierarchical clustering (also called hierarchical cluster
analysis or HCA) is a method of cluster analysis which seeks to build a hierarchy of
clusters.

Hierarchical clustering, also known as hierarchical cluster analysis, is an algorithm
that groups similar objects into groups called clusters. The endpoint is a set of
clusters, where each cluster is distinct from every other cluster, and the objects
within each cluster are broadly similar to each other.

Strategies for hierarchical clustering generally fall into two types:


 Agglomerative: This is a “bottom-up” approach: each observation starts in
its own cluster, and pairs of clusters are merged as one moves up the
hierarchy.
 Divisive: This is a “top-down” approach: all observations start in one cluster,
and splits are performed recursively as one moves down the hierarchy.

In general, the merges and splits are determined in a greedy manner. The results of
hierarchical clustering are usually presented in a dendrogram.

Measures of distance (Similarity)

The distance between two clusters is often computed as the length of the straight
line drawn from one cluster to another. This is commonly referred to as the
Euclidean distance. Many other distance metrics have been developed.

The choice of distance metric should be made based on theoretical concerns from
the domain of study. That is, a distance metric needs to define similarity in a way
that is sensible for the field of study. For example, if clustering crime sites in a city,
city block distance may be appropriate. Or, better yet, the time taken to travel
between each location. Where there is no theoretical justification for an alternative,
Euclidean distance should generally be preferred, as it is usually the appropriate
measure of distance in the physical world.

Linkage Criteria

After selecting a distance metric, it is necessary to determine from where distance is
computed. For example, it can be computed between the two most similar parts of a
cluster (single-linkage), the two least similar bits of a cluster (complete-linkage), the
center of the clusters (mean or average-linkage), or some other criterion. Many
linkage criteria have been developed.

As with distance metrics, the choice of linkage criteria should be made based on
theoretical considerations from the domain of application. A key theoretical issue is
what causes variation. For example, in archeology, we expect variation to occur
through innovation and natural resources, so working out if two groups of artifacts
are similar may make sense based on identifying the most similar members of the
cluster.

Where there are no clear theoretical justifications for the choice of linkage criteria,
Ward’s method is the sensible default. This method works out which observations to
group based on reducing the sum of squared distances of each observation from the
average observation in a cluster. This is often appropriate as this concept of distance
matches the standard assumptions of how to compute differences between groups
in statistics (e.g., ANOVA, MANOVA).

Agglomerative versus divisive algorithms

Hierarchical clustering typically works by sequentially merging similar clusters. This
is known as agglomerative hierarchical clustering. In theory, it
can also be done by initially grouping all the observations into one cluster, and then
successively splitting these clusters. This is known as divisive hierarchical clustering.
Divisive clustering is rarely done in practice.
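
A minimal sketch of agglomerative clustering with SciPy (an assumed library), using
Ward's method as the linkage criterion on invented two-dimensional points:

# Build the merge hierarchy and cut it into two flat clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 1], [1.2, 1.1], [0.9, 1.3],      # one tight group
              [8, 8], [8.2, 7.9], [7.8, 8.3]])     # another tight group

Z = linkage(X, method="ward")                      # the merge history (the hierarchy)
labels = fcluster(Z, t=2, criterion="maxclust")    # cut the tree into 2 flat clusters
print(labels)                                      # e.g. [1 1 1 2 2 2]

# scipy.cluster.hierarchy.dendrogram(Z) would draw the usual tree diagram
# (requires matplotlib).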

Partitioning Clustering: K-Means Clustering

This clustering method classifies the information into multiple groups based on the
characteristics and similarity of the data. It is up to the data analyst to specify the
number of clusters that must be generated for the clustering method.

In the partitioning method, given a database (D) that contains multiple (N) objects,
the method constructs a user-specified number (K) of partitions of the data, in which
each partition represents a cluster and a particular region. Many algorithms come
under the partitioning method; some of the popular ones are K-Means, PAM
(K-Medoids), and the CLARA algorithm (Clustering Large Applications).

K-Means (A Centroid-Based Technique)

The K-means algorithm takes the input parameter K from the user and partitions the
dataset containing N objects into K clusters so that the resulting similarity among the
data objects inside a group (intracluster) is high, while the similarity of data objects
with data objects from outside the cluster (intercluster) is low. The similarity of a
cluster is determined with respect to the mean value of the cluster.

It is a type of squared-error algorithm. At the start, K objects from the dataset are
chosen at random, each of which represents a cluster mean (center). The remaining
data objects are assigned to the nearest cluster based on their distance from the
cluster mean. The new mean of each cluster is then calculated with the added data
objects.

Method:

1) Randomly select K objects from the dataset (D) as the initial cluster centers (C).

2) (Re)assign each object to the cluster whose mean it is most similar to.

3) Update the cluster means, i.e., recalculate the mean of each cluster with the
updated values.

4) Repeat steps 2 and 3 until no change occurs (as sketched below).
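
A minimal sketch of these steps in plain NumPy (illustrative and unoptimized; in
practice scikit-learn's KMeans would normally be used, and the toy points are
invented):

# Steps 1-4 above: initialize centers, assign, update means, repeat until stable.
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # step 1: random initial centers
    for _ in range(n_iter):
        # step 2: assign each object to the nearest cluster mean
        distances = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = np.argmin(distances, axis=1)
        # step 3: recalculate each cluster mean from its assigned objects
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):                # step 4: stop when nothing changes
            break
        centers = new_centers
    return labels, centers

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [8.5, 8.5], [1.2, 0.8], [9.0, 8.0]])
labels, centers = k_means(X, k=2)
print(labels)
print(centers)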

Density-Based Methods: DBSCAN, OPTICS


Density-Based Clustering method is one of the clustering methods based on
density (local cluster criterion), such as density-connected points. The basic ideas of
density-based clustering involve several new definitions. We intuitively present these
definitions and then follow up with an example.

The neighborhood within a radius ε of a given object is called the ε-neighborhood of
the object.

If the ε-neighborhood of an object contains at least a minimum number, MinPts, of
objects, then the object is called a core object.

Density Reachable:

A point p is density-reachable from a point q with respect to Eps and MinPts if there
is a chain of points p1, …, pn, with p1 = q and pn = p, such that pi+1 is directly
density-reachable from pi.

Density Connected

A point p is density-connected to a point q with respect to Eps and MinPts if there is
a point o such that both p and q are density-reachable from o with respect to Eps
and MinPts.

Working Of Density-Based Clustering

Given a set of objects, D, we say that an object p is directly density-reachable from
object q if p is within the ε-neighborhood of q, and q is a core object.

An object p is density-reachable from object q with respect to ε and MinPts in a set
of objects, D, if there is a chain of objects p1, …, pn, where p1 = q and pn = p, such
that pi+1 is directly density-reachable from pi with respect to ε and MinPts, for
1 ≤ i < n, pi ∈ D.

An object p is density-connected to object q with respect to ε and MinPts in a set of
objects, D, if there is an object o belonging to D such that both p and q are density-
reachable from o with respect to ε and MinPts.

Major features:

 It is used to discover clusters of arbitrary shape.
 It is also used to handle noise in the data clusters.
 It is a one scan method.
 It needs density parameters as a termination condition.

DBSCAN
It relies on a density-based notion of cluster:  A cluster is defined as a maximal set
of density-connected points.

It discovers clusters of arbitrary shape in spatial databases with noise.

DBSCAN Algorithm

Arbitrarily select a point p.

Retrieve all points density-reachable from p wrt Eps and MinPts

If p is a core point, a cluster is formed.

If p is a border point, no points are density-reachable from p and DBSCAN visits the
next point of the database.

Continue the process until all of the points have been processed.
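
In practice, a DBSCAN implementation such as the one in scikit-learn (an assumed
library) can be used directly; eps corresponds to Eps and min_samples to MinPts, and
the points below are invented:

# Two dense regions become clusters; the isolated point is labeled -1 (noise).
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 1], [1.1, 1.2], [0.9, 1.0], [1.2, 0.9],    # dense region A
              [8, 8], [8.1, 8.2], [7.9, 8.1], [8.2, 7.9],    # dense region B
              [4.5, 4.5]])                                    # isolated point

labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)   # cluster ids for the dense regions; -1 marks noise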

As an example, consider a small set of labeled points m, p, o, q, r, and s, and let
MinPts = 3.

Of the labeled points, m, p, o, and r are core objects because each is in an
ε-neighborhood containing at least three points.

q is directly density-reachable from m. m is directly density-reachable from p and
vice versa.

q is (indirectly) density-reachable from p because q is directly density-reachable
from m and m is directly density-reachable from p.

However, p is not density-reachable from q because q is not a core object.

Similarly, r and s are density-reachable from o, and o is density-reachable from r.

OPTICS: Ordering Points to Identify the Clustering Structure

OPTICS is an algorithm for finding density-based clusters in spatial data. It was
presented by Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, and Jörg
Sander. Its basic idea is similar to DBSCAN’s, but it addresses one of DBSCAN’s
major weaknesses: the problem of detecting meaningful clusters in data of varying
density. To do so, the points of the database are (linearly) ordered such that spatially
closest points become neighbors in the ordering. Additionally, a special distance is
stored for each point that represents the density that must be accepted for a cluster
so that both points belong to the same cluster. This is represented as a dendrogram.

 It produces a special order of the database with respect to its density-based
clustering structure.
 This cluster-ordering contains information equivalent to the density-based
clusterings corresponding to a broad range of parameter settings.

 It is good for both automatic and interactive cluster analysis, including finding
an intrinsic clustering structure.
 It can be represented graphically or using visualization techniques.

 Core-distance and reachability-distance: as an example of these concepts,
suppose that ε = 6 mm and MinPts = 5.
 The core-distance of p is the distance, ε′, between p and the fourth-closest
data object.
 The reachability-distance of q1 with respect to p is the core-distance of p
(i.e., ε′ = 3 mm) because this is greater than the Euclidean distance from p
to q1.
 The reachability-distance of q2 with respect to p is the Euclidean distance
from p to q2 because this is greater than the core-distance of p.
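
A minimal sketch using the OPTICS implementation in scikit-learn (an assumed library),
which exposes the ordering and the reachability distances described above; the data
deliberately mixes two densities and is randomly generated:

# Cluster data of varying density and inspect the reachability ordering.
import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.default_rng(0)
dense = rng.normal(loc=0.0, scale=0.3, size=(50, 2))     # tight cluster
sparse = rng.normal(loc=6.0, scale=1.5, size=(50, 2))    # looser cluster
X = np.vstack([dense, sparse])

optics = OPTICS(min_samples=5).fit(X)
print(optics.labels_[:10])                            # extracted cluster labels (-1 = noise)
print(optics.reachability_[optics.ordering_][:10])    # reachability values, in cluster order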

Association Rules: Introduction, Large Itemsets, Apriori Algorithm, and Applications
Association rule learning is a type of unsupervised learning technique that checks for
the dependency of one data item on another data item and maps them accordingly
so that the result can be more profitable. It tries to find interesting relations or
associations among the variables of a dataset, using different rules to discover the
interesting relations between variables in the database.

Association rule learning is a rule-based machine learning method for discovering
interesting relations between variables in large databases. It is intended to identify
strong rules discovered in databases using some measures of interestingness.

Association rule learning is one of the very important concepts of machine learning,
and it is employed in Market Basket Analysis, Web usage mining, continuous
production, etc. Here, market basket analysis is a technique used by various big
retailers to discover the associations between items. We can understand it by taking
the example of a supermarket, where products that are frequently purchased
together are placed together.

In addition to the above example from market basket analysis association rules are
employed today in many application areas including Web usage mining, intrusion
detection, continuous production, and bioinformatics. In contrast with sequence
mining, association rule learning typically does not consider the order of items either
within a transaction or across transactions.

Association rule learning can be divided into three types of algorithms:

Apriori

This algorithm uses frequent itemsets to generate association rules. It is designed to
work on databases that contain transactions. It uses a breadth-first search and a
Hash Tree to calculate the itemsets efficiently. It is mainly used for market basket
analysis and helps to understand the products that can be bought together. It can
also be used in the healthcare field to find drug reactions for patients.
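
A minimal sketch using the third-party mlxtend library (an assumed dependency, along
with pandas) on a tiny invented set of market-basket transactions:

# Find frequent itemsets with Apriori and derive association rules from them.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [["bread", "butter", "jam"],
                ["bread", "butter"],
                ["bread", "milk"],
                ["butter", "jam"]]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

frequent_itemsets = apriori(onehot, min_support=0.5, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])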

Eclat

The Eclat algorithm stands for Equivalence Class Transformation. This algorithm uses
a depth-first search technique to find frequent itemsets in a transaction database. It
executes faster than the Apriori algorithm.

F-P Growth Algorithm

The F-P Growth algorithm stands for Frequent Pattern Growth, and it is an improved
version of the Apriori algorithm. It represents the database in the form of a tree
structure known as a frequent pattern tree (FP-tree). The purpose of this tree is to
extract the most frequent patterns.

Applications of Association Rule Learning

 Market Basket Analysis: It is one of the popular examples and applications
of association rule mining. This technique is commonly used by big retailers to
determine the association between items.
 Medical Diagnosis: With the help of association rules, patients can be cured
easily, as it helps in identifying the probability of illness for a particular
disease.
 Protein Sequence: The association rules help in determining the synthesis
of artificial Proteins.
 It is also used for the Catalog Design and Loss-leader Analysis and many
more other applications.

Working of Association Rule Learning

Association rule learning works on the concept of an If-Then statement, such as if A
then B.

If A -> Then B

Here, the “If” element is called the antecedent, and the “Then” statement is called
the consequent. These types of relationships, where we can find some association or
relation between two items, are known as single cardinality. It is all about creating
rules, and if the number of items increases, then the cardinality also increases
accordingly. So, to measure the associations between thousands of data items, there
are several metrics. These metrics are given below:

 Support
 Confidence
 Lift
Let’s understand each of them:

Support

Support is the frequency of an item, or how frequently an itemset appears in the
dataset. It is defined as the fraction of the transactions T that contain the itemset X.
For an itemset X and a set of T transactions, it can be written as:

Supp (X) = Freq (X) / T

Confidence

Confidence indicates how often the rule has been found to be true, or how often the
items X and Y occur together in the dataset when the occurrence of X is already
given. It is the ratio of the number of transactions that contain both X and Y to the
number of transactions that contain X.

Confidence = Freq (X,Y) / Freq (X)

Lift

It is the strength of any rule, which can be defined by the formula below:

Lift = Supp (X,Y) / {Supp (X) * Supp (Y)}

It is the ratio of the observed support to the expected support if X and Y were
independent of each other. It has three possible ranges of values:

Lift = 1: The occurrence of the antecedent and the consequent are independent of
each other.

Lift > 1: It indicates the degree to which the two itemsets are dependent on each
other.

Lift < 1: It tells us that one item is a substitute for the other, which means one item
has a negative effect on the other.
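
Putting the three formulas together on a small invented set of transactions (with
X = bread and Y = butter):

# Compute support, confidence, and lift directly from their definitions.
transactions = [{"bread", "butter"}, {"bread", "butter", "jam"},
                {"bread", "milk"}, {"butter", "jam"}, {"bread", "butter", "milk"}]
n = len(transactions)

freq_x = sum("bread" in t for t in transactions)                      # 4
freq_y = sum("butter" in t for t in transactions)                     # 4
freq_xy = sum({"bread", "butter"} <= t for t in transactions)         # 3

support_xy = freq_xy / n                            # 3/5 = 0.6
confidence = freq_xy / freq_x                       # 3/4 = 0.75
lift = support_xy / ((freq_x / n) * (freq_y / n))   # 0.6 / 0.64 = 0.9375 (< 1)
print(support_xy, confidence, round(lift, 2))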
