Unit-3 Artificial Intelligence
Machine learning (ML) is the study of computer algorithms that can improve
automatically through experience and using data. It is seen as a part of artificial
intelligence. Machine learning algorithms build a model based on sample data,
known as training data, to make predictions or decisions without being explicitly
programmed to do so. Machine learning algorithms are used in a wide variety of
applications, such as in medicine, email filtering, speech recognition, and computer
vision, where it is difficult or unfeasible to develop conventional algorithms to
perform the needed tasks.
History:
The term machine learning was coined in 1959 by Arthur Samuel, an American
IBMer and pioneer in the field of computer gaming and artificial intelligence. The
synonym self-teaching computers was also used at this time. A representative book
of machine learning research during the 1960s was Nilsson's book on Learning
Machines, dealing mostly with machine learning for pattern classification.
Interest related to pattern recognition continued into the 1970s, as described by
Duda and Hart in 1973. In 1981 a report was given on using teaching strategies so
that a neural network learns to recognize 40 characters (26 letters, 10 digits, and 4
special symbols) from a computer terminal.
Tom M. Mitchell provided a widely quoted, more formal definition of the algorithms
studied in the machine learning field: “A computer program is said to learn from
experience E with respect to some class of tasks T and performance measure P if its
performance at tasks in T, as measured by P, improves with experience E.” This
definition of the tasks in which machine learning is concerned offers a fundamentally
operational definition rather than defining the field in cognitive terms. This follows
Alan Turing’s proposal in his paper “Computing Machinery and Intelligence”, in which
the question “Can machines think?” is replaced with the question “Can machines do
what we (as thinking entities) can do?”.
Modern-day machine learning has two objectives: one is to classify data based on
models which have been developed; the other is to make predictions for future
outcomes based on these models. A hypothetical algorithm specific to classifying
data may use computer vision of moles coupled with supervised learning to train it
to classify cancerous moles, whereas a machine learning algorithm for stock trading
may inform the trader of potential future outcomes.
Artificial intelligence
As a scientific endeavor, machine learning grew out of the quest for artificial
intelligence. In the early days of AI as an academic discipline, some researchers
were interested in having machines learn from data. They attempted to approach
the problem with various symbolic methods, as well as what was then termed
“neural networks”; these were mostly perceptrons and other models that were later
found to be reinventions of the generalized linear models of statistics. Probabilistic
reasoning was also employed, especially in automated medical diagnosis.
Data mining
Machine learning and data mining often employ the same methods and overlap
significantly, but while machine learning focuses on prediction, based on known
properties learned from the training data, data mining focuses on the discovery of
(previously) unknown properties in the data (this is the analysis step of knowledge
discovery in databases). Data mining uses many machine learning methods, but with
different goals; on the other hand, machine learning also employs data mining
methods such as “Unsupervised Learning” or as a pre-processing step to improve
learner accuracy. Much of the confusion between these two research communities
(which do often have separate conferences and separate journals, ECML PKDD being
a major exception) comes from the basic assumptions they work with: in machine
learning, performance is usually evaluated with respect to the ability to reproduce
known knowledge, while in knowledge discovery and data mining (KDD) the key task
is the discovery of previously unknown knowledge. Evaluated with respect to known
knowledge, an uninformed (unsupervised) method will easily be outperformed by
other supervised methods, while in a typical KDD task, supervised methods cannot
be used due to the unavailability of training data.
Optimization
Machine learning also has intimate ties to optimization: many learning problems are
formulated as minimization of some loss function on a training set of examples. Loss
functions express the discrepancy between the predictions of the model being
trained and the actual problem instances (for example, in classification, one wants to
assign a label to instances, and models are trained to correctly predict the pre-
assigned labels of a set of examples).
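To make this concrete, here is a minimal sketch (not from the original text) that minimizes a squared-error loss over a tiny, made-up training set using plain gradient descent; the data and learning rate are illustrative assumptions.

```python
import numpy as np

# Toy training set: inputs x and pre-assigned labels y (assumed for illustration).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])   # roughly y = 2x + 1

w, b = 0.0, 0.0          # model parameters
lr = 0.05                # learning rate

for step in range(2000):
    pred = w * x + b                     # predictions of the model being trained
    loss = np.mean((pred - y) ** 2)      # squared-error loss on the training set
    # Gradients of the loss with respect to w and b.
    grad_w = np.mean(2 * (pred - y) * x)
    grad_b = np.mean(2 * (pred - y))
    w -= lr * grad_w
    b -= lr * grad_b

print(f"learned w={w:.2f}, b={b:.2f}, final training loss={loss:.4f}")
```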
Generalization
The difference between optimization and machine learning arises from the goal of
generalization: while optimization algorithms can minimize the loss on a training set,
machine learning is concerned with minimizing the loss on unseen samples.
Characterizing the generalization of various learning algorithms is an active topic of
current research, especially for deep learning algorithms.
Statistics
Machine learning and statistics are closely related fields in terms of methods, but
distinct in their principal goal: statistics draw population inferences from a sample,
while machine learning finds generalizable predictive patterns. According to Michael
I. Jordan, the ideas of machine learning, from methodological principles to
theoretical tools, have had a long pre-history in statistics. He also suggested the
term data science as a placeholder to call the overall field.
Leo Breiman distinguished two statistical modeling paradigms: data model and
algorithmic model, wherein “algorithmic model” means more or less the machine
learning algorithms like Random Forest.
Approaches
Machine learning approaches are traditionally divided into four broad categories,
depending on the nature of the “signal” or “feedback” available to the learning
system: supervised learning, unsupervised learning, semi-supervised learning, and
reinforcement learning.
1. Supervised Learning
The main goal of the supervised learning technique is to map the input
variable (x) with the output variable (y). Some real-world applications of supervised
learning are Risk Assessment, Fraud Detection, Spam filtering, etc.
The computer is presented with example inputs and their desired outputs,
given by a “teacher”, and the goal is to learn a general rule that maps
inputs to outputs.
o Classification
o Regression
a) Classification
Classification algorithms are used to solve classification problems in which the
output variable is categorical, such as "Yes" or "No", "Male" or "Female", "Red" or "Blue",
etc. The classification algorithms predict the categories present in the dataset. Some
real-world examples of classification algorithms are Spam Detection, Email
filtering, etc.
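As a hedged illustration, the following sketch trains a simple classifier on synthetic binary-labelled data using scikit-learn (a library mentioned later in this unit); the generated dataset merely stands in for something like a spam corpus.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic binary-classification data standing in for e.g. spam / not-spam.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Learn a general rule from the labelled examples, then check it on held-out data.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```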
b) Regression
Regression algorithms are used to solve regression problems in which the output
variable is a continuous value, such as price, salary, or temperature.
Advantages:
o Since supervised learning works with a labelled dataset, we can have an exact idea
about the classes of objects.
o These algorithms are helpful in predicting the output based on prior experience.
Disadvantages:
Applications of Supervised Learning
o Image Segmentation:
Supervised Learning algorithms are used in image segmentation. In this process,
image classification is performed on different image data with pre-defined labels.
o Medical Diagnosis:
Supervised algorithms are also used in the medical field for diagnosis purposes. It is
done by using medical images and past labelled data with labels for disease
conditions. With such a process, the machine can identify a disease for the new
patients.
o Fraud Detection - Supervised Learning classification algorithms are used for
identifying fraud transactions, fraud customers, etc. It is done by using historic data
to identify the patterns that can lead to possible fraud.
o Spam detection - In spam detection & filtering, classification algorithms are used.
These algorithms classify an email as spam or not spam. The spam emails are sent to
the spam folder.
o Speech Recognition - Supervised learning algorithms are also used in speech
recognition. The algorithm is trained with voice data, and various identifications can
be done using the same, such as voice-activated passwords, voice commands, etc.
2. Unsupervised Learning
In unsupervised learning, the models are trained with data that is neither
classified nor labelled, and the model acts on that data without any supervision.
The machine discovers patterns and differences on its own, such as colour and
shape differences, and predicts the output when it is tested with the test
dataset.
No labels are given to the learning algorithm, leaving it on its own to find structure
in its input. Unsupervised learning can be a goal in itself (discovering hidden
patterns in data) or a means towards an end (feature learning).
Categories of Unsupervised Machine Learning
Unsupervised Learning can be further classified into two types, which are given
below:
o Clustering
o Association
1) Clustering
The clustering technique is used when we want to find the inherent groups from the
data. It is a way to group the objects into a cluster such that the objects with the
most similarities remain in one group and have fewer or no similarities with the
objects of other groups. An example of the clustering algorithm is grouping the
customers by their purchasing behaviour.
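A minimal sketch of this idea, assuming made-up purchasing-behaviour features and using scikit-learn's KMeans:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical purchasing-behaviour features: [annual spend, visits per month].
customers = np.array([[200, 2], [220, 3], [800, 12], [760, 10], [90, 1], [850, 11]])

# Group the customers into two clusters of similar behaviour.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print("cluster label for each customer:", kmeans.labels_)
print("cluster centres:", kmeans.cluster_centers_)
```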
2) Association
Association rule learning finds interesting relationships (dependencies) among the
variables of a large dataset; a typical example is market basket analysis, which
identifies items that customers frequently buy together.
Disadvantages:
o The output of an unsupervised algorithm can be less accurate, as the dataset is not
labelled and the algorithms are not trained with the exact output in advance.
o Working with Unsupervised learning is more difficult as it works with the unlabelled
dataset that does not map with the output.
3. Semi-Supervised Learning
Semi-Supervised learning is a type of Machine Learning algorithm that lies
between Supervised and Unsupervised machine learning. It represents the
intermediate ground between Supervised (With Labelled training data) and
Unsupervised learning (with no labelled training data) algorithms and uses the
combination of labelled and unlabelled datasets during the training period.
Disadvantages:
4. Reinforcement Learning
Reinforcement learning works on a feedback-based process, in which an AI
agent (a software component) automatically explores its surroundings by hit and
trial: acting, learning from experiences, and improving its performance. The agent
gets rewarded for each good action and punished for each bad action; hence the
goal of a reinforcement learning agent is to maximize the rewards.
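As a rough illustration of this reward-and-punishment loop (a toy sketch, not a production algorithm), the following tabular Q-learning example uses an assumed five-state corridor environment in which reaching the last state yields a reward:

```python
import random

# Toy environment (assumed for illustration): 5 states in a row; action 0 = left, 1 = right.
# Reaching state 4 gives a reward of +1; every other step gives 0.
N_STATES, GOAL = 5, 4
alpha, gamma, epsilon = 0.5, 0.9, 0.1          # learning rate, discount, exploration rate
Q = [[0.0, 0.0] for _ in range(N_STATES)]      # Q-table: expected reward per state-action

for episode in range(200):
    s = 0
    while s != GOAL:
        # Explore occasionally, otherwise act greedily (exploit what was learned).
        a = random.randrange(2) if random.random() < epsilon else max((0, 1), key=lambda act: Q[s][act])
        s_next = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
        r = 1.0 if s_next == GOAL else 0.0     # reward feedback from the environment
        # Q-learning update: move Q towards reward + discounted best future value.
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next

print("learned Q-values:", [[round(q, 2) for q in pair] for pair in Q])
```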
Due to its way of working, reinforcement learning is employed in different fields such
as game theory, Operations Research, information theory, and multi-agent systems.
Disadvantage
The curse of dimensionality limits reinforcement learning for real physical systems.
A machine learning framework allows enterprises to deploy, manage, and scale their
machine learning portfolio. Algorithmia is the fastest route to deployment and makes
it easy to securely govern machine learning operations with a healthy ML lifecycle.
With Algorithmia, you can connect your data and pre-trained models, deploy and
serve as APIs, manage your models and monitor performance, and secure your
machine learning portfolio as it scales.
Connectivity
A flexible machine learning framework connects to all necessary data sources in one
secure, central location for reusable, repeatable, and collaborative model
management.
Manage source code by pushing models into production directly from the code
repository
Control data access by running models close to connectors and data sources for
optimal security
Deploy models from wherever they are with seamless infrastructure management
Deployment
Machine learning models only achieve value once they reach production. Efficient
deployment capabilities reduce the time it takes your organization to get a return on
your ML investment.
Deploy in any language and any format with flexible tooling capabilities.
Serve models with a git push to a highly scalable API in seconds.
Version models automatically with a framework that compares and updates models
while maintaining a dependable version for calls.
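Algorithmia's own tooling is not shown here; purely as a sketch of the general idea of serving a trained model behind an API, the example below wraps a small scikit-learn model in a FastAPI endpoint (FastAPI, the route name, and the request schema are all assumptions for illustration).

```python
# pip install fastapi uvicorn scikit-learn  (assumed dependencies)
from fastapi import FastAPI
from pydantic import BaseModel
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Train a small placeholder model at startup; a real service would load a saved, versioned model.
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

app = FastAPI()

class Features(BaseModel):
    values: list[float]        # one row of input features

@app.post("/predict")
def predict(features: Features):
    pred = model.predict([features.values])[0]
    return {"prediction": int(pred)}

# Run with: uvicorn app:app --reload   (assuming this file is named app.py)
```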
Management
Manage MLOps using access controls and governance features that secure and audit
the machine learning models you have in production.
Split machine learning workflows into reusable, independent parts and pipeline them
together with a microservices architecture.
Operate your ML portfolio from one, secure location to prevent work silos with a
robust ML management system.
Protect your models with access control.
Usage reporting allows you to gain full visibility into server use, model consumption,
and call details to control costs.
Scaling
Serverless scaling allows you to scale models on demand without latency concerns,
providing CPU and GPU support.
Reduce data security vulnerabilities by access controlling your model management
system.
Govern models and test model performance for speed, accuracy, and drift
Multi-cloud flexibility provides the options to deploy on Algorithmia, the cloud, or on-
prem to keep models near data sources.
TensorFlow and PyTorch are direct competitors because of their similarity. They both
provide a rich set of linear algebra tools, and they can run regression analysis.
Scikit-learn has been around a long time and would be most familiar to R
programmers, but it comes with a big caveat: it is not built to run across a cluster.
Spark ML is built for running on a cluster, since that is what Apache Spark is all
about.
Data Cleaning: Data cleaning is defined as the removal of noisy and irrelevant data
from the collection.
Data Selection: Data selection is defined as the process where data relevant to the
analysis is decided and retrieved from the data collection.
Data selection using Neural network.
Data selection using Decision Trees.
Data selection using Naive bayes.
Data selection using Clustering, Regression, etc.
Data Mining: Data mining is defined as clever techniques that are applied to
extract patterns potentially useful.
Generate reports.
Generate tables.
Generate discriminant rules, classification rules, characterization rules, etc.
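A small, hypothetical example of the data cleaning and data selection steps above using pandas; the column names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical raw data with noise and missing values (column names are made up).
df = pd.DataFrame({
    "age": [25, 31, None, 47, 200],        # 200 is an implausible outlier
    "income": [30000, 42000, 38000, None, 52000],
    "clicked": [0, 1, 0, 1, 1],
})

# Data cleaning: drop rows with missing values and remove the implausible outlier.
clean = df.dropna()
clean = clean[clean["age"] < 120]

# Data selection: keep only the columns relevant to the analysis.
selected = clean[["age", "clicked"]]
print(selected)
```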
Steps to follow
To solve a given problem of supervised learning, one has to perform the following
steps:
Determine the type of training examples. Before doing anything else, the user
should decide what kind of data is to be used as a training set. In the case of
handwriting analysis, for example, this might be a single handwritten character, an
entire handwritten word, an entire sentence of handwriting or perhaps a full
paragraph of handwriting.
Gather a training set. The training set needs to be representative of the real-world
use of the function. Thus, a set of input objects is gathered and corresponding
outputs are also gathered, either from human experts or from measurements.
Determine the input feature representation of the learned function. The
accuracy of the learned function depends strongly on how the input object is
represented. Typically, the input object is transformed into a feature vector, which
contains a number of features that are descriptive of the object. The number of
features should not be too large, because of the curse of dimensionality; but should
contain enough information to accurately predict the output.
Determine the structure of the learned function and corresponding
learning algorithm. For example, the engineer may choose to use support-vector
machines or decision trees.
Complete the design. Run the learning algorithm on the gathered training set.
Some supervised learning algorithms require the user to determine certain control
parameters. These parameters may be adjusted by optimizing performance on a
subset (called a validation set) of the training set, or via cross-validation.
Evaluate the accuracy of the learned function. After parameter adjustment and
learning, the performance of the resulting function should be measured on a test set
that is separate from the training set.
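A compact sketch of these steps using scikit-learn, with the handwritten-digits dataset standing in for the training examples and a support-vector machine as the chosen structure; the parameter grid is an illustrative assumption.

```python
from sklearn.datasets import load_digits                     # gathered training set (handwritten digits)
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC                                  # chosen structure: support-vector machine
from sklearn.metrics import accuracy_score

# Feature representation: each digit image is already flattened into a 64-value feature vector.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Adjust control parameters (here, C and gamma) via cross-validation on the training set.
search = GridSearchCV(SVC(), {"C": [1, 10], "gamma": ["scale", 0.001]}, cv=3)
search.fit(X_train, y_train)

# Evaluate the learned function on a separate test set.
print("best parameters:", search.best_params_)
print("test accuracy:", accuracy_score(y_test, search.predict(X_test)))
```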
Factors to consider
Other factors to consider when choosing and applying a learning algorithm include
the following:
When considering a new application, the engineer can compare multiple learning
algorithms and experimentally determine which one works best on the problem at
hand (see cross validation). Tuning the performance of a learning algorithm can be
very time-consuming. Given fixed resources, it is often better to spend more time
collecting additional training data and more informative features than it is to spend
extra time tuning the learning algorithms.
Regression is a statistical measurement used in finance, investing and other disciplines that
attempts to determine the strength of the relationship between one dependent variable
(usually denoted by Y) and a series of other changing variables (known as independent
variables).
Regression helps investment and financial managers to value assets and understand the
relationships between variables, such as commodity prices and the stocks of businesses
dealing in those commodities.
Regression Explained
The two basic types of regression are linear regression and multiple linear regressions,
although there are non-linear regression methods for more complicated data and analysis.
Linear regression uses one independent variable to explain or predict the outcome of the
dependent variable Y, while multiple regressions use two or more independent variables to
predict the outcome.
Regression can help finance and investment professionals as well as professionals in other
businesses. Regression can also help predict sales for a company based on weather, previous
sales, GDP growth or other types of conditions. The capital asset pricing model (CAPM) is
an often-used regression model in finance for pricing assets and discovering costs of capital.
The general form of each type of regression is:
Linear regression: Y = a + bX + u
Multiple regression: Y = a + b1X1 + b2X2 + b3X3 + ... + btXt + u
Where:
Y = the variable that you are trying to predict (dependent variable).
X = the variable(s) that you are using to predict Y (independent variables).
a = the intercept.
b = the slope.
u = the regression residual.
Regression takes a group of random variables, thought to be predicting Y, and tries to find a
mathematical relationship between them. This relationship is typically in the form of a
straight line (linear regression) that best approximates all the individual data points. In
multiple regression, the separate variables are differentiated by using numbers with
subscripts.
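As a hedged example, the sketch below fits a simple linear regression with scikit-learn on invented commodity-price data and reads off the intercept a and slope b:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: a commodity price (X) and a related stock price (Y).
X = np.array([[10.0], [12.0], [15.0], [18.0], [21.0]])
Y = np.array([52.0, 58.0, 67.0, 75.0, 84.0])

# Fit the best-approximating straight line Y = a + bX.
model = LinearRegression().fit(X, Y)
print("intercept a =", round(model.intercept_, 2))
print("slope b =", round(model.coef_[0], 2))
print("predicted Y at X=20:", round(model.predict([[20.0]])[0], 2))
```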
ASSUMPTIONS IN REGRESSION
REGRESSION LINE
Definition: The Regression Line is the line that best fits the data, such that the overall
distance from the line to the points (variable values) plotted on a graph is the smallest. In
other words, a line used to minimize the squared deviations of predictions is called as the
regression line.
There are as many regression lines as there are variables. Suppose we take two variables,
say X and Y, then there will be two regression lines:
Regression line of Y on X: This gives the most probable values of Y from the given
values of X.
Regression line of X on Y: This gives the most probable values of X from the given
values of Y.
The algebraic expression of these regression lines is called as Regression Equations. There
will be two regression equations for the two regression lines.
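Written out in the standard textbook form (assuming r is the correlation coefficient and σ_X, σ_Y the standard deviations of X and Y), the two regression equations are:

```latex
\text{Regression of } Y \text{ on } X:\quad Y - \bar{Y} = r\,\frac{\sigma_Y}{\sigma_X}\,(X - \bar{X})
\qquad
\text{Regression of } X \text{ on } Y:\quad X - \bar{X} = r\,\frac{\sigma_X}{\sigma_Y}\,(Y - \bar{Y})
```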
The correlation between the variables depends on the distance between these two regression
lines, such as the nearer the regression lines to each other the higher is the degree of
correlation, and the farther the regression lines to each other the lesser is the degree of
correlation.
The correlation is said to be either perfect positive or perfect negative when the two
regression lines coincide, i.e. only one line exists. In case, the variables are independent; then
the correlation will be zero, and the lines of regression will be at right angles, i.e. parallel to
the X axis and Y axis.
Note: The regression lines cut each other at the point of average of X and Y. This means,
from the point where the lines intersect each other the perpendicular is drawn on the X axis
we will get the mean value of X. Similarly, if the horizontal line is drawn on the Y axis we
will get the mean value of Y.
Regression
R Square/Adjusted R Square
Advantages of MSE
The graph of MSE is differentiable, so you can easily use it as a loss function.
Disadvantages of MSE
The value you get after calculating MSE is in squared units of the output. For
example, if the output variable is in meters (m), then after calculating MSE the
output we get is in meters squared.
If you have outliers in the dataset, then MSE penalizes the outliers most and the
calculated MSE is bigger. So, in short, it is not robust to outliers, which is an
advantage of MAE.
While R Square is a relative measure of how well the model fits dependent variables,
Mean Square Error is an absolute measure of the goodness for the fit.
MSE is calculated by the sum of square of prediction error which is real output minus
predicted output and then divide by the number of data points. It gives you an
absolute number on how much your predicted results deviate from the actual
number. You cannot interpret many insights from one single result, but it gives you
a real number to compare against other model results and help you select the best
regression model.
Root Mean Square Error(RMSE) is the square root of MSE. It is used more commonly
than MSE because firstly sometimes MSE value can be too big to compare easily.
Secondly, MSE is calculated by the square of error, and thus square root brings it
back to the same level of prediction error and makes it easier for interpretation.
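A small numerical sketch of MSE, RMSE, and MAE on invented actual/predicted values:

```python
import numpy as np

# Hypothetical actual vs. predicted values from a regression model.
y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 11.0])

mse = np.mean((y_true - y_pred) ** 2)         # Mean Square Error (squared units)
rmse = np.sqrt(mse)                           # Root Mean Square Error (original units)
mae = np.mean(np.abs(y_true - y_pred))        # Mean Absolute Error (original units, robust to outliers)

print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}")
```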
Advantages of RMSE
The output value you get is in the same unit as the required output variable which
makes interpretation of loss easy.
Disadvantages of RMSE
Advantages of MAE
The MAE you get is in the same unit as the output variable.
It is most Robust to outliers.
Disadvantages of MAE
Multivariate regression
Non-Linear Regression
Y ≈ f(x, β)
Systematic error may be present in the independent variables but its treatment is
outside the scope of regression analysis. If the independent variables are not error-
free, this is an errors-in-variables model, also outside this scope.
The goal of the model is to make the sum of the squares as small as possible. The
sum of squares is a measure that tracks how far the Y observations vary from the
nonlinear (curved) function that is used to predict Y.
It is computed by first finding the difference between the fitted nonlinear function
and every Y point of data in the set. Then, each of those differences is squared.
Lastly, all of the squared figures are added together. The smaller the sum of these
squared figures, the better the function fits the data points in the set. Nonlinear
regression uses logarithmic functions, trigonometric functions, exponential functions,
power functions, Lorenz curves, Gaussian functions, and other fitting methods.
Nonlinear regression modeling is like linear regression modeling in that both seek to
track a particular response from a set of variables graphically. Nonlinear models are
more complicated than linear models to develop because the function is created
through a series of approximations (iterations) that may stem from trial-and-error.
Mathematicians use several established methods, such as the Gauss-Newton method
and the Levenberg-Marquardt method.
Often, regression models that appear nonlinear upon first glance are actually linear.
The curve estimation procedure can be used to identify the nature of the functional
relationships at play in your data, so you can choose the correct regression model,
whether linear or nonlinear. Linear regression models, while they typically form a
straight line, can also form curves, depending on the form of the linear regression
equation. Likewise, it’s possible to use algebra to transform a nonlinear equation so
that it mimics a linear equation; such a nonlinear equation is referred to as
“intrinsically linear.”
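As a brief, hedged example of nonlinear least squares, the sketch below fits an assumed exponential model y = a·e^(bx) to synthetic noisy data with SciPy's curve_fit, which performs the iterative approximation described above:

```python
import numpy as np
from scipy.optimize import curve_fit

# Assumed nonlinear model: exponential growth y = a * exp(b * x).
def model(x, a, b):
    return a * np.exp(b * x)

# Synthetic noisy observations (made up for illustration).
x = np.linspace(0, 4, 30)
y = 2.0 * np.exp(0.8 * x) + np.random.default_rng(0).normal(0, 1.0, x.size)

# Iteratively estimate a and b by minimizing the sum of squared residuals.
params, _ = curve_fit(model, x, y, p0=(1.0, 0.5))
print("estimated a, b:", np.round(params, 2))
```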
Artificial Neural Networks
Training
Neural networks learn (or are trained) by processing examples, each of which
contains a known “input” and “result,” forming probability-weighted associations
between the two, which are stored within the data structure of the net itself. The
training of a neural network from a given example is usually conducted by
determining the difference between the processed output of the network (often a
prediction) and a target output. This difference is the error. The network then
adjusts its weighted associations according to a learning rule and using this error
value. Successive adjustments will cause the neural network to produce output
which is increasingly similar to the target output. After a sufficient number of these
adjustments the training can be terminated based upon certain criteria. This is
known as supervised learning.
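A toy sketch of this supervised training loop: a single sigmoid neuron learning the logical OR function. The data, learning rate, and number of epochs are illustrative assumptions.

```python
import numpy as np

# Tiny supervised training set (logical OR, used only for illustration).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
targets = np.array([0, 1, 1, 1], dtype=float)

rng = np.random.default_rng(0)
w = rng.normal(size=2)      # weighted associations (weights)
b = 0.0                     # bias
lr = 0.5

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for epoch in range(2000):
    out = sigmoid(X @ w + b)            # processed output of the network
    error = out - targets               # difference from the target output
    # Learning rule: adjust the weights and bias in proportion to the error
    # (gradient of the squared error through the sigmoid).
    grad = error * out * (1 - out)
    w -= lr * (X.T @ grad) / len(X)
    b -= lr * grad.mean()

print("predictions after training:", np.round(sigmoid(X @ w + b), 2))
```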
Input Layer:
As the name suggests, it accepts inputs in several different formats provided by the
programmer.
Hidden Layer:
The hidden layer lies between the input and output layers. It performs all the
calculations to find hidden features and patterns.
Output Layer:
The input goes through a series of transformations using the hidden layer, which
finally results in output that is conveyed using this layer.
The artificial neural network takes input and computes the weighted sum of the
inputs and includes a bias. This computation is represented in the form of a transfer
function.
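In symbols (standard notation, with f denoting the transfer/activation function, w_i the weights, and b the bias):

```latex
y = f\left(\sum_{i=1}^{n} w_i x_i + b\right)
```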
Advantages of Artificial Neural Network
Unlike traditional programming, where data is stored in a database, the data used by
an ANN is stored on the whole network. The disappearance of a couple of pieces of
data in one place doesn’t prevent the network from working.
Artificial neural networks can perform more than one task simultaneously (parallel
processing capability).
After training, an ANN may produce output even with inadequate data; the loss of
performance here depends on the significance of the missing data.
Corruption of one or more cells of an ANN does not prevent it from generating
output, and this feature makes the network fault-tolerant.
Disadvantages of Artificial Neural Network
Unexplained behaviour of the network: This is the most significant issue of ANN.
When an ANN produces a solution, it does not provide insight concerning why and
how, which decreases trust in the network.
Hardware dependence:
Artificial neural networks need processors with parallel processing power, as per
their structure; therefore, realizing them depends on the availability of suitable
hardware.
Training reduces the network’s error to a specific value, but this value does not
guarantee optimum results.
ANNs can work only with numerical data, so problems must be converted into
numerical values before being introduced to the ANN. The representation chosen
here will directly impact the performance of the network, and it relies on the user’s
abilities.
Network design
Neural architecture search (NAS) uses machine learning to automate ANN design.
Various approaches to NAS have designed networks that compare well with hand-
designed systems. The basic search algorithm is to propose a candidate model,
evaluate it against a dataset and use the results as feedback to teach the NAS
network. Available systems include AutoML and AutoKeras.
Design issues include deciding the number, type and connectedness of network
layers, as well as the size of each and the connection type.
Hyperparameters must also be defined as part of the design (they are not learned),
governing matters such as how many neurons are in each layer, learning rate, step,
stride, depth, receptive field and padding (for CNNs), etc.
Use
The need for semi-supervised clustering arises, for example, in data sets with large
numbers of attributes where most of the attributes are not semantically relevant but
will dominate any distance metric (due to their number), used by traditional
clustering algorithms. In these cases, sparse information regarding the quality of
clusters or regarding relations between a small number of data points might be
available which could be used to guide the cluster formation.
Given the input samples, it is often not possible to cluster the data according to the
constraints in their original feature space using unmodified distance measures as
indications for similarity. Thus, we have to modify the feature space, usually by
scaling the dimensions, so that an unmodified clustering algorithm is able to cluster
based on its own distance and variance constraints. One proposed approach to this
problem first learns a policy to compute the scaling factors, using reinforcement
learning on a set of training problems, and subsequently applies the learned policy
to compute the scaling factors for new problems. The goal is that, by working on the
scaled dimensions, the traditional clustering algorithm can yield results that satisfy
the constraints.
Clustering Methods:
Density-Based Methods: These methods consider the clusters as the dense region
having some similarities and differences from the lower dense region of the space.
These methods have good accuracy and the ability to merge two clusters. Example
DBSCAN (Density-Based Spatial Clustering of Applications with Noise), OPTICS
(Ordering Points to Identify Clustering Structure), etc.
Hierarchical Based Methods: The clusters formed in this method form a tree-type
structure based on the hierarchy. New clusters are formed using the previously
formed one. It is divided into two categories.
o Agglomerative (bottom-up approach)
o Divisive (top-down approach)
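A brief sketch contrasting the two families on synthetic data, using scikit-learn's DBSCAN (density-based) and AgglomerativeClustering (bottom-up hierarchical); the data and parameters are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering

# Two well-separated blobs of points (synthetic data for illustration).
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])

density_labels = DBSCAN(eps=1.0, min_samples=3).fit_predict(pts)          # density-based
hier_labels = AgglomerativeClustering(n_clusters=2).fit_predict(pts)      # bottom-up (agglomerative)

print("DBSCAN labels:       ", density_labels)
print("Agglomerative labels:", hier_labels)
```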
Biology: It can be used for classification among different species of plants and
animals.
City Planning: It is used to make groups of houses and to study their values based
on their geographical locations and other factors present.
Clustering Algorithms:
Decision Tree
In decision analysis, a decision tree can be used to represent decisions and decision
making visually and explicitly. In data mining, a decision tree describes data (but the
resulting classification tree can be an input for decision making).
Decision tree learning is a method commonly used in data mining. The goal is to
create a model that predicts the value of a target variable based on several input
variables.
A decision tree is a simple representation for classifying examples. For this section,
assume that all of the input features have finite discrete domains, and there is a
single target feature called the “Classification”. Each element of the domain of the
classification is called a class. A decision tree or a classification tree is a tree in which
each internal (non-leaf) node is labelled with an input feature. The arcs coming from
a node labeled with an input feature are labeled with each of the possible values of
the target feature or the arc leads to a subordinate decision node on a different
input feature. Each leaf of the tree is labeled with a class or a probability distribution
over the classes, signifying that the data set has been classified by the tree into
either a specific class, or into a particular probability distribution (which, if the
decision tree is well-constructed, is skewed towards certain subsets of classes).
A tree is built by splitting the source set, constituting the root node of the tree, into
subsets which constitute the successor children. The splitting is based on a set of
splitting rules based on classification features. This process is repeated on each
derived subset in a recursive manner called recursive partitioning. The recursion is
completed when the subset at a node has all the same values of the target variable,
or when splitting no longer adds value to the predictions. This process of top-down
induction of decision trees (TDIDT) is an example of a greedy algorithm, and it is by
far the most common strategy for learning decision trees from data.
Root Node: Root node is from where the decision tree starts. It represents
the entire dataset, which further gets divided into two or more homogeneous
sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be
segregated further after getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into
sub-nodes according to the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from
the tree.
Parent/Child node: The root node of the tree is called the parent node, and
other nodes are called the child nodes.
In a decision tree, for predicting the class of the given dataset, the algorithm starts
from the root node of the tree. This algorithm compares the values of root attribute
with the record (real dataset) attribute and, based on the comparison, follows the
branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other
sub-nodes and move further. It continues the process until it reaches the leaf node
of the tree. The complete process can be better understood using the below
algorithm:
Step 1: Begin the tree with the root node, says S, which contains the complete
dataset.
Step 2: Find the best attribute in the dataset using Attribute Selection Measure
(ASM).
Step 3: Divide S into subsets that contain possible values for the best
attribute.
Step 4: Generate the decision tree node, which contains the best attribute.
Step 5: Recursively make new decision trees using the subsets of the dataset
created in Step 3. Continue this process until a stage is reached where you cannot
classify the nodes further; the final node is called a leaf node.
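A hedged illustration of these steps using scikit-learn's DecisionTreeClassifier on the Iris dataset; entropy is used as the attribute selection measure, and the depth limit is an assumption.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# A labelled dataset (Iris) stands in for "the complete dataset S" in Step 1.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion="entropy" uses information gain as the attribute selection measure (ASM).
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X_train, y_train)

print(export_text(tree))                          # the learned splits, root to leaves
print("test accuracy:", tree.score(X_test, y_test))
```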
Bayesian networks
A Bayesian network (also known as a Bayes network, Bayes net, belief network, or
decision network) is a probabilistic graphical model that represents a set of variables
and their conditional dependencies via a directed acyclic graph (DAG). Bayesian
networks are ideal for taking an event that occurred and predicting the likelihood
that any one of several possible known causes was the contributing factor. For
example, a Bayesian network could represent the probabilistic relationships between
diseases and symptoms. Given symptoms, the network can be used to compute the
probabilities of the presence of various diseases.
Formally, Bayesian networks are directed acyclic graphs (DAGs) whose nodes
represent variables in the Bayesian sense: they may be observable quantities, latent
variables, unknown parameters or hypotheses. Edges represent conditional
dependencies; nodes that are not connected (no path connects one node to
another) represent variables that are conditionally independent of each other. Each
node is associated with a probability function that takes, as input, a particular set of
values for the node’s parent variables and gives (as output) the probability (or
probability distribution, if applicable) of the variable represented by the node.
Bayesian networks satisfy the local Markov property, which states that a node is
conditionally independent of its non-descendants given its parents. In the classic
Cloudy-Sprinkler-Rain example, this means that P(Sprinkler | Cloudy, Rain) =
P(Sprinkler | Cloudy), since Sprinkler is conditionally independent of its
non-descendant, Rain, given Cloudy. This property allows us to simplify the joint
distribution, obtained from the chain rule, to a smaller form.
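A small sketch of this factorization for the Cloudy-Sprinkler-Rain structure; the probability numbers are illustrative assumptions, not values from the text.

```python
# Hand-coded conditional probability tables for the classic Cloudy -> Sprinkler, Cloudy -> Rain
# structure (the numbers are illustrative assumptions).
P_C = {True: 0.5, False: 0.5}                       # P(Cloudy)
P_S_given_C = {True: 0.1, False: 0.5}               # P(Sprinkler=True | Cloudy)
P_R_given_C = {True: 0.8, False: 0.2}               # P(Rain=True | Cloudy)

def joint(c, s, r):
    """P(C, S, R) = P(C) * P(S|C) * P(R|C), using the local Markov property."""
    p_s = P_S_given_C[c] if s else 1 - P_S_given_C[c]
    p_r = P_R_given_C[c] if r else 1 - P_R_given_C[c]
    return P_C[c] * p_s * p_r

# The joint probabilities over all 8 assignments sum to 1, as they must.
total = sum(joint(c, s, r) for c in (True, False) for s in (True, False) for r in (True, False))
print("P(Cloudy, Sprinkler, Rain) =", joint(True, True, True))
print("sum over all assignments  =", total)
```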
Support Vector Machine (SVM)
When data are unlabeled, supervised learning is not possible, and an unsupervised
learning approach is required, which attempts to find natural clustering of the data to
groups, and then map new data to these formed groups. The support-vector
clustering algorithm, created by Hava Siegelmann and Vladimir Vapnik, applies the
statistics of support vectors, developed in the support vector machines algorithm, to
categorize unlabeled data, and is one of the most widely used clustering algorithms
in industrial applications.
Motivation
Classifying data is a common task in machine learning. Suppose some given data
points each belong to one of two classes, and the goal is to decide which class a
new data point will be in. In the case of support-vector machines, a data point is
viewed as a p-dimensional vector (a list of p numbers), and we want to know whether
we can separate such points with a (p-1)-dimensional hyperplane. This is called a
linear classifier. There are many hyperplanes that might classify the data. One
reasonable choice as the best hyperplane is the one that represents the largest
separation, or margin, between the two classes. So, we choose the hyperplane so
that the distance from it to the nearest data point on each side is maximized. If such
a hyperplane exists, it is known as the maximum-margin hyperplane and the linear
classifier it defines is known as a maximum-margin classifier; or equivalently, the
perceptron of optimal stability.
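As a minimal sketch, the example below fits a linear (maximum-margin style) SVM on synthetic, well-separated 2-D data with scikit-learn; the large C value approximating a hard margin is an assumption.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two linearly separable classes of 2-dimensional points (p = 2), so the separating
# hyperplane is a (p-1)-dimensional line.
X, y = make_blobs(n_samples=60, centers=2, random_state=0, cluster_std=0.8)

clf = SVC(kernel="linear", C=1e3).fit(X, y)     # large C approximates a hard margin
print("hyperplane weights w:", clf.coef_[0])
print("intercept b:", clf.intercept_[0])
print("number of support vectors per class:", clf.n_support_)
```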
Applications
SVMs are helpful in text and hypertext categorization, as their application can
significantly reduce the need for labeled training instances in both the
standard inductive and transductive settings. Some methods for shallow
semantic parsing are based on support vector machines.
Classification of images can also be performed using SVMs. Experimental
results show that SVMs achieve significantly higher search accuracy than
traditional query refinement schemes after just three to four rounds of
relevance feedback. This is also true for image segmentation systems,
including those using a modified version of SVM that uses the privileged
approach as suggested by Vapnik.
Classification of satellite data like SAR data using supervised SVM.
Hand-written characters can be recognized using SVM.
The SVM algorithm has been widely applied in the biological and other
sciences. They have been used to classify proteins with up to 90% of the
compounds classified correctly. Permutation tests based on SVM weights
have been suggested as a mechanism for interpretation of SVM models.
Support-vector machine weights have also been used to interpret SVM
models in the past. Posthoc interpretation of support-vector machine models
in order to identify features used by the model to make predictions is a
relatively new area of research with special significance in the biological
sciences.
Genetic Algorithm
In computer science and operations research, a genetic algorithm (GA) is a
metaheuristic inspired by the process of natural selection that belongs to the larger
class of evolutionary algorithms (EA). Genetic algorithms are commonly used to
generate high-quality solutions to optimization and search problems by relying on
biologically inspired operators such as mutation, crossover and selection. Some
examples of GA applications include optimizing decision trees for better
performance, solving sudoku puzzles automatically, hyperparameter optimization, etc.
Optimization problems
Once the genetic representation and the fitness function are defined, a GA proceeds
to initialize a population of solutions and then to improve it through repetitive
application of the mutation, crossover, inversion and selection operators.
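A toy sketch of this loop (selection, crossover, mutation) on the simple "count the 1-bits" fitness function; the population size, mutation rate, and the fitness function itself are illustrative assumptions.

```python
import random

random.seed(0)
GENOME_LEN, POP_SIZE, GENERATIONS, MUT_RATE = 20, 30, 60, 0.02

def fitness(genome):                 # toy fitness: count of 1-bits (the "OneMax" problem)
    return sum(genome)

def crossover(a, b):                 # single-point crossover
    cut = random.randrange(1, GENOME_LEN)
    return a[:cut] + b[cut:]

def mutate(genome):                  # flip each bit with a small probability
    return [1 - g if random.random() < MUT_RATE else g for g in genome]

# Initialize a random population of candidate solutions.
population = [[random.randint(0, 1) for _ in range(GENOME_LEN)] for _ in range(POP_SIZE)]

for gen in range(GENERATIONS):
    # Selection: keep the fitter half as parents.
    population.sort(key=fitness, reverse=True)
    parents = population[: POP_SIZE // 2]
    # Produce children via crossover and mutation until the population is refilled.
    children = [mutate(crossover(*random.sample(parents, 2))) for _ in range(POP_SIZE - len(parents))]
    population = parents + children

print("best fitness found:", fitness(max(population, key=fitness)), "of", GENOME_LEN)
```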
Limitations
Repeated fitness function evaluation for complex problems is often the most
prohibitive and limiting segment of artificial evolutionary algorithms. Finding
the optimal solution to complex high-dimensional, multimodal problems often
requires very expensive fitness function evaluations. In real world problems
such as structural optimization problems, a single function evaluation may
require several hours to several days of complete simulation. Typical
optimization methods cannot deal with such types of problem. In this case, it
may be necessary to forgo an exact evaluation and use an approximated
fitness that is computationally efficient. It is apparent that amalgamation of
approximate models may be one of the most promising approaches to
convincingly use GA to solve complex real life problems.
Genetic algorithms do not scale well with complexity. That is, where the
number of elements which are exposed to mutation is large there is often an
exponential increase in search space size. This makes it extremely difficult to
use the technique on problems such as designing an engine, a house or a
plane. In order to make such problems tractable to evolutionary search, they
must be broken down into the simplest representation possible. Hence we
typically see evolutionary algorithms encoding designs for fan blades instead
of engines, building shapes instead of detailed construction plans, and airfoils
instead of whole aircraft designs. The second problem of complexity is the
issue of how to protect parts that have evolved to represent good solutions
from further destructive mutation, particularly when their fitness assessment
requires them to combine well with other parts.
The “better” solution is only in comparison to other solutions. As a result, the
stop criterion is not clear in every problem.
In many problems, GAs have a tendency to converge towards local optima or
even arbitrary points rather than the global optimum of the problem. This
means that it does not “know how” to sacrifice short-term fitness to gain
longer-term fitness. The likelihood of this occurring depends on the shape of
the fitness landscape: certain problems may provide an easy ascent towards a
global optimum; others may make it easier for the function to find the local
optima. This problem may be alleviated by using a different fitness function,
increasing the rate of mutation, or by using selection techniques that maintain
a diverse population of solutions, although the No Free Lunch theorem proves
that there is no general solution to this problem. A common technique to
maintain diversity is to impose a “niche penalty”, wherein, any group of
individuals of sufficient similarity (niche radius) have a penalty added, which
will reduce the representation of that group in subsequent generations,
permitting other (less similar) individuals to be maintained in the population.
This trick, however, may not be effective, depending on the landscape of the
problem. Another possible technique would be to simply replace part of the
population with randomly generated individuals when most of the population
is too similar to each other. Diversity is important in genetic algorithms (and
genetic programming) because crossing over a homogeneous population does
not yield new solutions. In evolution strategies and evolutionary
programming, diversity is not essential because of a greater reliance on
mutation.
Operating on dynamic data sets is difficult, as genomes begin to converge
early on towards solutions which may no longer be valid for later data.
Several methods have been proposed to remedy this by increasing genetic
diversity somehow and preventing early convergence, either by increasing the
probability of mutation when the solution quality drops (called triggered
hypermutation), or by occasionally introducing entirely new, randomly
generated elements into the gene pool (called random immigrants). Again,
evolution strategies and evolutionary programming can be implemented with
a so-called “comma strategy” in which parents are not maintained and new
parents are selected only from offspring. This can be more effective on
dynamic problems.
GAs cannot effectively solve problems in which the only fitness measure is a
single right/wrong measure (like decision problems), as there is no way to
converge on the solution (no hill to climb). In these cases, a random search
may find a solution as quickly as a GA. However, if the situation allows the
success/failure trial to be repeated giving (possibly) different results, then the
ratio of successes to failures provides a suitable fitness measure.
For specific optimization problems and problem instances, other optimization
algorithms may be more efficient than genetic algorithms in terms of speed of
convergence. Alternative and complementary algorithms include evolution
strategies, evolutionary programming, simulated annealing, Gaussian
adaptation, hill climbing, and swarm intelligence (e.g.: ant colony
optimization, particle swarm optimization) and methods based on integer
linear programming. The suitability of genetic algorithms is dependent on the
amount of knowledge of the problem; well-known problems often have
better, more specialized approaches.
This represents a trade-off for businesses. They can choose a faster response but a
potentially less accurate outcome. Or they can accept a slower response but receive
a more accurate result from the model. But these compromises aren’t all bad news.
The decision of whether to go for a higher cost and more accurate model over a
faster response comes down to the use case.
Data plays a significant role in the machine learning process. One of the significant
issues that machine learning professionals face is the absence of good quality data.
Unclean and noisy data can make the whole process extremely exhausting. We
don’t want our algorithm to make inaccurate or faulty predictions. Hence the quality
of data is essential to enhance the output. Therefore, we need to ensure that the
process of data preprocessing which includes removing outliers, filtering missing
values, and removing unwanted features, is done with the utmost level of perfection.
To make a child learn what an apple is, all it takes is for you to point to an apple and
say “apple” repeatedly; after that, the child can recognize all sorts of apples. Machine
learning is still not up to that level yet; it takes a lot of data for most of the
algorithms to function properly. For a simple task, it needs thousands of examples to
make something out of it, and for advanced tasks like image or speech recognition,
it may need lakhs (hundreds of thousands) of examples.
Overfitting refers to a machine learning model trained with a massive amount of data
that negatively affects its performance. It is like trying to fit into oversized jeans.
Unfortunately, this is one of the significant issues faced by machine learning
professionals. It means that the algorithm has been trained with noisy and biased data,
which will affect its overall performance. Let’s understand this with the help of an
example. Let’s consider a model trained to differentiate between a cat, a rabbit, a
dog, and a tiger. The training data contains 1000 cats, 1000 dogs, 1000 tigers, and
4000 Rabbits. Then there is a considerable probability that it will identify the cat as a
rabbit. In this example, we had a vast amount of data, but it was biased; hence the
prediction was negatively affected.
Machine learning models operate within specific contexts. For example, ML models
that power recommendation engines for retailers operate at a specific time when
customers are looking at certain products. However, customer needs change over
time, and that means the ML model can drift away from what it was designed to
deliver.
Models can decay for several reasons. Drift can occur when new data is introduced
to the model. This is called data drift. It can also occur when our interpretation of the
data changes. This is concept drift.
To accommodate this drift, you need a model that continuously updates and
improves itself using data that comes in. That means you need to keep checking the
model.
That requires the collection of features and labels and to react to changes so the
model can be updated and retrained. While some aspects of the retraining can be
conducted automatically, some human intervention is needed. It’s critical to
recognise that the deployment of a machine learning tool is not a one-off activity.
Machine learning tools require regular review and update to remain relevant and
continue to deliver value.
Machine learning models are part of a longer pipeline that starts with the features
that are used to train the model. Then there is the model itself, which is a piece of
software that can require modification and updates. That model requires labels so
that the results of an input can be recognized and used by the model. And there may
be a disconnect between the model and the final signal in a system.
In many cases when an unexpected outcome is delivered, it’s not the machine
learning that has broken down but some other part of the chain. For example, a
recommendation engine may have offered a product to a customer, but sometimes
the connection between the sales system and the recommendation could be broken,
and it takes time to find the bug. In this case, it would be hard to tell the model if the
recommendation was successful. Troubleshooting issues like this can be quite labor
intensive.
Slow Implementation
This is one of the common issues faced by machine learning professionals. The
machine learning models are highly efficient in providing accurate results, but it takes
a tremendous amount of time. Slow programs, data overload, and excessive
requirements usually take a lot of time to provide accurate results. Further, it requires
constant monitoring and maintenance to deliver the best output.
Data Science and Machine Learning are closely related to each other but have
different functionalities and different goals. Briefly, Data Science is a field to study
the approaches to find insights from the raw data, whereas Machine Learning is a
technique used by data scientists to enable machines to learn automatically from
past data. To understand the difference in depth, let’s first
have a brief introduction to these two technologies.
Data Acquisition: In this step, the data is acquired to solve the given problem. For
the recommendation system, we can get the ratings provided by the user for
different products, comments, purchase history, etc.
Data Processing: In this step, the raw data acquired from the previous step is
transformed into a suitable format, so that it can be easily used by the further steps.
Modeling: The data modeling is a step where machine learning algorithms are
used. So, this step includes the whole machine learning process. The machine
learning process involves importing the data, data cleaning, building a model,
training the model, testing the model, and improving the model’s efficiency.
Data Exploration: It is a step where we understand the patterns of the data, and
try to find out the useful insights from the data.
Deployment & Optimization: This is the last step where the model is deployed on
an actual project, and the performance of the model is checked.
A data scientist needs to have skills to use big data tools like Hadoop, Hive and Pig,
statistics, and programming in Python, R, or Scala.
A Machine Learning Engineer needs to have skills such as computer science
fundamentals, programming skills in Python or R, statistics and probability concepts,
etc.
K-Nearest Neighbor
In statistics, the k-nearest neighbors algorithm (k-NN) is a non-parametric
classification method first developed by Evelyn Fix and Joseph Hodges in 1951, and
later expanded by Thomas Cover. It is used for classification and regression. In both
cases, the input consists of the k closest training examples in a data set. The output
depends on whether k-NN is used for classification or regression: in classification,
the output is a class membership (the object is assigned to the class most common
among its k nearest neighbors); in regression, the output is the average of the
values of its k nearest neighbors.
k-NN is a type of classification where the function is only approximated locally, and
all computation is deferred until function evaluation. Since this algorithm relies on
distance for classification, if the features represent different physical units or come in
vastly different scales then normalizing the training data can improve its accuracy
dramatically.
Both for classification and regression, a useful technique can be to assign weights to
the contributions of the neighbors, so that the nearer neighbors contribute more to
the average than the more distant ones. For example, a common weighting scheme
consists in giving each neighbor a weight of 1/d, where d is the distance to the
neighbor.
The neighbors are taken from a set of objects for which the class (for k-NN
classification) or the object property value (for k-NN regression) is known. This can
be thought of as the training set for the algorithm, though no explicit training step is
required.
A peculiarity of the k-NN algorithm is that it is sensitive to the local structure of the
data.
Algorithm
Example of k-NN classification. The test sample (green dot) should be classified
either to blue squares or to red triangles. If k = 3 (solid line circle) it is assigned to
the red triangles because there are 2 triangles and only 1 square inside the inner
circle. If k = 5 (dashed line circle) it is assigned to the blue squares (3 squares vs. 2
triangles inside the outer circle).
The training examples are vectors in a multidimensional feature space, each with a
class label. The training phase of the algorithm consists only of storing the feature
vectors and class labels of the training samples.
A commonly used distance metric for continuous variables is Euclidean distance. For
discrete variables, such as for text classification, another metric can be used, such as
the overlap metric (or Hamming distance). In the context of gene expression
microarray data, for example, k-NN has been employed with correlation coefficients,
such as Pearson and Spearman, as a metric. Often, the classification accuracy of
k-NN can be improved significantly if the distance metric is learned with specialized
algorithms such as Large Margin Nearest Neighbor or Neighborhood components
analysis.
A drawback of the basic “majority voting” classification occurs when the class
distribution is skewed. That is, examples of a more frequent class tend to dominate
the prediction of the new example, because they tend to be common among the k
nearest neighbors due to their large number. One way to overcome this problem is
to weight the classification, considering the distance from the test point to each of
its k nearest neighbors. The class (or value, in regression problems) of each of the k
nearest points is multiplied by a weight proportional to the inverse of the distance
from that point to the test point. Another way to overcome skew is by abstraction in
data representation. For example, in a self-organizing map (SOM), each node is a
representative (a center) of a cluster of similar points, regardless of their density in
the original training data. K-NN can then be applied to the SOM.
Parameter Selection
The best choice of k depends upon the data; generally, larger values of k reduce the
effect of noise on the classification, but make boundaries between classes
less distinct. A good k can be selected by various heuristic techniques (see
hyperparameter optimization). The special case where the class is predicted to be
the class of the closest training sample (i.e. when k = 1) is called the nearest
neighbor algorithm.
The accuracy of the k-NN algorithm can be severely degraded by the presence of
noisy or irrelevant features, or if the feature scales are not consistent with their
importance. Much research effort has been put into selecting or scaling features to
improve classification. A particularly popular approach is the use of evolutionary
algorithms to optimize feature scaling. Another popular approach is to scale features
by the mutual information of the training data with the training classes.
1. Load the data.
2. Initialize K to your chosen number of neighbors.
3. For each example in the data:
3.1 Calculate the distance between the query example and the current example from
the data.
3.2 Add the distance and the index of the example to an ordered collection.
4. Sort the ordered collection of distances and indices from smallest to largest
(in ascending order) by the distances.
5. Pick the first K entries from the sorted collection.
6. Get the labels of the selected K entries.
7. If regression, return the mean of the K labels.
8. If classification, return the mode of the K labels.
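A minimal sketch implementing these steps directly in Python; the toy data and the helper name knn_predict are assumptions for illustration.

```python
import math
from collections import Counter

def knn_predict(training_data, query, k, task="classification"):
    """Follow the steps above: compute distances, sort, take the first K, then mode or mean."""
    # Step 3: for each example, compute the distance (3.1) and store (distance, label) (3.2).
    distances = [(math.dist(features, query), label) for features, label in training_data]
    # Steps 4-6: sort by distance and keep the labels of the first K entries.
    k_labels = [label for _, label in sorted(distances)[:k]]
    # Steps 7-8: mean for regression, mode for classification.
    if task == "regression":
        return sum(k_labels) / k
    return Counter(k_labels).most_common(1)[0][0]

# Toy labelled data: ([feature1, feature2], class label) -- values are made up.
data = [([1.0, 1.0], "A"), ([1.2, 0.8], "A"), ([4.0, 4.2], "B"), ([4.1, 3.9], "B")]
print(knn_predict(data, [1.1, 0.9], k=3))   # -> "A"
```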
To select the K that’s right for your data, we run the KNN algorithm several times
with different values of K and choose the K that reduces the number of errors we
encounter while maintaining the algorithm’s ability to accurately make predictions
when it’s given data it hasn’t seen before.
Advantages
Disadvantages
KNN in practice
However, provided you have sufficient computing resources to speedily handle the
data you are using to make predictions, KNN can still be useful in solving problems
that have solutions that depend on identifying similar objects. An example of this is
using the KNN algorithm in recommender systems, an application of KNN-search.
Decision Trees
Decision trees are a method for defining complex relationships by describing decisions and
avoiding problems in communication. A decision tree is a diagram that shows alternative
actions and conditions within a horizontal tree framework. Thus, it depicts which conditions to
consider first, second, and so on.
Decision trees depict the relationship of each condition and their permissible actions. A
square node indicates an action, and a circle indicates a condition. It forces analysts to
consider the sequence of decisions and identifies the actual decision that must be made.
The major limitation of a decision tree is that it lacks information in its format to describe
what other combinations of conditions you can take for testing. It is a single representation of
the relationships between conditions and actions.
Logistic Regression
In statistics, the logistic model (or logit model) is used to model the probability of a
certain class or event existing such as pass/fail, win/lose, alive/dead or healthy/sick.
This can be extended to model several classes of events such as determining
whether an image contains a cat, dog, lion, etc. Each object being detected in the
image would be assigned a probability between 0 and 1, with a sum of one.
Logistic regression is a statistical model that in its basic form uses a logistic function
to model a binary dependent variable, although many more complex extensions
exist. In regression analysis, logistic regression (or logit regression) is estimating the
parameters of a logistic model (a form of binary regression). Mathematically, a
binary logistic model has a dependent variable with two possible values, such as
pass/fail which is represented by an indicator variable, where the two values are
labeled “0” and “1”. In the logistic model, the log-odds (the logarithm of the odds)
for the value labeled “1” is a linear combination of one or more independent
variables (“predictors”); the independent variables can each be a binary variable
(two classes, coded by an indicator variable) or a continuous variable (any real
value). The corresponding probability of the value labeled “1” can vary between 0
(certainly the value “0”) and 1 (certainly the value “1”), hence the labeling; the
function that converts log-odds to probability is the logistic function, hence the
name. The unit of measurement for the log-odds scale is called a logit, from logistic
unit, hence the alternative names. Analogous models with a different sigmoid
function instead of the logistic function can also be used, such as the probit model;
the defining characteristic of the logistic model is that increasing one of the
independent variables multiplicatively scales the odds of the given outcome at a
constant rate, with each independent variable having its own parameter; for a binary
dependent variable this generalizes the odds ratio.
In a binary logistic regression model, the dependent variable has two levels
(categorical). Outputs with more than two values are modeled by multinomial
logistic regression and, if the multiple categories are ordered, by ordinal logistic
regression (for example the proportional odds ordinal logistic model). The logistic
regression model itself simply models probability of output in terms of input and
does not perform statistical classification (it is not a classifier), though it can be used
to make a classifier, for instance by choosing a cutoff value and classifying inputs
with probability greater than the cutoff as one class, below the cutoff as the other;
this is a common way to make a binary classifier. The coefficients are generally not computed by a closed-form expression, unlike linear least squares; they are usually estimated iteratively, for example by maximum likelihood. Logistic regression as a general statistical model was originally developed and popularized primarily by Joseph Berkson, beginning in Berkson (1944), where he coined “logit”.
Applications
Logistic regression is used in various fields, including machine learning, most medical
fields, and social sciences. For example, the Trauma and Injury Severity Score
(TRISS), which is widely used to predict mortality in injured patients, was originally
developed by Boyd using logistic regression. Many other medical scales used to
assess severity of a patient have been developed using logistic regression. Logistic
regression may be used to predict the risk of developing a given disease (e.g.
diabetes; coronary heart disease), based on observed characteristics of the patient
(age, sex, body mass index, results of various blood tests, etc.). Another example
might be to predict whether a Nepalese voter will vote Nepali Congress or
Communist Party of Nepal or Any Other Party, based on age, income, sex, race,
state of residence, votes in previous elections, etc. The technique can also be used
in engineering, especially for predicting the probability of failure of a given process,
system or product. It is also used in marketing applications such as prediction of a
customer’s propensity to purchase a product or halt a subscription, etc. In
economics it can be used to predict the likelihood of a person ending up in the labor
force, and a business application would be to predict the likelihood of a homeowner
defaulting on a mortgage. Conditional random fields, an extension of logistic
regression to sequential data, are used in natural language processing.
Logistic regression uses an equation as the representation, very much like linear
regression.
Input values (x) are combined linearly using weights or coefficient values (referred to as the Greek letter beta) to predict an output value (y). A key difference from linear regression is that the output value being modeled is a binary value (0 or 1) rather than a numeric value.
The actual representation of the model that you would store in memory or in a file is the set of coefficients in the equation (the beta values or b’s).
Logistic regression models the probability of the default class (e.g. the first class).
For example, if we are modeling people’s sex as male or female from their height,
then the first class could be male and the logistic regression model could be written
as the probability of male given a person’s height, or more formally:
P(sex=male|height)
Written another way, we are modeling the probability that an input (X) belongs to
the default class (Y=1), we can write this formally as:
P(X) = P(Y=1|X)
Note that the probability prediction must be transformed into a binary value (0 or 1) in order to actually make a crisp class prediction. More on this later when we talk about making predictions.
Logistic regression is a linear method, but the predictions are transformed using the
logistic function. The impact of this is that we can no longer understand the
predictions as a linear combination of the inputs as we can with linear regression, for
example, continuing on from above, the model can be stated as:
p(X) = e^(b0 + b1 * X) / (1 + e^(b0 + b1 * X))
I don’t want to dive into the math too much, but we can turn around the above equation as follows (remember we can remove the e from one side by taking the natural logarithm (ln) of both sides):
ln(p(X) / (1 - p(X))) = b0 + b1 * X
This is useful because we can see that the calculation of the output on the right is
linear again (just like linear regression), and the input on the left is a log of the
probability of the default class.
This ratio on the left is called the odds of the default class (it’s historical that we use
odds, for example, odds are used in horse racing rather than probabilities). Odds are
calculated as a ratio of the probability of the event divided by the probability of not
the event, e.g. 0.8/(1-0.8) which has the odds of 4. So, we could instead write:
ln(odds) = b0 + b1 * X
Because the odds are log transformed, we call this left-hand side the log-odds or the logit. It is possible to use other types of functions for the transform (which is out of scope here), but it is common to refer to the transform that relates the linear regression equation to the probabilities as the link function, e.g. the probit link function.
We can move the exponent back to the right and write it as:
odds = e^(b0 + b1 * X)
All of this helps us understand that indeed the model is still a linear combination of
the inputs, but that this linear combination relates to the log-odds of the default
class.
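As a minimal numeric sketch of this relationship in plain Python (the coefficients b0 and b1 and the input value below are made-up values, not fitted from any data):

import math

b0, b1 = -4.0, 2.0          # hypothetical coefficients
x = 3.0                     # hypothetical input value

log_odds = b0 + b1 * x                     # linear combination: ln(odds)
odds = math.exp(log_odds)                  # odds = e^(b0 + b1 * x)
p = odds / (1 + odds)                      # logistic function: probability of the default class
print(log_odds, odds, round(p, 3))         # 2.0, about 7.389, about 0.881

label = 1 if p >= 0.5 else 0               # crisp class prediction via a 0.5 cutoff
print(label)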
The assumptions made by logistic regression about the distribution and relationships
in your data are much the same as the assumptions made in linear regression.
Much study has gone into defining these assumptions and precise probabilistic and
statistical language is used. My advice is to use these as guidelines or rules of thumb
and experiment with different data preparation schemes.
Ultimately in predictive modelling machine learning projects you are laser focused on
making accurate predictions rather than interpreting the results. As such, you can
break some assumptions as long as the model is robust and performs well.
Support Vector Machines (SVM)
When data are unlabeled, supervised learning is not possible, and an unsupervised learning approach is required, which attempts to find natural clustering of the data into groups and then map new data to these formed groups. The support-vector clustering algorithm, created by Hava Siegelmann and Vladimir Vapnik, applies the statistics of support vectors, developed in the support vector machines algorithm, to categorize unlabeled data, and is one of the most widely used clustering algorithms in industrial applications.
Motivation
Classifying data is a common task in machine learning. Suppose some given data
points each belong to one of two classes, and the goal is to decide which class a
new data point will be in. In the case of support-vector machines, a data point is viewed as a p-dimensional vector (a list of p numbers), and we want to know whether we can separate such points with a (p-1)-dimensional hyperplane. This is called a
linear classifier. There are many hyperplanes that might classify the data. One
reasonable choice as the best hyperplane is the one that represents the largest
separation, or margin, between the two classes. So, we choose the hyperplane so
that the distance from it to the nearest data point on each side is maximized. If such
a hyperplane exists, it is known as the maximum-margin hyperplane and the linear
classifier it defines is known as a maximum-margin classifier; or equivalently, the
perceptron of optimal stability.
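A minimal sketch of fitting a maximum-margin linear classifier with scikit-learn (assumed installed; the toy points and the parameter C=1.0 are illustrative choices):

import numpy as np
from sklearn.svm import SVC

# Two linearly separable classes of 2-D points (hypothetical data).
X = np.array([[0, 0], [1, 1], [1, 0], [4, 4], [5, 5], [4, 5]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel='linear', C=1.0).fit(X, y)   # linear kernel -> separating hyperplane
print(clf.support_vectors_)                   # the points that define the margin
print(clf.predict([[2, 2], [5, 4]]))          # classify new points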
Applications
SVMs are helpful in text and hypertext categorization, as their application can
significantly reduce the need for labeled training instances in both the
standard inductive and transductive settings. Some methods for shallow
semantic parsing are based on support vector machines.
Classification of images can also be performed using SVMs. Experimental
results show that SVMs achieve significantly higher search accuracy than
traditional query refinement schemes after just three to four rounds of
relevance feedback. This is also true for image segmentation systems, including those using a modified version of SVM that uses the privileged approach suggested by Vapnik.
Classification of satellite data like SAR data using supervised SVM.
Hand-written characters can be recognized using SVM.
The SVM algorithm has been widely applied in the biological and other sciences. SVMs have been used to classify proteins, with up to 90% of the compounds classified correctly. Permutation tests based on SVM weights
have been suggested as a mechanism for interpretation of SVM models.
Support-vector machine weights have also been used to interpret SVM
models in the past. Posthoc interpretation of support-vector machine models
in order to identify features used by the model to make predictions is a
relatively new area of research with special significance in the biological
sciences.
1) Classification Models: Classification models are used for problems where the
output variable can be categorized, such as “Yes” or “No”, or “Pass” or “Fail.”
Classification Models are used to predict the category of the data. Real-life examples
include spam detection, sentiment analysis, scorecard prediction of exams, etc.
2) Regression Models: Regression models are used for problems where the output variable is a real value, such as a dollar amount, a salary, a weight, or a pressure. They are most often used to predict numerical values based on previous data observations. Some of the more familiar regression algorithms include linear regression, logistic regression, polynomial regression, and ridge regression (see the sketch below for the contrast with classification).
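To make the distinction concrete, a small sketch assuming scikit-learn is installed (the toy study-hours and salary data are hypothetical):

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Classification: the output is a category (here 0 = "Fail", 1 = "Pass").
hours = np.array([[1], [2], [3], [4], [5], [6]])
passed = np.array([0, 0, 0, 1, 1, 1])
clf = LogisticRegression().fit(hours, passed)
print(clf.predict([[3.5]]))            # predicted category

# Regression: the output is a real value (here a hypothetical salary).
years = np.array([[1], [2], [3], [4], [5]])
salary = np.array([30.0, 35.0, 41.0, 44.0, 50.0])
reg = LinearRegression().fit(years, salary)
print(reg.predict([[6]]))              # predicted numeric value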
Speech Recognition
When we use Google’s “Search by Voice” option, we are using speech recognition, a popular application of machine learning.
Image Recognition
Image recognition is used to identify objects, persons, places, and other content in digital images; a familiar example is the automatic friend-tagging suggestion offered by social media platforms.
Traffic prediction
If we want to visit a new place, we take the help of Google Maps, which shows us the correct path with the shortest route and predicts the traffic conditions. It predicts whether traffic is clear, slow-moving, or heavily congested with the help of two inputs:
Real-time location of the vehicle from the Google Maps app and sensors
Average time taken on past days at the same time of day.
Everyone who uses Google Maps is helping to make the app better: it takes information from the user and sends it back to its database to improve performance.
Self-driving cars
Machine learning plays a significant role in self-driving cars: manufacturers such as Tesla train models to detect people and objects around the vehicle while driving.
Email Spam and Malware Filtering
Incoming email is automatically routed to the inbox or the spam folder by machine-learning-based spam filters, such as:
Content filter
Header filter
General blacklists filter
Rules-based filters
Permission filters
Product recommendations
Google understands the user interest using various machine learning algorithms and
suggests products as per the customer’s interest. Similarly, when we use Netflix, we find recommendations for entertainment series, movies, etc., and this is also done with the help of machine learning.
We have various virtual personal assistants such as Google Assistant, Alexa, Cortana, and Siri. As the name suggests, they help us find information using voice instructions. These assistants can help us in various ways just by our voice instructions, such as playing music, calling someone, opening an email, scheduling an appointment, etc. Machine learning algorithms are an important part of these virtual assistants: they record our voice instructions, send them to a server in the cloud, decode them using ML algorithms, and act accordingly.
Online Fraud Detection
Machine learning is making our online transactions safe and secure by detecting fraudulent transactions. Whenever we perform an online transaction, there are various ways a fraudulent transaction can take place, such as fake accounts, fake IDs, and money being stolen in the middle of a transaction. To detect this, a feed-forward neural network helps us by checking whether a transaction is genuine or fraudulent.
For each genuine transaction, the output is converted into hash values, and these values become the input for the next round. Each genuine transaction follows a specific pattern, which changes for a fraudulent transaction; the network therefore detects the change and makes our online transactions more secure.
Stock Market Trading
Machine learning is widely used in stock market trading. In the stock market, there is always a risk of ups and downs in share prices, so long short-term memory (LSTM) neural networks are used for the prediction of stock market trends.
Medical Diagnosis
In medical science, machine learning is used for disease diagnosis. With it, medical technology is growing very fast and is able to build 3D models that can predict the exact position of lesions in the brain. This helps in finding brain tumors and other brain-related diseases easily.
Automatic Language Translation
Nowadays, if we visit a new place and are not aware of the language, it is not a problem at all, as machine learning helps here too by converting text into languages we know. Google’s GNMT (Google Neural Machine Translation) provides this feature: it is a neural machine translation system that translates text into our familiar language, and this is known as automatic translation.
Companies can mine their historical pricing data, along with data sets on a host of other variables, to understand how certain dynamics, from time of day to weather to the seasons, impact demand for goods and services. Machine learning algorithms can learn from that information and combine that insight with additional market and consumer data to help companies dynamically price their goods based on those vast and numerous variables, a strategy that ultimately helps companies maximize revenue.
The most visible example of dynamic pricing (which is sometimes called demand
pricing) happens in the transportation industry:
Think surge pricing at Uber when conditions push up the number of people seeking
rides all at once or sky-high prices for airline tickets during school vacation weeks.
Machine learning applications don’t just help companies set prices; they also help
companies deliver the right products and services to the right areas at the right time
through predictive inventory planning and customer segmentation. Retailers, for
example, use machine learning to predict what inventory will sell best in which of its
stores based on the seasonal factors impacting a particular store, the demographics
of that region and other data points such as what’s trending on social media, said
Adnan Masood who as chief architect at UST Global specializes in AI and machine
learning.
Retention. Using similar tracking techniques, text sentiment and other metadata analysis (from emails and social media posts) can be applied to detect possible job-hopping behavior among candidates.
Human resource allocation. You can use historical data from HR software (sick days, vacations, holidays, etc.) to make broader predictions about your workforce.
Deloitte disclosed that several automotive companies are learning from the patterns
of unscheduled absences to forecast the periods when people are likely to take a
day off and reserve more workforce.
Big e-commerce companies like Amazon and Walmart use recommendation engines
to personalize and expedite the shopping experience.
Another well-known deployer of this machine learning application is Netflix, the
streaming entertainment service, which uses a customer’s viewing history, the
viewing history of customers with similar entertainment interests, information about
individual shows and other data points to deliver personalized recommendations to
its customers.
Online video platform YouTube uses recommendation engine technology to help
users quickly find videos that fit their tastes.
Digital marketing and online-driven sales are the first application fields that you may
think of for machine learning adoption. People interact with the web and leave a
detailed footprint to be analyzed. While there are tangible results in unsupervised
learning techniques for marketing and sales, the largest value impact is in the
supervised learning field. Let’s have a look.
Churn. The churn rate defines the number of customers who cease to complete target actions (e.g. add to cart, leave a comment, checkout, etc.) during a given period. Like lifetime value predictions, sorting “likely-to-churn-soon” customers from engaged ones allows you to focus retention efforts on the customers who are most likely to leave before they actually do.
Unsupervised Learning
Unsupervised learning is a type of machine learning in which the algorithm is not
provided with any pre-assigned labels or scores for the training data. As a result,
unsupervised learning algorithms must first self-discover any naturally occurring
patterns in that training data set. Common examples include clustering, where the
algorithm automatically groups its training examples into categories with similar
features, and principal component analysis, where the algorithm finds ways to
compress the training data set by identifying which features are most useful for
discriminating between different training examples and discarding the rest. This
contrasts with supervised learning in which the training data include pre-assigned
category labels (often by a human, or from the output of a non-learning classification algorithm). Other intermediate levels in the supervision spectrum include
reinforcement learning, where only numerical scores are available for each training
example instead of detailed tags, and semi-supervised learning where only a portion
of the training data have been tagged.
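As a minimal sketch of the two examples mentioned above, clustering and principal component analysis, using scikit-learn (assumed installed; the iris data and the parameter choices are illustrative):

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)       # labels are ignored: unsupervised setting

# Clustering: group the samples into 3 clusters without using any labels.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(labels[:10])

# PCA: compress the 4 features down to 2 components.
X2 = PCA(n_components=2).fit_transform(X)
print(X2.shape)                          # (150, 2)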
Approaches
(1) Clustering
(2) Anomaly detection
(3) Neural networks (note that not all neural networks are unsupervised; they can be trained by supervised, unsupervised, semi-supervised, or reinforcement methods)
Method of moments
One statistical approach for unsupervised learning is the method of moments. In the
method of moments, the unknown parameters of interest in the model are related to
the moments of one or more random variables. These moments are empirically
estimated from the available data samples and used to calculate the most likely
value distributions for each parameter. The method of moments is shown to be
effective in learning the parameters of latent variable models, where in addition to
the observed variables available in the training and input data sets, several
unobserved latent variables are also assumed to exist and to determine the categorization of each sample. One practical example of latent variable models in
machine learning is topic modeling, which is a statistical model for predicting the
words (observed variables) in a document based on the topic (latent variable) of the
document. The method of moments (tensor decomposition techniques) has been
shown to consistently recover the parameters of a large class of latent variable
models under certain assumptions.
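As a much simpler illustration than the tensor-decomposition machinery mentioned above, the sketch below matches the first two empirical moments of a sample to the mean and variance parameters of a Gaussian model (numpy assumed; the generating parameters 5.0 and 2.0 are made up):

import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=5.0, scale=2.0, size=10_000)   # data from an "unknown" Gaussian

# Method of moments: relate the model parameters to empirical moments.
m1 = sample.mean()                 # first moment  -> estimate of the mean
m2 = (sample ** 2).mean()          # second moment
var_hat = m2 - m1 ** 2             # variance = E[X^2] - (E[X])^2
print(m1, var_hat)                 # should come out close to 5.0 and 4.0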
Clustering: Clustering is a method of grouping objects into clusters such that objects with the most similarities remain in one group and have few or no similarities with the objects of another group. Cluster analysis finds the commonalities between the data objects and categorizes them as per the presence and absence of those commonalities.
Disadvantages
Unsupervised learning is intrinsically more difficult than supervised learning, as there are no labels against which to check the results, so the output can be less accurate and harder to evaluate.
Some popular unsupervised learning algorithms include:
K-means clustering
KNN (k-nearest neighbors)
Hierarchical clustering
Anomaly detection
Neural networks
Principal Component Analysis
Independent Component Analysis
Apriori algorithm
Singular value decomposition
Cluster analysis itself is not one specific algorithm, but the general task to be solved.
It can be achieved by various algorithms that differ significantly in their
understanding of what constitutes a cluster and how to efficiently find them. Popular
notions of clusters include groups with small distances between cluster members,
dense areas of the data space, intervals, or statistical distributions. Clustering can
therefore be formulated as a multi-objective optimization problem. The appropriate
clustering algorithm and parameter settings (including parameters such as the
distance function to use, a density threshold or the number of expected clusters)
depend on the individual data set and intended use of the results. Cluster analysis as
such is not an automatic task, but an iterative process of knowledge discovery or
interactive multi-objective optimization that involves trial and failure. It is often
necessary to modify data pre-processing and model parameters until the result
achieves the desired properties.
Besides the term clustering, there are several terms with similar meanings, including
automatic classification, numerical taxonomy, botryology (from Greek βότρυς
“grape”), typological analysis, and community detection. The subtle differences are
often in the use of the results: while in data mining, the resulting groups are the
matter of interest, in automatic classification the resulting discriminative power is of
interest.
Cluster analysis originated in anthropology with Driver and Kroeber in 1932, was introduced to psychology by Joseph Zubin in 1938 and Robert Tryon in 1939, and was famously used by Cattell beginning in 1943 for trait theory classification in personality psychology.
The notion of a “cluster” cannot be precisely defined, which is one of the reasons
why there are so many clustering algorithms. There is a common denominator: a
group of data objects. However, different researchers employ different cluster
models, and for each of these cluster models again different algorithms can be
given. The notion of a cluster, as found by different algorithms, varies significantly in
its properties. Understanding these “cluster models” is key to understanding the
differences between the various algorithms. Typical cluster models include connectivity models (as in hierarchical clustering), centroid models (as in k-means), distribution models (as in Gaussian mixture models), and density models (as in DBSCAN).
Hierarchical Clustering
In data mining and statistics, hierarchical clustering (also called hierarchical cluster
analysis or HCA) is a method of cluster analysis which seeks to build a hierarchy of
clusters.
In general, the merges and splits are determined in a greedy manner. The results of
hierarchical clustering are usually presented in a dendrogram.
Typically, the distance between two clusters is computed based on the length of the straight line drawn from one cluster to the other; this is commonly referred to as the Euclidean distance. Many other distance metrics have been developed.
The choice of distance metric should be made based on theoretical concerns from
the domain of study. That is, a distance metric needs to define similarity in a way
that is sensible for the field of study. For example, if clustering crime sites in a city,
city block distance may be appropriate. Or, better yet, the time taken to travel
between each location. Where there is no theoretical justification for an alternative,
the Euclidean should generally be preferred, as it is usually the appropriate measure
of distance in the physical world.
Linkage Criteria
As with distance metrics, the choice of linkage criteria should be made based on
theoretical considerations from the domain of application. A key theoretical issue is
what causes variation. For example, in archeology, we expect variation to occur
through innovation and natural resources, so working out if two groups of artifacts
are similar may make sense based on identifying the most similar members of the
cluster.
Where there are no clear theoretical justifications for the choice of linkage criteria,
Ward’s method is the sensible default. This method works out which observations to
group based on reducing the sum of squared distances of each observation from the
average observation in a cluster. This is often appropriate as this concept of distance
matches the standard assumptions of how to compute differences between groups
in statistics (e.g., ANOVA, MANOVA).
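A minimal sketch of agglomerative clustering with Ward's method using SciPy (assumed installed; the 2-D points and the choice of 3 flat clusters are hypothetical):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical 2-D observations.
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9], [9.0, 1.0]])

Z = linkage(X, method='ward', metric='euclidean')   # Ward linkage on Euclidean distances
print(Z)                                            # the merge history (dendrogram data)

labels = fcluster(Z, t=3, criterion='maxclust')     # cut the tree into 3 flat clusters
print(labels)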
Partitioning Method and K-Means
In the partitioning method, given a database D that contains N objects, the method constructs K user-specified partitions of the data, in which each partition represents a cluster and a particular region. Many algorithms come under the partitioning method; some of the popular ones are K-Means, PAM (K-Medoids), and CLARA (Clustering Large Applications).
The K-means algorithm takes the input parameter K from the user and partitions the dataset containing N objects into K clusters so that the resulting similarity among the data objects inside a group (intracluster similarity) is high, while the similarity of data objects with data objects outside their cluster (intercluster similarity) is low. The similarity of a cluster is determined with respect to the mean value of the cluster.
It is a type of squared-error algorithm. At the start, k objects are chosen at random from the dataset, each representing a cluster mean (center). Each of the remaining data objects is assigned to the nearest cluster based on its distance from the cluster mean. The new mean of each cluster is then recalculated from the objects assigned to it.
Method (a code sketch follows these steps):
1. Arbitrarily choose k objects from the dataset as the initial cluster centers.
2. Repeat:
(Re)assign each object to the cluster to which the object is most similar, based upon the mean values.
Update the cluster means, i.e., recalculate the mean of each cluster with the updated values.
3. Until the assignments no longer change.
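A minimal numpy sketch of this procedure (Lloyd's algorithm; the toy data, the fixed iteration count, and the seed are illustrative choices, not from the source):

import numpy as np

def k_means(X, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # choose k objects as initial means
    for _ in range(n_iter):
        # (Re)assign each object to the nearest cluster mean.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Update: recalculate each cluster mean from its assigned objects.
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return centers, assign

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 8.5], [0.5, 1.2]])
centers, assign = k_means(X, k=2)
print(centers)
print(assign)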
Density Reachable:
A point p is density-reachable from a point q (with respect to Eps and MinPts) if there is a chain of points p1, ..., pn, with p1 = q and pn = p, such that each p(i+1) is directly density-reachable from p(i).
Major features:
DBSCAN discovers clusters of arbitrary shape, handles noise, needs only one scan of the data, and requires density parameters (Eps, MinPts) as a termination condition.
DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points.
DBSCAN Algorithm
Arbitrarily select a point p.
Retrieve all points density-reachable from p with respect to Eps and MinPts.
If p is a core point, a cluster is formed.
If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.
Continue the process until all of the points have been processed.
This approach is good for both automatic and interactive cluster analysis, including finding an intrinsic clustering structure, and the result can be represented graphically or using visualization techniques.
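A minimal sketch using scikit-learn's DBSCAN implementation (assumed installed; the eps and min_samples values and the toy points are illustrative):

import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups plus one isolated point (hypothetical data).
X = np.array([[1.0, 1.0], [1.1, 1.2], [0.9, 1.1],
              [8.0, 8.0], [8.2, 8.1], [7.9, 8.2],
              [5.0, 0.0]])

db = DBSCAN(eps=0.5, min_samples=2).fit(X)
print(db.labels_)     # cluster ids; -1 marks noise points such as the isolated one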
Association Rule Learning
Association rule learning is one of the very important concepts of machine learning, and it is employed in market basket analysis, web usage mining, continuous production, etc. Market basket analysis is a technique used by various big retailers to discover associations between items. We can understand it by taking the example of a supermarket, where products that are frequently purchased together are placed together.
In addition to the above example from market basket analysis association rules are
employed today in many application areas including Web usage mining, intrusion
detection, continuous production, and bioinformatics. In contrast with sequence
mining, association rule learning typically does not consider the order of items either
within a transaction or across transactions.
Apriori
The Apriori algorithm uses frequent itemsets to generate association rules and is designed to work on databases that contain transactions. It counts candidate itemsets level by level in a breadth-first manner.
Eclat
The Eclat algorithm stands for Equivalence Class Transformation. This algorithm uses a depth-first search technique to find frequent itemsets in a transaction database. It generally executes faster than the Apriori algorithm.
F-P Growth
The F-P growth algorithm stands for Frequent Pattern growth, and it is an improved version of the Apriori algorithm. It represents the database in the form of a tree structure known as a frequent pattern tree (FP-tree). The purpose of this frequent pattern tree is to extract the most frequent patterns.
Association rule learning works on the concept of if-then statements, such as if A then B:
If A -> Then B
Here A is called the antecedent and B the consequent. The strength of an association rule is measured with three metrics:
Support
Confidence
Lift
Let’s understand each of them:
Support
Support tells how frequently an itemset appears in the dataset: it is the fraction of all transactions that contain the item X.
Confidence
Confidence indicates how often the rule has been found to be true, i.e., how often the items X and Y occur together in the dataset when the occurrence of X is already given. It is the ratio of the number of transactions that contain both X and Y to the number of transactions that contain X.
Lift
It is the ratio of the observed support measure to the expected support if X and Y were independent of each other. It has three possible values:
Lift = 1: The occurrence of the antecedent and the consequent are independent of each other.
Lift > 1: It determines the degree to which the two itemsets are dependent on each other.
Lift < 1: It tells us that one item is a substitute for the other, which means one item has a negative effect on the other.
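A minimal Python sketch computing support, confidence, and lift for a hypothetical rule {bread} -> {butter} over a toy list of transactions (all names and data are illustrative):

transactions = [
    {'bread', 'butter', 'milk'},
    {'bread', 'butter'},
    {'bread', 'jam'},
    {'milk', 'butter'},
    {'bread', 'butter', 'jam'},
]

def support(itemset):
    # Fraction of transactions containing every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {'bread'}, {'butter'}
supp_xy = support(X | Y)                      # support of X and Y together
conf = supp_xy / support(X)                   # confidence = supp(X and Y) / supp(X)
lift = supp_xy / (support(X) * support(Y))    # lift = supp(X and Y) / (supp(X) * supp(Y))
print(supp_xy, conf, lift)                    # roughly 0.6, 0.75, 0.94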