Notes
The SAS Institute developed SEMMA as its process for data mining. The name is an
acronym for its five steps: Sample, Explore, Modify, Model, and Assess. The
data mining method can be used to solve a wide range of business problems,
including fraud identification, customer retention and turnover, database marketing,
customer loyalty, bankruptcy forecasting, market segmentation, as well as risk,
affinity, and portfolio analysis.
Why SEMMA?
Data is used by businesses to achieve a competitive advantage, improve
performance, and deliver more useful services to customers. The data we collect
about our surroundings serves as the foundation for hypotheses and models of the
world we live in.
Ultimately, data is accumulated to help in collecting knowledge. That means the data
is not worth much until it is studied and analyzed. But hoarding vast volumes of data
is not equivalent to gathering valuable knowledge. It is only when data is sorted and
evaluated that we learn anything from it.
The SEMMA process breaks down into five stages:
Sample: This step entails selecting a subset of appropriate size from the vast dataset
provided for the model’s construction. The goal of this initial stage of the process is
to identify the variables or factors (both dependent and independent) that influence
the process. The selected data is then partitioned into preparation (training) and
validation sets.
Explore: During this step, univariate and multivariate analysis is conducted in
order to study interconnected relationships between data elements and to
identify gaps in the data. While the multivariate analysis studies the
relationship between variables, the univariate one looks at each factor
individually to understand its part in the overall scheme. All of the factors that may
influence the study’s outcome are analyzed, with heavy reliance on data
visualization.
Modify: In this step, business logic is applied to the lessons learned during
exploration of the data collected in the sample phase. In other words, the data is
parsed and cleaned, refined and transformed where necessary, and then passed on
to the modeling stage.
Model: With the variables refined and data cleaned, the modeling step applies
a variety of data mining techniques in order to produce a projected model of
how this data achieves the final, desired outcome of the process.
Assess: In this final SEMMA stage, the model is evaluated for how useful and
reliable it is for the studied topic. The model can now be tested on held-out data to
estimate how well it performs.
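As a rough end-to-end illustration of the five stages above (not tied to SAS Enterprise Miner), the sketch below uses pandas and scikit-learn on a hypothetical customer-churn table; the file name and column names are made up, and all feature columns are assumed numeric.

# Hypothetical walk-through of the five SEMMA stages with pandas/scikit-learn.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Sample: draw a manageable subset and split it into preparation and validation data.
df = pd.read_csv("customers.csv").sample(n=10_000, random_state=42)   # hypothetical file
X = df.drop(columns=["churned"])                                      # hypothetical target column
y = df["churned"]
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=42)

# Explore: univariate summaries and pairwise correlations to spot gaps and relationships.
print(X_train.describe())
print(X_train.corr(numeric_only=True))

# Modify: clean and transform the variables (impute missing values, standardize).
X_train = X_train.fillna(X_train.mean(numeric_only=True))
X_valid = X_valid.fillna(X_train.mean(numeric_only=True))
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_valid_s = scaler.transform(X_valid)

# Model: fit a predictive model on the prepared data.
model = LogisticRegression(max_iter=1000).fit(X_train_s, y_train)

# Assess: evaluate how reliable the model is on the held-out validation data.
print("validation accuracy:", accuracy_score(y_valid, model.predict(X_valid_s)))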
The term Knowledge Discovery in Databases, or KDD for short, refers to the broad
process of finding knowledge in data, and emphasizes the "high-level" application
of particular data mining methods. It is of interest to researchers in machine
learning, pattern recognition, databases, statistics, artificial intelligence, knowledge
acquisition for expert systems, and data visualization.
The unifying goal of the KDD process is to extract knowledge from data in the
context of large databases. The process is typically broken down into the following steps:
1. Developing an understanding of
o the application domain
o the relevant prior knowledge
o the goals of the end-user
2. Creating a target data set: selecting a data set, or focusing on a subset of
variables, or data samples, on which discovery is to be performed.
3. Data cleaning and preprocessing.
o Removal of noise or outliers.
o Collecting necessary information to model or account for noise.
o Strategies for handling missing data fields.
o Accounting for time sequence information and known changes.
4. Data reduction and projection.
o Finding useful features to represent the data depending on the goal of
the task.
o Using dimensionality reduction or transformation methods to reduce
the effective number of variables under consideration or to find
invariant representations for the data.
5. Choosing the data mining task.
o Deciding whether the goal of the KDD process is classification,
regression, clustering, etc.
6. Choosing the data mining algorithm(s).
o Selecting method(s) to be used for searching for patterns in the data.
o Deciding which models and parameters may be appropriate.
o Matching a particular data mining method with the overall criteria of
the KDD process.
7. Data mining.
o Searching for patterns of interest in a particular representational form
or a set of such representations as classification rules or trees,
regression, clustering, and so forth.
8. Interpreting mined patterns.
9. Consolidating discovered knowledge.
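As a minimal sketch of steps 3 through 7 above, assuming a purely numeric feature matrix, the following code cleans missing values, reduces dimensionality with PCA, and mines clusters with k-means; the data, the choice of task, and the parameter values are illustrative rather than prescribed by KDD.

# Illustrative KDD-style pipeline: clean -> reduce -> mine (cluster) -> interpret.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))          # stand-in for a selected target data set
X[rng.random(X.shape) < 0.05] = np.nan  # simulate missing data fields

# Step 3: data cleaning and preprocessing (handle missing values).
X_clean = SimpleImputer(strategy="mean").fit_transform(X)

# Step 4: data reduction and projection (dimensionality reduction).
X_reduced = PCA(n_components=5, random_state=0).fit_transform(X_clean)

# Steps 5-7: choose a mining task (clustering), an algorithm (k-means), and run it.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_reduced)

# Step 8: interpret the mined patterns, e.g. the size of each cluster.
print(np.bincount(labels))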
Machine learning (ML) is an important tool for the goal of leveraging technologies
around artificial intelligence. Because of its learning and decision-making abilities,
machine learning is often referred to as AI, though, in reality, it is a subdivision of AI.
Until the late 1970s, it was a part of AI’s evolution. Then, it branched off to evolve on
its own. Machine learning has become an important tool for cloud computing and
e-commerce, and is being used in a variety of cutting-edge technologies.
Machine learning is a necessary aspect of modern business and research for many
organizations today. It uses algorithms and neural network models to assist
computer systems in progressively improving their performance. Machine learning
algorithms automatically build a mathematical model using sample data – also
known as “training data” – to make decisions without being specifically programmed
to make those decisions.
Machine Learning is the branch of computer science that deals with the development of
computer programs that learn and improve on their own. According to Arthur Samuel, an American
pioneer in computer gaming, Machine Learning is the subfield of computer science that “gives
the computer the ability to learn without being explicitly programmed.” Machine Learning
allows developers to build algorithms that automatically improve themselves by finding patterns
in the existing data, without explicit instructions from a human or developer. Machine Learning
relies heavily on data; generally, the more data available, the more effective the learning.
The Machine Learning development approach includes learning from data inputs and evaluating
and optimizing the model results. Machine Learning is widely used in data analytics as a method
to develop algorithms for making predictions on data. Machine Learning is related to probability,
statistics, and linear algebra.
Machine Learning is broadly classified into three categories depending on the nature of the
learning ‘signal’ or ‘feedback’ available to a learning system.
1. Supervised learning: The computer is presented with inputs and their desired outputs. The goal is to
learn a general rule that maps inputs to outputs.
2. Unsupervised learning: The computer is presented with inputs without desired outputs; the goal is to
find structure in the inputs.
3. Reinforcement learning: A computer program interacts with a dynamic environment and must
achieve a certain goal without a guide or teacher.
Machine Learning takes advantage of the ability of computer systems to learn from correlations
hidden in the data; this ability can be further utilized by programming or developing intelligent
and efficient Machine Learning algorithms. While Machine Learning may seem new, it has been
around long before it became a popular technology. It has evolved to solve real problems of
human life and to automate processes in industries such as banking, healthcare, telecom, and
retail. Software developed using Machine Learning can learn from its dynamic environment and
adapt to changing requirements.
In contrast to traditional software implementation, the lessons learned from Machine Learning
algorithms can be scaled and transferred across multiple applications. Machine Learning
naturally considers a large number of variables that influence the results or observations, which
can be used in both science and business. Because of all these features and advantages,
today’s software is developed for automated decision making and more innovative solutions,
which makes an investment in Machine Learning a natural evolution of technology.
Machine Learning has been one of the most active and rewarding areas of research due to its
widespread use in many areas. It has brought a monumental shift in technology and its
applications. Some of the evolutions, which made a huge positive impact on real-world problem
solving, are highlighted in the following sections.
For instance, suppose you are given a basket filled with different kinds of
fruits. Now the first step is to train the machine with all the different fruits one
by one like this:
If the shape of the object is rounded, has a depression at the top, and is
red in color, then it is labeled as Apple.
If the shape of the object is a long curving cylinder with a green-yellow
color, then it is labeled as Banana.
Now suppose that, after training, the machine is given a new fruit from the
basket, say a banana, and is asked to identify it.
Since the machine has already learned from the previous data, it now has to
apply that knowledge. It first classifies the fruit by its shape and color,
confirms the fruit name as Banana, and puts it in the Banana category. Thus
the machine learns from the training data (the basket of fruits) and then
applies that knowledge to the test data (the new fruit).
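A minimal sketch of this fruit example as a supervised classifier, with made-up numeric encodings of shape and colour; the features, values, and labels are purely illustrative.

# Toy supervised-learning version of the fruit example using a decision tree.
from sklearn.tree import DecisionTreeClassifier

# Hypothetical features: [roundness (0 = long, 1 = round), redness (0 = green/yellow, 1 = red)]
X_train = [[1.0, 0.9], [0.9, 0.8], [0.1, 0.1], [0.2, 0.0]]
y_train = ["Apple", "Apple", "Banana", "Banana"]

model = DecisionTreeClassifier().fit(X_train, y_train)

# A new fruit from the basket: long and green-yellow, so it should be labeled Banana.
print(model.predict([[0.15, 0.05]]))   # ['Banana']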
Supervised learning is classified into two categories of algorithms:
Classification: A classification problem is when the output variable is a
category, such as “red” or “blue”, or “disease” or “no disease”.
Regression: A regression problem is when the output variable is a real
value, such as “dollars” or “weight”.
Supervised learning deals with or learns with “labeled” data. This implies that
some data is already tagged with the correct answer.
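For contrast with the classification example above, the following sketch fits a regression model that predicts a real value; the sizes and weights are invented.

# Toy regression example: predict a fruit's weight (a real value) from its size.
from sklearn.linear_model import LinearRegression

sizes = [[6.0], [7.0], [8.0], [9.0]]      # hypothetical fruit diameters in cm
weights = [120.0, 150.0, 180.0, 210.0]    # corresponding weights in grams

reg = LinearRegression().fit(sizes, weights)
print(reg.predict([[7.5]]))               # roughly 165 g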
Types:-
Regression
Logistic Regression
Classification
Naive Bayes Classifiers
K-NN (k nearest neighbors)
Decision Trees
Support Vector Machine
Advantages:-
Supervised learning allows collecting data and producing output based on
previous experience.
Helps to optimize performance criteria with the help of experience.
Supervised machine learning helps to solve various types of real-world
computation problems.
Disadvantages:-
Classifying big data can be challenging.
Training for supervised learning needs a lot of computation time and
resources.
Unsupervised learning
Unsupervised learning is the training of a machine using information that is
neither classified nor labeled and allowing the algorithm to act on that
information without guidance. Here the task of the machine is to group
unsorted information according to similarities, patterns, and differences
without any prior training of data.
Unlike supervised learning, no teacher is provided, which means no training
labels are given to the machine. Therefore the machine must find the
hidden structure in unlabeled data by itself.
For instance, suppose the machine is given images containing both dogs and
cats that it has never seen before.
The machine has no idea about the features of dogs and cats, so it cannot
categorize them as ‘dogs’ and ‘cats’. But it can group them according
to their similarities, patterns, and differences; that is, it can divide
the images into two parts, where the first part contains all pictures with
dogs and the second contains all pictures with cats. Nothing was learned
beforehand, meaning there is no training data or examples.
It allows the model to work on its own to discover patterns and information
that was previously undetected. It mainly deals with unlabelled data.
Unsupervised learning is classified into two categories of algorithms:
Clustering: A clustering problem is where you want to discover the
inherent groupings in the data, such as grouping customers by
purchasing behavior.
Association: An association rule learning problem is where you want to
discover rules that describe large portions of your data, such as people
that buy X also tend to buy Y.
Types of Unsupervised Learning:-
Clustering
1. Exclusive (partitioning)
2. Agglomerative
3. Overlapping
4. Probabilistic
Commonly used unsupervised learning algorithms:-
1. Hierarchical clustering
2. K-means clustering
3. Principal Component Analysis
4. Singular Value Decomposition
5. Independent Component Analysis
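A minimal sketch of one of the algorithms listed above, k-means clustering, grouping customers by purchasing behavior; the data and the number of clusters are invented.

# Unsupervised learning: group customers by purchasing behavior with k-means.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Hypothetical features: [monthly spend, purchase frequency], with two natural groups.
low_spenders = rng.normal(loc=[50, 2], scale=[10, 1], size=(50, 2))
high_spenders = rng.normal(loc=[400, 10], scale=[50, 2], size=(50, 2))
X = np.vstack([low_spenders, high_spenders])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X)
print(kmeans.cluster_centers_)   # one center per discovered group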
Supervised vs. Unsupervised Machine Learning
Parameter                  | Supervised machine learning | Unsupervised machine learning
Computational complexity   | Simpler method              | Computationally complex
What is ANN?
Artificial Neural Networks are a special type of machine learning
algorithm modeled after the human brain. Just as the neurons in our
nervous system are able to learn from past data, an ANN is able to learn
from data and provide responses in the form of predictions or
classifications.
ANNs are nonlinear statistical models which display a complex relationship
between the inputs and outputs to discover new patterns. A variety of tasks
such as image recognition, speech recognition, machine translation, and
medical diagnosis make use of these artificial neural networks.
An important advantage of an ANN is that it learns from example data
sets. The most common usage of an ANN is as a random function
approximator; with tools of this type, one has a cost-effective way of
arriving at solutions that define the distribution. An ANN is also
capable of working from sample data rather than the entire dataset to
produce its output. With ANNs, one can enhance existing data analysis
techniques owing to their advanced predictive capabilities.
Each node in the network has some weights assigned to it. A transfer
function is used for calculating the weighted sum of the inputs and the bias.
After the transfer function has calculated the sum, the activation function
obtains the result. Based on the output received, the activation functions
fire the appropriate result from the node. For example, if the output
received is above 0.5, the activation function fires a 1; otherwise it
outputs a 0.
Some of the popular activation functions used in Artificial Neural Networks
are sigmoid, ReLU, softmax, and tanh.
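The sketch below implements these activation functions with NumPy; the formulas are standard, and the input values are arbitrary.

# Common activation functions used in artificial neural networks.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def tanh(z):
    return np.tanh(z)

def softmax(z):
    e = np.exp(z - np.max(z))   # shift for numerical stability
    return e / e.sum()

z = np.array([-1.0, 0.0, 2.0])
print(sigmoid(z), relu(z), tanh(z), softmax(z), sep="\n")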
Based on the value that the node has fired, we obtain the final output. Then,
using the error functions, we calculate the discrepancies between the
predicted output and the target output and adjust the weights of the neural
network through a process known as backpropagation.
Back Propagation in Artificial Neural Networks
In order to train a neural network, we provide it with examples of input-
output mappings. When the neural network completes training, we test it on
examples for which we do not provide these mappings. The neural network
predicts the output and we evaluate how correct the output is using various
error functions. Finally, based on the result, the model adjusts the weights
of the neural network to optimize it, following gradient descent through the
chain rule.
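A bare-bones sketch of this training loop for a single sigmoid neuron, using a squared-error-style update and plain gradient descent; the data, learning rate, and number of iterations are arbitrary.

# Training a single sigmoid neuron with gradient descent (a tiny backpropagation step).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                 # inputs
y = (X[:, 0] + X[:, 1] > 0).astype(float)     # target input-output mapping

w = np.zeros(2)
b = 0.0
lr = 0.5

for _ in range(200):
    z = X @ w + b                             # weighted sum plus bias
    a = 1.0 / (1.0 + np.exp(-z))              # activation (the prediction)
    error = a - y                             # discrepancy between prediction and target
    grad_z = error * a * (1.0 - a)            # chain rule through the sigmoid
    w -= lr * X.T @ grad_z / len(X)           # adjust weights down the gradient
    b -= lr * grad_z.mean()

print("training accuracy:", ((a > 0.5) == y).mean())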
In a diagram of such a network, each circle represents a neuron-like unit
called a node, and nodes are simply where calculations take place. The nodes are
connected to each other across layers, but no two nodes of the same layer
are linked.
Now let’s follow that single pixel value, x, through the two-layer net. At node 1
of the hidden layer, x is multiplied by a weight and added to a so-called bias.
The result of those two operations is fed into an activation function, which
produces the node’s output, or the strength of the signal passing through it,
given input x.
Next, let’s look at how several inputs would combine at one hidden node.
Each x is multiplied by a separate weight, the products are summed, added to
a bias, and again the result is passed through an activation function to
produce the node’s output.
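In code, the operations at one hidden node look roughly like this (the input values, weights, and bias are arbitrary, and sigmoid is just one possible activation function):

# Several inputs combining at one hidden node.
import numpy as np

x = np.array([0.2, 0.7, 0.1, 0.9])     # inputs (e.g. pixel values)
w = np.array([0.4, -0.3, 0.8, 0.1])    # one weight per input
b = 0.05                               # bias

z = np.dot(x, w) + b                   # weighted sum plus bias
output = 1.0 / (1.0 + np.exp(-z))      # sigmoid activation: the node's output
print(output)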
Because inputs from all visible nodes are being passed to all hidden nodes,
an RBM can be defined as a symmetrical bipartite graph.
Each hidden node receives the four inputs multiplied by their respective
weights. The sum of those products is again added to a bias (which forces at
least some activations to happen), and the result is passed through the
activation algorithm producing one output for each hidden node.
If these two layers were part of a deeper neural network, the outputs of hidden
layer no. 1 would be passed as inputs to hidden layer no. 2, and from there
through as many hidden layers as you like until they reach a final classifying
layer. (For simple feed-forward movements, the RBM nodes function as
an autoencoder and nothing more.)
In the reconstruction phase, the activations of hidden layer no. 1 become the
input in a backward pass. They are multiplied by the same weights, one per
internode edge, just as x was weight-adjusted on the forward pass. The sum
of those products is added to a visible-layer bias at each visible node, and the
output of those operations is a reconstruction, i.e. an approximation of the
original input.
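A NumPy sketch of one forward pass and one reconstruction pass through a tiny RBM; because the weights here are random, the reconstruction error is large, as the next paragraph explains.

# One forward pass and one reconstruction pass through a tiny RBM.
import numpy as np

rng = np.random.default_rng(0)

n_visible, n_hidden = 4, 3
W = rng.normal(scale=0.1, size=(n_visible, n_hidden))  # one weight per inter-node edge
b_hidden = np.zeros(n_hidden)                          # hidden-layer bias
b_visible = np.zeros(n_visible)                        # visible-layer bias

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

v = rng.integers(0, 2, size=n_visible).astype(float)   # input on the visible layer

# Forward pass: visible -> hidden activations, p(a | x; w).
h = sigmoid(v @ W + b_hidden)

# Reconstruction: hidden -> visible, using the same weights, p(x | a; w).
r = sigmoid(h @ W.T + b_visible)

print("reconstruction error:", np.sum((v - r) ** 2))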
Because the weights of the RBM are randomly initialized, the difference
between the reconstructions and the original input is often large. You can think
of reconstruction error as the difference between the values of r and the input
values, and that error is then backpropagated against the RBM’s weights,
again and again, in an iterative learning process until an error minimum is
reached.
As you can see, on its forward pass, an RBM uses inputs to make predictions
about node activations, or the probability of output given a weighted x: p(a|x;
w).
But on its backward pass, when activations are fed in and reconstructions, or
guesses about the original data, are spit out, an RBM is attempting to estimate
the probability of inputs x given activations a, which are weighted with the
same coefficients as those used on the forward pass. This second phase can
be expressed as p(x|a; w).
Together, those two estimates will lead you to the joint probability distribution
of inputs x and activations a, or p(x, a).
Let’s imagine that both the input data and the reconstructions are normal
curves of different shapes, which only partially overlap.
To measure the distance between its estimated probability distribution and the
ground-truth distribution of the input, RBMs use Kullback Leibler Divergence.
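As a reminder of what that measures, the sketch below computes the KL divergence between two small discrete distributions; the two distributions stand in for the input and reconstruction distributions and are otherwise arbitrary.

# Kullback-Leibler divergence between two discrete probability distributions.
import numpy as np

def kl_divergence(p, q):
    # KL(p || q) = sum_i p_i * log(p_i / q_i); assumes p and q sum to 1 and q > 0.
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

ground_truth = [0.1, 0.4, 0.5]      # stand-in for the input distribution
estimate = [0.2, 0.3, 0.5]          # stand-in for the reconstruction distribution
print(kl_divergence(ground_truth, estimate))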
Probability Distributions
Let’s talk about probability distributions for a moment. If you’re rolling two dice,
the probability distribution over all possible sums peaks at 7.
That is, 7s are the most likely because there are more ways to get to 7 (3+4,
1+6, 2+5) than there are ways to arrive at any other sum between 2 and 12.
Any formula attempting to predict the outcome of dice rolls needs to take
seven’s greater frequency into account.
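This claim is easy to verify by enumerating all 36 equally likely outcomes of two dice:

# Probability distribution of the sum of two dice.
from collections import Counter

counts = Counter(d1 + d2 for d1 in range(1, 7) for d2 in range(1, 7))
for total in range(2, 13):
    print(total, counts[total] / 36)   # 7 has probability 6/36, the highest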
In the same way, image datasets have unique probability distributions for their
pixel values, depending on the kind of images in the set. Pixel values are
distributed differently depending on whether the dataset includes MNIST’s
handwritten numerals or the headshots found in Labeled Faces in the Wild.
Imagine for a second an RBM that was only fed images of elephants and
dogs, and which had only two output nodes, one for each animal. The
question the RBM is asking itself on the forward pass is: Given these pixels,
should my weights send a stronger signal to the elephant node or the dog
node? And the question the RBM asks on the backward pass is: Given an
elephant, which distribution of pixels should I expect?
One last point: You’ll notice that RBMs have two biases. This is one aspect
that distinguishes them from other autoencoders. The hidden bias helps the
RBM produce the activations on the forward pass (since biases impose a floor
so that at least some nodes fire no matter how sparse the data), while
the visible layer’s biases help the RBM learn the reconstructions on the
backward pass.
Multiple Layers
Once this RBM learns the structure of the input data as it relates to the
activations of the first hidden layer, then the data is passed one layer down
the net. Your first hidden layer takes on the role of visible layer. The
activations now effectively become your input, and they are multiplied by
weights at the nodes of the second hidden layer, to produce another set of
activations.
With each new hidden layer, the weights are adjusted until that layer is able to
approximate the input from the previous layer. This is greedy, layerwise and
unsupervised pre-training. It requires no labels to improve the weights of the
network, which means you can train on unlabeled data, untouched by human
hands, which is the vast majority of data in the world. As a rule, algorithms
exposed to more data produce more accurate results, and this is one of the
reasons why deep-learning algorithms are kicking butt.
Because those weights already approximate the features of the data, they are
well positioned to learn better when, in a second step, you try to classify
images with the deep-belief network in a subsequent supervised learning
stage.
While RBMs have many uses, proper initialization of weights to facilitate later
learning and classification is one of their chief advantages. In a sense, they
accomplish something similar to backpropagation: they push weights to model
data well. You could say that pre-training and backprop are substitutable
means to the same end.
For those interested in studying the structure of RBMs in greater depth, they
are one type of undirected graphical model, also called a Markov random field.
Code Sample: Stacked RBMs
https://github.com/deeplearning4j/dl4j-examples/blob/master/dl4j-examples/src/main/java/org/deeplearning4j/examples/unsupervised/deepbelief/DeepAutoEncoderExample.java
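The linked example is written in Java with Deeplearning4j. As a rough Python analogue (not a translation of that code), the sketch below stacks two BernoulliRBM layers from scikit-learn and feeds the learned features into a logistic regression classifier; the dataset and hyperparameters are illustrative.

# Rough Python analogue of a stacked-RBM (deep-belief-style) pipeline with scikit-learn.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline

X, y = load_digits(return_X_y=True)
X = X / 16.0                                   # scale pixel values into [0, 1]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = Pipeline([
    ("rbm1", BernoulliRBM(n_components=128, learning_rate=0.06, n_iter=20, random_state=0)),
    ("rbm2", BernoulliRBM(n_components=64, learning_rate=0.06, n_iter=20, random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),   # supervised stage on the RBM features
])
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))

Here the two RBM layers play the role of unsupervised, layerwise pre-training described below, while the logistic regression supplies the final supervised classification stage.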
Parameters & k
The variable k is the number of times you run contrastive divergence.
Contrastive divergence is the method used to calculate the gradient (the slope
representing the relationship between a network’s weights and its error),
without which no learning can occur.
Each time contrastive divergence is run, it’s a sample of the Markov Chain
composing the restricted Boltzmann machine. A typical value is 1.
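A bare-bones NumPy sketch of a single contrastive-divergence update with k = 1 (one Gibbs step); biases, mini-batches, and sampling of binary hidden states are omitted for brevity.

# One CD-1 (contrastive divergence, k = 1) weight update for a tiny RBM.
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

n_visible, n_hidden, lr = 6, 3, 0.1
W = rng.normal(scale=0.1, size=(n_visible, n_hidden))
v0 = rng.integers(0, 2, size=n_visible).astype(float)   # one training example

# Positive phase: hidden probabilities given the data.
h0 = sigmoid(v0 @ W)

# One Gibbs step: reconstruct the visible layer, then recompute the hidden probabilities.
v1 = sigmoid(h0 @ W.T)
h1 = sigmoid(v1 @ W)

# Gradient estimate: difference between the positive and negative associations.
W += lr * (np.outer(v0, h0) - np.outer(v1, h1))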
In the linked example, you can see how RBMs can be created as layers within a
more general MultiLayerConfiguration. After each dot in the builder chain you’ll
find an additional parameter that affects the structure and performance of a
deep neural net. Most of those parameters are defined on the Deeplearning4j site.
lossFunction is the way you measure error, or the difference between your
net’s guesses and the correct labels contained in the test set. Here we
use SQUARED_ERROR, which makes all errors positive so they can be summed
and backpropagated.
learningRate, like momentum, affects how much the neural net adjusts the
coefficients on each iteration as it corrects for error. These two parameters
help determine the size of the steps the net takes down the gradient towards a
local optimum. A large learning rate will make the net learn fast, and maybe
overshoot the optimum. A small learning rate will slow down the learning,
which can be inefficient.
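The interaction of these two parameters can be summarized by the standard momentum update rule; the numbers below are placeholders.

# Gradient-descent update with a learning rate and momentum (illustrative values).
learning_rate = 0.01
momentum = 0.9

weight, velocity = 0.5, 0.0
gradient = 0.2                                # slope of the error w.r.t. this weight

# Momentum accumulates past steps; the learning rate scales the current gradient.
velocity = momentum * velocity - learning_rate * gradient
weight = weight + velocity
print(weight)                                 # 0.5 - 0.01 * 0.2 = 0.498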
Continuous RBMs
A continuous restricted Boltzmann machine is a form of RBM that accepts
continuous input (i.e. numbers cut finer than integers) via a different type of
contrastive divergence sampling. This allows the CRBM to handle things like
image pixels or word-count vectors that are normalized to decimals between
zero and one.
It should be noted that every layer of a deep-learning net requires four
elements: the input, the coefficients, a bias and the transform (activation
algorithm).
The input is the numeric data, a vector, fed to it from the previous layer (or as
the original data). The coefficients are the weights given to various features
that pass through each node layer. The bias ensures that some nodes in a
layer will be activated no matter what. The transformation is an additional
algorithm that squashes the data after it passes through each layer in a way
that makes gradients easier to compute (and gradients are necessary for a net
to learn).
Those additional algorithms and their combinations can vary layer by layer.
It should be noted that RBMs do not produce the most stable, consistent
results of all shallow, feedforward networks. In many situations, a dense-
layer autoencoder works better. Indeed, the industry is moving toward tools
such as variational autoencoders and GANs.
Advantages and Disadvantages of RBM
Advantages :
Expressive enough to encode any distribution and computationally efficient.
Faster than a traditional Boltzmann machine because of the restriction on
connections between nodes.
Activations of the hidden layer can be used as input to other models, providing
useful features that improve performance.
Disadvantages :
Training is more difficult because it is hard to compute the gradient of the energy function.
The CD-k algorithm used in RBMs is not as well known as the backpropagation algorithm.
Weight Adjustment