
What is SEMMA?

The SAS Institute developed SEMMA as a process for data mining. Its five steps
(Sample, Explore, Modify, Model, and Assess) give the method its acronym. The
data mining method can be used to solve a wide range of business problems,
including fraud identification, customer retention and turnover, database marketing,
customer loyalty, bankruptcy forecasting, market segmentation, and risk,
affinity, and portfolio analysis.

Why SEMMA?
Data is used by businesses to achieve a competitive advantage, improve
performance, and deliver more useful services to customers. The data we collect
about our surroundings serves as the foundation for hypotheses and models of the
world we live in.

Ultimately, data is accumulated to help build knowledge. That means the data
is not worth much until it is studied and analyzed. Hoarding vast volumes of data
is not equivalent to gathering valuable knowledge; it is only when data is sorted and
evaluated that we learn anything from it.

Thus, SEMMA is designed as a data science methodology to help practitioners
convert data into knowledge.

The 5 Stages Of SEMMA


SAS positions SEMMA as an organized, functional toolset associated with its SAS
Enterprise Miner initiative. While the SEMMA process can seem ambiguous to those
not using that tool, most regard it as a general data mining methodology rather than
a specific tool.

The process breaks down into the following stages (a brief code sketch of the overall flow appears after the list):

 Sample: This step entails choosing a subset of appropriate volume from the vast
dataset provided for the model’s construction. The goal of this initial stage of the
process is to identify the variables or factors (both dependent and independent)
that influence the process. The selected data is then partitioned into sets for model
preparation (training) and validation.
 Explore: During this step, univariate and multivariate analysis is conducted in
order to study interconnected relationships between data elements and to
identify gaps in the data. While the multivariate analysis studies the
relationship between variables, the univariate one looks at each factor
individually to understand its part in the overall scheme. All of the factors that
may influence the study’s outcome are analyzed, with heavy reliance on data
visualization.
 Modify: In this step, the lessons learned during exploration of the data collected
in the sample phase are applied using business logic. In other words, the data is
parsed and cleaned and, where exploration shows that refinement and
transformation are needed, modified before being passed on to the modeling
stage.
 Model: With the variables refined and data cleaned, the modeling step applies
a variety of data mining techniques in order to produce a projected model of
how this data achieves the final, desired outcome of the process.
 Assess: In this final SEMMA stage, the model is evaluated for how useful and
reliable it is for the studied topic. The data can now be tested and used to
estimate the efficacy of its performance.
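
To make the flow concrete, here is a minimal sketch of the five stages in Python using pandas and scikit-learn. The file name customers.csv, the churned column, and the choice of logistic regression are hypothetical placeholders; SEMMA itself does not prescribe any particular tool or model.

# A minimal SEMMA-style sketch (hypothetical data and model; illustrative only).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Sample: draw a manageable subset and partition it for training and validation.
data = pd.read_csv("customers.csv").sample(n=10_000, random_state=0)
train, valid = train_test_split(data, test_size=0.3, random_state=0)

# Explore: univariate and multivariate summaries to spot gaps and relationships.
print(train.describe())
print(train.corr(numeric_only=True))

# Modify: clean and transform the variables flagged during exploration
# (here we assume all feature columns are numeric).
features = [c for c in train.columns if c != "churned"]
scaler = StandardScaler()
X_train = scaler.fit_transform(train[features].fillna(0))
X_valid = scaler.transform(valid[features].fillna(0))

# Model: fit a predictive model on the prepared data.
model = LogisticRegression(max_iter=1000).fit(X_train, train["churned"])

# Assess: evaluate how useful and reliable the model is on held-out data.
print("Validation accuracy:", accuracy_score(valid["churned"], model.predict(X_valid)))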

What is the KDD Process?

The term Knowledge Discovery in Databases, or KDD for short, refers to the broad
process of finding knowledge in data, and emphasizes the "high-level" application
of particular data mining methods. It is of interest to researchers in machine
learning, pattern recognition, databases, statistics, artificial intelligence, knowledge
acquisition for expert systems, and data visualization.

The unifying goal of the KDD process is to extract knowledge from data in the
context of large databases.

It does this by using data mining methods (algorithms) to extract (identify) what is
deemed knowledge, according to the specifications of measures and thresholds,
using a database along with any required preprocessing, subsampling, and
transformations of that database.

An Outline of the Steps of the KDD Process


The overall process of finding and interpreting patterns from data involves the
repeated application of the following steps (a minimal code sketch follows the list):

1. Developing an understanding of
o the application domain
o the relevant prior knowledge
o the goals of the end-user
2. Creating a target data set: selecting a data set, or focusing on a subset of
variables, or data samples, on which discovery is to be performed.
3. Data cleaning and preprocessing.
o Removal of noise or outliers.
o Collecting necessary information to model or account for noise.
o Strategies for handling missing data fields.
o Accounting for time sequence information and known changes.
4. Data reduction and projection.
o Finding useful features to represent the data depending on the goal of
the task.
o Using dimensionality reduction or transformation methods to reduce
the effective number of variables under consideration or to find
invariant representations for the data.
5. Choosing the data mining task.
o Deciding whether the goal of the KDD process is classification,
regression, clustering, etc.
6. Choosing the data mining algorithm(s).
o Selecting method(s) to be used for searching for patterns in the data.
o Deciding which models and parameters may be appropriate.
o Matching a particular data mining method with the overall criteria of
the KDD process.
7. Data mining.
o Searching for patterns of interest in a particular representational form
or a set of such representations as classification rules or trees,
regression, clustering, and so forth.
8. Interpreting mined patterns.
9. Consolidating discovered knowledge.
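
As a rough illustration only, the sketch below walks a small synthetic dataset through these steps; the choices of PCA for reduction and k-means for the mining step are assumptions made for the example, not requirements of KDD.

# A compact, illustrative pass through the KDD steps on a hypothetical numeric dataset.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
target_data = rng.normal(size=(500, 10))                      # 2. create a target data set

target_data[rng.random(target_data.shape) < 0.01] = np.nan    # pretend a few fields are missing
clean = np.nan_to_num(target_data, nan=0.0)                   # 3. cleaning / handling missing fields
clean = StandardScaler().fit_transform(clean)

reduced = PCA(n_components=3).fit_transform(clean)            # 4. data reduction and projection

# 5-7. task = clustering, algorithm = k-means, then mine the patterns.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(reduced)

# 8-9. interpret and consolidate: inspect cluster sizes as a first summary of the patterns.
print(np.bincount(labels))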

The terms knowledge discovery and data mining are distinct.

KDD refers to the overall process of discovering useful knowledge from data. It
involves the evaluation and possibly interpretation of the patterns to make the
decision of what qualifies as knowledge. It also includes the choice of encoding
schemes, preprocessing, sampling, and projections of the data prior to the data
mining step.
Data mining refers to the application of algorithms for extracting patterns from
data without the additional steps of the KDD process.

Definitions Related to the KDD Process

Knowledge discovery in databases: the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
Data: a set of facts, F.
Pattern: an expression E in a language L describing facts in a subset FE of F.
Process: KDD is a multi-step process involving data preparation, pattern searching, knowledge evaluation, and refinement with iteration after modification.
Valid: discovered patterns should be true on new data with some degree of certainty; they should generalize to the future (other data).
Novel: patterns must be novel (not previously known).
Useful: patterns should be actionable, potentially leading to some useful actions.
Understandable: the process should lead to human insight; patterns must be made understandable in order to facilitate a better understanding of the underlying data.

A Brief History of Machine Learning

Machine learning (ML) is an important tool for the goal of leveraging technologies
around artificial intelligence. Because of its learning and decision-making abilities,
machine learning is often referred to as AI, though, in reality, it is a subdivision of AI.
Until the late 1970s, it was a part of AI’s evolution. Then, it branched off to evolve on
its own. Machine learning has become a very important response tool for cloud
computing and eCommerce, and is being used in a variety of cutting edge
technologies.

Machine learning is a necessary aspect of modern business and research for many
organizations today. It uses algorithms and neural network models to assist
computer systems in progressively improving their performance. Machine learning
algorithms automatically build a mathematical model using sample data – also
known as “training data” – to make decisions without being specifically programmed
to make those decisions.

The Evolution of Machine Learning

Machine Learning is the branch of computer science that deals with the development of
computer programs that teach and grow themselves. According to Arthur Samuel, an American
pioneer in computer gaming, Machine Learning is the subfield of computer science that “gives
the computer the ability to learn without being explicitly programmed.” Machine Learning
allows developers to build algorithms that automatically improve themselves by finding patterns
in the existing data without explicit instructions from a human or developer. Machine Learning
relies entirely on data; the more data there is, the more effective it becomes.

The Machine Learning development approach includes learning from data inputs and evaluating
and optimizing the model results. Machine Learning is widely used in data analytics as a method
to develop algorithms for making predictions on data. Machine Learning is related to probability,
statistics, and linear algebra.

Machine Learning is broadly classified into three categories depending on the nature of the
learning ‘signal’ or ‘feedback’ available to a learning system.

1. Supervised learning: The computer is presented with inputs and their desired outputs. The goal is to
learn a general rule that maps inputs to outputs.
2. Unsupervised learning: The computer is presented with inputs without desired outputs; the goal is to
find structure in the inputs.
3. Reinforcement learning: A computer program interacts with a dynamic environment and must
achieve a certain goal without a guide or teacher.
 

Machine Learning takes advantage of the ability of computer systems to learn from correlations
hidden in the data; this ability can be further utilized by programming or developing intelligent
and efficient Machine Learning algorithms. While Machine Learning may seem new, it has been
around long before people observed it as popular technology. It has evolved to solve real
problems of human life and automate the processes used in various industries such as banking,
healthcare, telecom, retail, and so on. The software or application or solution developed using
Machine Learning can learn from its dynamic environment and adapt to changing requirements.

In contrast to traditional software implementation, the lessons learned from Machine Learning
algorithms can be scaled and transferred across multiple applications. Machine Learning
naturally considers a large number of variables that influence the results or observations, which
can be used in both science and business. Because of all these features and advantages,
today’s software is developed for automated decision making and more innovative solutions,
which makes an investment in Machine Learning a natural evolution of technology.

Evolution over the years


Machine Learning technology has been in existence since 1952. It has evolved drastically over
the last decade and saw several transition periods in the mid-90s. The data-driven approach to
Machine Learning came into existence during the 1990s. From 1995-2005, there was a lot of
focus on natural language, search, and information retrieval. In those days, Machine Learning
tools were more straightforward than the tools being used currently. Neural networks, which were
popular in the 80s, are a subset of Machine Learning: computer systems modeled on the human
brain and nervous system. Neural networks started making a comeback around 2005 and have
become one of the trending technologies of the current decade. According to Gartner’s 2016
Hype Cycle for Emerging Technologies, Machine Learning is among the technologies at the peak
of inflated expectations and is expected to reach mainstream adoption in the next 2–5 years.
Technological capabilities such as infrastructure and technical skills also must advance to keep
up with the growth of Machine Learning.

Machine Learning has been one of the most active and rewarding areas of research due to its
widespread use in many areas. It has brought a monumental shift in technology and its
applications. Some of the evolutions, which made a huge positive impact on real-world problem
solving, are highlighted in the following sections.

Supervised and Unsupervised learning


 
Supervised learning
Supervised learning, as the name indicates, involves the presence of a
supervisor acting as a teacher. Basically, supervised learning is when we teach or
train the machine using data that is well labelled, which means some of the data is
already tagged with the correct answer. After that, the machine is provided
with a new set of examples (data) so that the supervised learning algorithm
analyses the training data (the set of training examples) and produces a correct
outcome from the labelled data.
 

For instance, suppose you are given a basket filled with different kinds of
fruits. The first step is to train the machine on each of the different fruits, one
by one:
 If the shape of the object is rounded, has a depression at the top, and is
red in color, then it is labelled as an Apple.
 If the shape of the object is a long curving cylinder with a green-yellow
color, then it is labelled as a Banana.
Now suppose that, after training, the machine is given a new fruit from the
basket (say a banana) and asked to identify it.
 

Since the machine has already learned from the previous data, it now has to
apply that knowledge. It first classifies the fruit by its shape and color,
confirms the fruit name as BANANA, and puts it in the Banana category. Thus
the machine learns from the training data (the basket of fruits) and then
applies that knowledge to the test data (the new fruit).
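
The fruit example can be written as a tiny supervised classifier. The numeric encoding of shape and color below is invented purely for illustration; any labelled features would do.

# Toy supervised learning: features are [roundness, redness], labels are fruit names.
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: rounded/red objects are apples, long/yellow ones are bananas.
X_train = [[0.9, 0.8], [0.95, 0.9], [0.1, 0.2], [0.2, 0.1]]
y_train = ["Apple", "Apple", "Banana", "Banana"]

model = DecisionTreeClassifier().fit(X_train, y_train)

# A new, unseen fruit from the basket: long and not red, so it should be classified as a banana.
print(model.predict([[0.15, 0.25]]))   # -> ['Banana']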
Supervised learning is classified into two categories of algorithms: 
 Classification: A classification problem is when the output variable is a
category, such as “red” or “blue”, or “disease” or “no disease”.
 Regression: A regression problem is when the output variable is a real
value, such as “dollars” or “weight”.
Supervised learning learns from “labeled” data, meaning some of the data is
already tagged with the correct answer.
Types:
 Regression
 Logistic Regression
 Classification
 Naive Bayes Classifiers
 K-NN (k nearest neighbors)
 Decision Trees
 Support Vector Machine
Advantages:
 Supervised learning allows collecting data and produces data output from
previous experiences.
 Helps to optimize performance criteria with the help of experience.
 Supervised machine learning helps to solve various types of real-world
computation problems.
Disadvantages:
 Classifying big data can be challenging.
 Training a supervised model requires a lot of computation time.

Unsupervised learning
Unsupervised learning is the training of a machine using information that is
neither classified nor labeled, allowing the algorithm to act on that
information without guidance. Here the task of the machine is to group
unsorted information according to similarities, patterns, and differences
without any prior training on the data.
Unlike supervised learning, no teacher is provided, which means no training
will be given to the machine. The machine is therefore left to find the hidden
structure in the unlabeled data by itself.
For instance, suppose the machine is given an image containing both dogs
and cats that it has never seen before.
 
The machine has no idea about the features of dogs and cats, so it can’t
categorize the image as ‘dogs and cats’. But it can group the pictures
according to their similarities, patterns, and differences: it can easily split the
collection into two parts, where the first part contains all the pictures with
dogs in them and the second part contains all the pictures with cats in them.
The machine has learned nothing beforehand, which means there is no
training data or examples.

Unsupervised learning allows the model to work on its own to discover
patterns and information that were previously undetected. It mainly deals
with unlabelled data.
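
As a small sketch of this idea, the snippet below clusters unlabeled points into two groups with k-means; the two synthetic blobs stand in for the unlabeled dog and cat pictures in the example above.

# Unsupervised learning sketch: group unlabeled points by similarity, with no labels given.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Two unlabeled blobs standing in for the "dog" and "cat" images.
X, _ = make_blobs(n_samples=200, centers=2, random_state=0)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels[:10])   # cluster ids (0 or 1) assigned purely from similarity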
Unsupervised learning is classified into two categories of algorithms: 
 Clustering: A clustering problem is where you want to discover the
inherent groupings in the data, such as grouping customers by
purchasing behavior.
 Association: An association rule learning problem is where you want to
discover rules that describe large portions of your data, such as people
that buy X also tend to buy Y.
Types of Unsupervised Learning:
Clustering approaches
1. Exclusive (partitioning)
2. Agglomerative
3. Overlapping
4. Probabilistic
Common algorithms and techniques
1. Hierarchical clustering
2. K-means clustering
3. Principal Component Analysis
4. Singular Value Decomposition
5. Independent Component Analysis
Supervised vs. Unsupervised Machine Learning
For each parameter below, the first entry describes supervised machine learning and the second describes unsupervised machine learning.

Input Data: Algorithms are trained using labeled data. / Algorithms are used on data that is not labeled.
Computational Complexity: Simpler method. / Computationally complex.
Accuracy: Highly accurate. / Less accurate.
No. of classes: The number of classes is known. / The number of classes is not known.
Data Analysis: Uses offline analysis. / Uses real-time analysis of data.
Algorithms used: Linear and Logistic Regression, Random Forest, Support Vector Machine, Neural Network, etc. / K-Means clustering, Hierarchical clustering, Apriori algorithm, etc.
Artificial Neural Networks for Machine Learning – Every aspect you need to know about
Artificial Neural Networks are among the most popular machine learning
algorithms today. The ideas behind neural networks date back to the 1940s and
1950s, but they have achieved huge popularity only recently, due to the increase in
computation power, and they are now virtually everywhere. In many of the
applications you use, Neural Networks power the intelligent interface that keeps
you engaged.

What is ANN?
Artificial Neural Networks are a special type of machine learning algorithm
modeled after the human brain. Just as the neurons in our nervous system are
able to learn from past data, an ANN is able to learn from data and provide
responses in the form of predictions or classifications.
ANNs are nonlinear statistical models that capture complex relationships
between inputs and outputs in order to discover new patterns. A variety of tasks,
such as image recognition, speech recognition, machine translation, and medical
diagnosis, make use of artificial neural networks.

An important advantage of ANNs is that they learn from example data sets.
The most common usage of an ANN is as a random function approximator.
With these types of tools, one can arrive cost-effectively at solutions that
define the distribution. An ANN is also capable of taking sample data rather
than the entire dataset to provide the output result. With ANNs, one can
enhance existing data analysis techniques owing to their advanced predictive
capabilities.

Artificial Neural Networks


Architecture
The functioning of Artificial Neural Networks is similar to the way neurons
work in our nervous system. Neural networks date back to 1943, when Warren
S. McCulloch and Walter Pitts proposed the first mathematical model of an
artificial neuron. In order to understand the workings of ANNs, let us first
understand how they are structured. In a neural network, there are three
essential layers –
Input Layers
The input layer is the first layer of an ANN that receives the input
information in the form of various texts, numbers, audio files, image pixels,
etc.
Hidden Layers
In the middle of the ANN model are the hidden layers. There can be a single
hidden layer or multiple hidden layers. These hidden layers perform various
types of mathematical computation on the input data and recognize the
patterns it contains.
Output Layer
The output layer delivers the final result obtained through the computations
performed by the hidden layers.
In a neural network, there are multiple parameters and hyperparameters
that affect the performance of the model. The output of an ANN depends
heavily on these parameters. Some of these parameters are the weights,
biases, learning rate, batch size, etc.

Each node in the network has some weights assigned to it. A transfer
function is used for calculating the weighted sum of the inputs and the bias.
After the transfer function has calculated the sum, the activation function
obtains the result. Based on the output received, the activation functions
fire the appropriate result from the node. For example, if the output
received is above 0.5, the activation function fires a 1 otherwise it remains
0.
Some of the popular activation functions used in Artificial Neural Networks
are Sigmoid, RELU, Softmax, tanh etc.
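
To make the node computation concrete, here is a minimal sketch of a single node: a weighted sum of the inputs plus a bias, passed through a few of the activation functions named above. The numbers are arbitrary.

# One artificial neuron: weighted sum of inputs plus bias, passed through an activation.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

x = np.array([0.5, -1.2, 3.0])     # inputs to the node
w = np.array([0.4, 0.1, -0.6])     # weights assigned to the node
b = 0.2                            # bias

z = np.dot(w, x) + b               # transfer function: weighted sum plus bias
a = sigmoid(z)                     # activation function decides what the node fires
print(z, a, relu(z), np.tanh(z))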

Based on the value that the node has fired, we obtain the final output. Then,
using an error function, we calculate the discrepancy between the predicted
output and the actual output, and adjust the weights of the neural network
through a process known as backpropagation.
Back Propagation in Artificial Neural
Networks
In order to train a neural network, we provide it with examples of input-
output mappings. When the neural network completes training, we test it on
inputs for which we do not provide these mappings. The neural network
predicts the output, and we evaluate how correct the output is using various
error functions. Based on the result, the model adjusts the weights of the
network to minimize the error, following gradient descent via the chain rule.
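
The sketch below shows this idea in miniature for a single sigmoid neuron with a squared-error loss and invented toy data; real backpropagation applies the same chain-rule step layer by layer.

# Minimal backpropagation-style update for one sigmoid neuron and a squared-error loss.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical input-output mappings (an AND gate).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1], dtype=float)

w = np.zeros(2)
b = 0.0
lr = 1.0                                    # learning rate

for _ in range(10000):
    pred = sigmoid(X @ w + b)
    err = pred - y                          # discrepancy between prediction and target
    grad_z = err * pred * (1 - pred)        # chain rule through the loss and the sigmoid
    w -= lr * X.T @ grad_z / len(X)         # gradient descent on the weights
    b -= lr * grad_z.mean()                 # ... and on the bias

print(np.round(sigmoid(X @ w + b), 2))      # moves toward [0, 0, 0, 1] as training proceeds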

Types of Artificial Neural Networks


There are two important types of Artificial Neural Networks –

 FeedForward Neural Network


 FeedBack Neural Network

FeedForward Artificial Neural Networks


In feedforward ANNs, the flow of information takes place in only one
direction: from the input layer to the hidden layers and finally to the output.
There are no feedback loops present in this kind of neural network. These
types of neural networks are mostly used in supervised learning, for instance
for classification and image recognition. We use them in cases where the data
is not sequential in nature.
Feedback Artificial Neural Networks
In feedback ANNs, feedback loops are part of the network. Such neural
networks are mainly used where memory retention matters, as in the case of
recurrent neural networks. These types of networks are best suited for areas
where the data is sequential or time-dependent.

Restricted Boltzmann Machine and Its Application

Definition & Structure


Originally invented under the name “Harmonium” by Paul Smolensky and later
popularized by Geoffrey Hinton, a Restricted Boltzmann machine is an algorithm
useful for dimensionality reduction, classification, regression, collaborative
filtering, feature learning, and topic modeling.

Given their relative simplicity and historical importance, restricted Boltzmann
machines are the first neural network we’ll tackle. In the paragraphs below, we
describe in diagrams and plain language how they work.
RBMs are shallow, two-layer neural nets that constitute the building blocks
of deep-belief networks. The first layer of the RBM is called the visible, or
input, layer, and the second is the hidden layer. (Editor’s note: While RBMs
are occasionally used, most practitioners in the machine-learning community
have deprecated them in favor of generative adversarial networks or
variational autoencoders. RBMs are the Model T’s of neural networks –
interesting for historical reasons, but surpassed by more up-to-date models.)

Each circle in the graph above represents a neuron-like unit called a node,
and nodes are simply where calculations take place. The nodes are
connected to each other across layers, but no two nodes of the same layer
are linked.

That is, there is no intra-layer communication – this is the restriction in a
restricted Boltzmann machine. Each node is a locus of computation that
processes input, and begins by making stochastic decisions about whether to
transmit that input or not. (Stochastic means “randomly determined”, and in
this case, the coefficients that modify inputs are randomly initialized.)
Each visible node takes a low-level feature from an item in the dataset to be
learned. For example, from a dataset of grayscale images, each visible node
would receive one pixel-value for each pixel in one image. (MNIST images
have 784 pixels, so neural nets processing them must have 784 input nodes
on the visible layer.)

Now let’s follow that single pixel value, x, through the two-layer net. At node 1
of the hidden layer, x is multiplied by a weight and added to a so-called bias.
The result of those two operations is fed into an activation function, which
produces the node’s output, or the strength of the signal passing through it,
given input x.

activation f((weight w * input x) + bias b ) = output a

Next, let’s look at how several inputs would combine at one hidden node.
Each x is multiplied by a separate weight, the products are summed, added to
a bias, and again the result is passed through an activation function to
produce the node’s output.
Because inputs from all visible nodes are being passed to all hidden nodes,
an RBM can be defined as a symmetrical bipartite graph.

Symmetrical means that each visible node is connected with each hidden
node (see below). Bipartite means it has two parts, or layers, and the graph is
a mathematical term for a web of nodes.

At each hidden node, each input x is multiplied by its respective weight w.
That is, a single input x would have three weights here, making 12 weights
altogether (4 input nodes x 3 hidden nodes). The weights between two layers
will always form a matrix where the rows are equal to the input nodes, and the
columns are equal to the output nodes.

Each hidden node receives the four inputs multiplied by their respective
weights. The sum of those products is again added to a bias (which forces at
least some activations to happen), and the result is passed through the
activation algorithm producing one output for each hidden node.
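
A minimal sketch of this forward pass, assuming 4 visible nodes, 3 hidden nodes, randomly initialized weights, and a sigmoid activation (the activation choice is illustrative):

# Forward pass of a tiny RBM-style layer: 4 visible nodes feeding 3 hidden nodes.
import numpy as np

rng = np.random.default_rng(0)

v = rng.random(4)                          # one input example: 4 visible-node values
W = rng.normal(scale=0.1, size=(4, 3))     # 12 weights: rows = visible nodes, columns = hidden nodes
hidden_bias = np.zeros(3)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden_activations = sigmoid(v @ W + hidden_bias)   # weighted sums plus bias, then activation
print(hidden_activations)                  # one output per hidden node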
If these two layers were part of a deeper neural network, the outputs of hidden
layer no. 1 would be passed as inputs to hidden layer no. 2, and from there
through as many hidden layers as you like until they reach a final classifying
layer. (For simple feed-forward movements, the RBM nodes function as
an autoencoder and nothing more.)



Reconstructions
But in this introduction to restricted Boltzmann machines, we’ll focus on how
they learn to reconstruct data by themselves in an unsupervised fashion
(unsupervised means without ground-truth labels in a test set), making several
forward and backward passes between the visible layer and hidden layer no. 1
without involving a deeper network.

In the reconstruction phase, the activations of hidden layer no. 1 become the
input in a backward pass. They are multiplied by the same weights, one per
internode edge, just as x was weight-adjusted on the forward pass. The sum
of those products is added to a visible-layer bias at each visible node, and the
output of those operations is a reconstruction; i.e. an approximation of the
original input. This can be represented by the following diagram:

Because the weights of the RBM are randomly initialized, the difference
between the reconstructions and the original input is often large. You can think
of reconstruction error as the difference between the values of r and the input
values, and that error is then backpropagated against the RBM’s weights,
again and again, in an iterative learning process until an error minimum is
reached.
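
Continuing the earlier sketch, the backward pass below reuses the same (transposed) weight matrix plus a visible-layer bias to produce a reconstruction; the reconstruction error shown is simply the squared difference from the original input.

# Backward (reconstruction) pass: hidden activations are pushed back through the same weights.
import numpy as np

rng = np.random.default_rng(0)
v = rng.random(4)                          # original input (as in the forward-pass sketch)
W = rng.normal(scale=0.1, size=(4, 3))
hidden_bias = np.zeros(3)
visible_bias = np.zeros(4)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

h = sigmoid(v @ W + hidden_bias)                    # forward pass
reconstruction = sigmoid(h @ W.T + visible_bias)    # same weights, one per inter-node edge

reconstruction_error = np.sum((reconstruction - v) ** 2)
print(reconstruction, reconstruction_error)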

As you can see, on its forward pass, an RBM uses inputs to make predictions
about node activations, or the probability of output given a weighted x: p(a|x;
w).

But on its backward pass, when activations are fed in and reconstructions, or
guesses about the original data, are spit out, an RBM is attempting to estimate
the probability of inputs x given activations a, which are weighted with the
same coefficients as those used on the forward pass. This second phase can
be expressed as p(x|a; w).

Together, those two estimates will lead you to the joint probability distribution
of inputs x and activations a, or p(x, a).

Reconstruction does something different from regression, which estimates a
continuous value based on many inputs, and different from classification, which
makes guesses about which discrete label to apply to a given input example.

Reconstruction is making guesses about the probability distribution of the
original input, i.e. the values of many varied points at once. This is known
as generative learning, which must be distinguished from the so-called
discriminative learning performed by classification, which maps inputs to
labels, effectively drawing lines between groups of data points.

Let’s imagine that both the input data and the reconstructions are normal
curves of different shapes, which only partially overlap.

To measure the distance between its estimated probability distribution and the
ground-truth distribution of the input, RBMs use Kullback-Leibler divergence.

KL divergence measures the non-overlapping, or diverging, areas under the
two curves, and an RBM’s optimization algorithm attempts to minimize those
areas so that the shared weights, when multiplied by activations of hidden
layer one, produce a close approximation of the original input. On the left is
the probability distribution of a set of original input, p, juxtaposed with the
reconstructed distribution q; on the right, the integration of their differences.
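
For intuition, here is a tiny sketch of KL divergence between two made-up discrete distributions p and q:

# Kullback-Leibler divergence between two made-up discrete distributions p and q.
import numpy as np

p = np.array([0.1, 0.4, 0.5])        # "ground truth" distribution of the input
q = np.array([0.3, 0.4, 0.3])        # distribution implied by the reconstructions

kl = np.sum(p * np.log(p / q))       # KL(p || q); zero only when the two curves fully overlap
print(kl)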

By iteratively adjusting the weights according to the error they produce, an
RBM learns to approximate the original data. You could say that the weights
slowly come to reflect the structure of the input, which is encoded in the
activations of the first hidden layer. The learning process looks like two
probability distributions converging, step by step.

Some important features of Boltzmann Machines:

 They use a recurrent and symmetric structure.
 RBMs, in their learning process, try to associate high probability with low
energy states, and vice versa.
 There are no intra-layer connections.
 It is an unsupervised learning algorithm, i.e., it makes inferences from input data
without labeled responses.

Probability Distributions
Let’s talk about probability distributions for a moment. If you’re rolling two dice,
the probability distribution for all outcomes looks like this:

That is, 7s are the most likely because there are more ways to get to 7 (3+4,
1+6, 2+5) than there are ways to arrive at any other sum between 2 and 12.
Any formula attempting to predict the outcome of dice rolls needs to take
seven’s greater frequency into account.
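
The same distribution can be tabulated directly by counting the combinations for each sum:

# Probability distribution of the sum of two dice: 7 has the most combinations.
from collections import Counter

counts = Counter(d1 + d2 for d1 in range(1, 7) for d2 in range(1, 7))
for total in range(2, 13):
    print(total, counts[total] / 36)   # 7 -> 6/36, the most likely outcome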

Or take another example: Languages are specific in the probability distribution
of their letters, because each language uses certain letters more than others.
In English, the letters e, t and a are the most common, while in Icelandic, the
most common letters are a, r and n. Attempting to reconstruct Icelandic with a
weight set based on English would lead to a large divergence.

In the same way, image datasets have unique probability distributions for their
pixel values, depending on the kind of images in the set. Pixel values are
distributed differently depending on whether the dataset includes MNIST’s
handwritten numerals or the headshots found in Labeled Faces in the Wild:

Imagine for a second an RBM that was only fed images of elephants and
dogs, and which had only two output nodes, one for each animal. The
question the RBM is asking itself on the forward pass is: Given these pixels,
should my weights send a stronger signal to the elephant node or the dog
node? And the question the RBM asks on the backward pass is: Given an
elephant, which distribution of pixels should I expect?

That’s joint probability: the simultaneous probability of x given a and
of a given x, expressed as the shared weights between the two layers of the
RBM.

The process of learning reconstructions is, in a sense, learning which groups
of pixels tend to co-occur for a given set of images. The activations produced
by nodes of hidden layers deep in the network represent significant co-
occurrences; e.g. “nonlinear gray tube + big, floppy ears + wrinkles” might be
one.

In the two images above, you see reconstructions learned by Deeplearning4j’s
implementation of an RBM. These reconstructions represent what the RBM’s
activations “think” the original data looks like. Geoff Hinton refers to this as a
sort of machine “dreaming”. When rendered during neural net training, such
visualizations are extremely useful heuristics to reassure oneself that the RBM
is actually learning. If it is not, then its hyperparameters, discussed below,
should be adjusted.

One last point: You’ll notice that RBMs have two biases. This is one aspect
that distinguishes them from other autoencoders. The hidden bias helps the
RBM produce the activations on the forward pass (since biases impose a floor
so that at least some nodes fire no matter how sparse the data), while
the visible layer’s biases help the RBM learn the reconstructions on the
backward pass.

Multiple Layers
Once this RBM learns the structure of the input data as it relates to the
activations of the first hidden layer, then the data is passed one layer down
the net. Your first hidden layer takes on the role of visible layer. The
activations now effectively become your input, and they are multiplied by
weights at the nodes of the second hidden layer, to produce another set of
activations.

This process of creating sequential sets of activations by grouping features
and then grouping groups of features is the basis of a feature hierarchy, by
which neural networks learn more complex and abstract representations of
data.

With each new hidden layer, the weights are adjusted until that layer is able to
approximate the input from the previous layer. This is greedy, layerwise and
unsupervised pre-training. It requires no labels to improve the weights of the
network, which means you can train on unlabeled data, untouched by human
hands, which is the vast majority of data in the world. As a rule, algorithms
exposed to more data produce more accurate results, and this is one of the
reasons why deep-learning algorithms are kicking butt.

Because those weights already approximate the features of the data, they are
well positioned to learn better when, in a second step, you try to classify
images with the deep-belief network in a subsequent supervised learning
stage.

While RBMs have many uses, proper initialization of weights to facilitate later
learning and classification is one of their chief advantages. In a sense, they
accomplish something similar to backpropagation: they push weights to model
data well. You could say that pre-training and backprop are substitutable
means to the same end.

To synthesize restricted Boltzmann machines in one diagram, here is a
symmetrical, bipartite, and bidirectional graph:

For those interested in studying the structure of RBMs in greater depth, they
are one type of undirected graphical model, also called a Markov random field.
Code Sample: Stacked RBMs
https://github.com/deeplearning4j/dl4j-examples/blob/master/dl4j-examples/src/main/java/org/deeplearning4j/examples/unsupervised/deepbelief/DeepAutoEncoderExample.java

Parameters & k
The variable k is the number of times you run contrastive divergence.
Contrastive divergence is the method used to calculate the gradient (the slope
representing the relationship between a network’s weights and its error),
without which no learning can occur.

Each time contrastive divergence is run, it’s a sample of the Markov Chain
composing the restricted Boltzmann machine. A typical value is 1.
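
For intuition only, here is a compact CD-k sketch for a tiny binary RBM in Python. It simplifies many details (mini-batches, momentum, persistent chains) and is not the Deeplearning4j implementation linked above; the function and variable names are invented for the example.

# CD-k sketch for a small binary RBM: k Gibbs steps approximate the weight gradient.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd_k(v0, W, b_hid, b_vis, k=1, lr=0.1):
    """One contrastive-divergence update from a single binary visible vector v0."""
    h0_prob = sigmoid(v0 @ W + b_hid)                  # hidden probabilities for the data
    pos_grad = np.outer(v0, h0_prob)                   # positive phase statistics
    v, h_prob = v0, h0_prob
    for _ in range(k):                                 # k steps of the Markov chain
        h = (rng.random(h_prob.shape) < h_prob).astype(float)
        v = (rng.random(b_vis.shape) < sigmoid(h @ W.T + b_vis)).astype(float)
        h_prob = sigmoid(v @ W + b_hid)
    neg_grad = np.outer(v, h_prob)                     # negative phase statistics
    W += lr * (pos_grad - neg_grad)                    # nudge weights toward the data distribution
    b_vis += lr * (v0 - v)
    b_hid += lr * (h0_prob - h_prob)
    return W, b_hid, b_vis

W = rng.normal(scale=0.01, size=(4, 3))
b_hid, b_vis = np.zeros(3), np.zeros(4)
v0 = np.array([1.0, 0.0, 1.0, 0.0])
W, b_hid, b_vis = cd_k(v0, W, b_hid, b_vis, k=1)
print(W)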

In the linked DL4J example, you can see how RBMs can be created as layers within a
more general MultiLayerConfiguration. After each dot you’ll find an additional
parameter that affects the structure and performance of a deep neural net.
Most of those parameters are described below.

weightInit, or weightInitialization, represents the starting value of the
coefficients that amplify or mute the input signal coming into each node.
Proper weight initialization can save you a lot of training time, because training
a net is nothing more than adjusting the coefficients to transmit the best
signals, which allow the net to classify accurately.

activationFunction refers to one of a set of functions that determine the
threshold(s) at each node above which a signal is passed through the node,
and below which it is blocked. If a node passes the signal through, it is
“activated.”

optimizationAlgo refers to the manner by which a neural net minimizes error,
or finds a locus of least error, as it adjusts its coefficients step by step.
LBFGS (limited-memory Broyden–Fletcher–Goldfarb–Shanno, named after its
inventors) is an optimization algorithm that makes use of second-order
derivatives to calculate the slope of the gradient along which coefficients are
adjusted.

regularization methods such as l2 help fight overfitting in neural nets.
Regularization essentially punishes large coefficients, since large coefficients
by definition mean the net has learned to pin its results to a few heavily
weighted inputs. Overly strong weights can make it difficult to generalize a
net’s model when exposed to new data.

VisibleUnit/HiddenUnit refers to the layers of a neural net. The VisibleUnit,
or layer, is the layer of nodes where input goes in, and the HiddenUnit is the
layer where those inputs are recombined in more complex features. Both units
have their own so-called transforms, in this case Gaussian for the visible and
Rectified Linear for the hidden, which map the signal coming out of their
respective layers onto a new space.

lossFunction is the way you measure error, or the difference between your
net’s guesses and the correct labels contained in the test set. Here we
use SQUARED_ERROR, which makes all errors positive so they can be summed
and backpropagated.

learningRate, like momentum, affects how much the neural net adjusts the
coefficients on each iteration as it corrects for error. These two parameters
help determine the size of the steps the net takes down the gradient towards a
local optimum. A large learning rate will make the net learn fast, and maybe
overshoot the optimum. A small learning rate will slow down the learning,
which can be inefficient.

Continuous RBMs
A continuous restricted Boltzmann machine is a form of RBM that accepts
continuous input (i.e. numbers cut finer than integers) via a different type of
contrastive divergence sampling. This allows the CRBM to handle things like
image pixels or word-count vectors that are normalized to decimals between
zero and one.
It should be noted that every layer of a deep-learning net requires four
elements: the input, the coefficients, a bias and the transform (activation
algorithm).

The input is the numeric data, a vector, fed to it from the previous layer (or as
the original data). The coefficients are the weights given to various features
that pass through each node layer. The bias ensures that some nodes in a
layer will be activated no matter what. The transformation is an additional
algorithm that squashes the data after it passes through each layer in a way
that makes gradients easier to compute (and gradients are necessary for a net
to learn).

Those additional algorithms and their combinations can vary layer by layer.

An effective continuous restricted Boltzmann machine employs a Gaussian
transformation on the visible (or input) layer and a rectified-linear-unit
transformation on the hidden layer. That’s particularly useful in facial
reconstruction. For RBMs handling binary data, simply make both
transformations binary ones.

Gaussian transformations do not work well on RBMs’ hidden layers. The
rectified-linear-unit transformations used instead are capable of representing
more features than binary transformations, which we employ on deep-belief
nets.

Conclusions & Next Steps


You can interpret RBMs’ output numbers as percentages. Every time the
number in the reconstruction is not zero, that’s a good indication the RBM
learned the input.

It should be noted that RBMs do not produce the most stable, consistent
results of all shallow, feedforward networks. In many situations, a dense-layer
autoencoder works better. Indeed, the industry is moving toward tools
such as variational autoencoders and GANs.
Advantages and Disadvantages of RBM
Advantages:
 Expressive enough to encode any distribution, while being computationally efficient.
 Faster than a traditional Boltzmann Machine due to the restriction on connections
between nodes.
 Activations of the hidden layer can be used as input to other models, as useful
features to improve performance.
Disadvantages:
 Training is more difficult because it is hard to calculate the energy gradient function.
 The CD-k algorithm used in RBMs is not as familiar as the backpropagation algorithm.
 Weight Adjustment
