Deep Learning with H2O

Arno Candel
Viraj Parmar
Erin LeDell
Anisha Arora
Jessica Lanford

http://h2o.gitbooks.io/deep-learning/
August 2015: Third Edition
Contents

1 Introduction
2 What is H2O?
3 Installation
   3.1 Installation in R
   3.2 Installation in Python
   3.3 Pointing to a different H2O cluster
   3.4 Example code
4 Deep Learning Overview
5 H2O's Deep Learning Architecture
   5.1 Summary of features
   5.2 Training protocol
   5.3 Regularization
   5.4 Advanced optimization
   5.5 Loading data
   5.6 Additional parameters
6 Use case: MNIST digit classification
   6.1 MNIST overview
   6.2 Performing a trial run
   6.3 Web interface
   6.4 Grid search for model comparison
   6.5 Checkpoint model
   6.6 Achieving world-record performance
   6.7 Computational performance
7 Deep Autoencoders
   7.1 Nonlinear dimensionality reduction
   7.2 Use case: anomaly detection
Appendix A: Complete Parameter List
Appendix D: References
1 Introduction
This document introduces the reader to Deep Learning with H2O. Examples are written in R and Python. The reader is walked through installing H2O, basic deep learning concepts, building deep neural networks in H2O, interpreting model output, making predictions, and various implementation details.
2 What is H2O?
H2O is fast, scalable, open-source machine learning and deep learning for Smarter Applications. With H2O,
enterprises like PayPal, Nielsen, Cisco, and others can use all their data without sampling to get accurate
predictions faster. Advanced algorithms, like Deep Learning, Boosting, and Bagging Ensembles are built-in
to help application designers create smarter applications through elegant APIs. Some of our initial customers
have built powerful domain-specific predictive engines for Recommendations, Customer Churn, Propensity
to Buy, Dynamic Pricing, and Fraud Detection for the Insurance, Healthcare, Telecommunications, AdTech,
Retail, and Payment Systems industries.
Using in-memory compression, H2O handles billions of data rows in-memory, even with a small cluster. To make it easier for non-engineers to create complete analytic workflows, H2O's platform includes interfaces for R, Python, Scala, Java, JSON, and CoffeeScript/JavaScript, as well as a built-in web interface, Flow. H2O was built alongside (and on top of) Hadoop and Spark clusters and typically deploys within minutes.
H2O includes many common machine learning algorithms, such as generalized linear modeling (linear regression, logistic regression, etc.), Naïve Bayes, principal components analysis, time series, k-means clustering, and others. H2O also implements best-in-class algorithms at scale, such as Random Forest, Gradient Boosting, and Deep Learning. Customers can build thousands of models and compare the results to get the best predictions.
H2O is nurturing a grassroots movement of physicists, mathematicians, and computer scientists to herald the new wave of discovery with data science by collaborating closely with academic researchers and industrial data scientists. Stanford University giants Stephen Boyd, Trevor Hastie, and Rob Tibshirani advise the H2O team on building scalable machine learning algorithms. With hundreds of meetups over the past three years, H2O has become a word-of-mouth phenomenon, growing amongst the data community by a hundred-fold, and is now used by 30,000+ users and deployed using R, Python, Hadoop, and Spark in 2,000+ corporations.
Try it out
H2Os R package can be installed from CRAN at https://cran.r-project.org/web/packages/
h2o/. A Python package can be installed from PyPI at https://pypi.python.org/pypi/h2o/.
Download H2O directly from http://h2o.ai/download.
Join the community
Visit the open source community forum at https://groups.google.com/d/forum/h2ostream.
To learn about our meetups, training sessions, hackathons, and product updates, visit http://h2o.ai.
3 Installation

3.1 Installation in R

The h2o R package can be installed from CRAN:
install.packages("h2o")
Note: The version of H2O in CRAN is often one release behind the current version.
Alternatively, you can (and should for this tutorial) download the latest stable H2O-3 build from the H2O
download page:
1. Go to http://h2o.ai/download.
2. Choose the latest stable H2O-3 build.
3. Click the Install in R tab.
4. Copy and paste the commands into your R session.
After H2O is installed on your system, verify the installation:
library(h2o)

# Start H2O on your local machine using all available cores
h2o.init(nthreads = -1)

# Get help
?h2o.glm
?h2o.gbm

# Show a demo
demo(h2o.glm)
demo(h2o.gbm)
3.2 Installation in Python

The H2O Python package can be installed from PyPI (pip install h2o). After the package is installed, verify the installation:
import h2o

# Start H2O on your local machine
h2o.init()

# Get help
help(h2o.glm)
help(h2o.gbm)

# Show a demo
h2o.demo("glm")
h2o.demo("gbm")
3.3 Pointing to a different H2O cluster

The instructions in the previous sections create a one-node H2O cluster on your local machine. To connect to an established H2O cluster (in a multi-node Hadoop environment, for example), specify the IP address and port number of the established cluster using the ip and port parameters in the h2o.init() command. The syntax for this function is identical for R and Python:
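For example (the IP address and port shown are placeholders; substitute your cluster's values):

h2o.init(ip = "123.45.67.89", port = 54321)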
3.4 Example code
4 Deep Learning Overview

Unlike the neural networks of the past, modern Deep Learning has cracked the code for training stability and generalization, and scales to big data. It is often the algorithm of choice for the highest predictive accuracy, as deep learning algorithms perform well across a number of diverse problems.
First, we present a brief overview of deep neural networks for supervised learning tasks. There are several theoretical frameworks for deep learning, and here we summarize the feedforward architecture used by H2O.

The basic unit in the model is the neuron, a biologically inspired model of the human neuron. For humans, varying strengths of neurons' output signals travel along the synaptic junctions and are then aggregated as input for a connected neuron's activation. In the model, the weighted combination $\alpha = \sum_{i=1}^{n} w_i x_i + b$ of input signals is aggregated, and then an output signal $f(\alpha)$ is transmitted by the connected neuron. The function $f$ represents the nonlinear activation function used throughout the network, and the bias $b$ accounts for the neuron's activation threshold.
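As a toy illustration (not H2O code), a single neuron's computation can be written in a few lines of R:

# Toy sketch of a single neuron: weighted combination of inputs plus bias,
# passed through a nonlinear activation function (here tanh)
neuron_output <- function(x, w, b, f = tanh) {
  alpha <- sum(w * x) + b   # weighted combination of input signals
  f(alpha)                  # output signal transmitted onward
}

neuron_output(x = c(0.5, -1.2, 3.0), w = c(0.1, 0.4, -0.2), b = 0.3)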
Multi-layer, feedforward neural networks consist of many layers of interconnected neuron units: an input layer to match the feature space, followed by multiple layers of nonlinearity, and ending with a linear regression or classification layer to match the output space. The inputs and outputs of the model's units follow the basic logic of the single neuron described above. Bias units are included in each non-output layer of the network. The weights linking neurons and biases with other neurons fully determine the output of the entire network, and learning occurs when these weights are adapted to minimize the error on labeled training data. More specifically, for each training example $j$, the objective is to minimize a loss function, $L(W, B \mid j)$.

Here, $W$ is the collection $\{W_i\}_{1:N-1}$, where $W_i$ denotes the weight matrix connecting layers $i$ and $i+1$ for a network of $N$ layers. Similarly, $B$ is the collection $\{b_i\}_{1:N-1}$, where $b_i$ denotes the column vector of biases for layer $i+1$.

This basic framework of multi-layer neural networks can be used to accomplish deep learning tasks. Deep learning architectures are models of hierarchical feature extraction, typically involving multiple levels of nonlinearity. Deep learning models are able to learn useful representations of raw data and have exhibited high performance on complex data such as images, speech, and text (Bengio, 2009).
5 H2O's Deep Learning Architecture

As described above, H2O follows the model of multi-layer, feedforward neural networks for predictive modeling. This section provides a more detailed description of H2O's Deep Learning features, parameter configurations, and computational implementation.
5.1 Summary of features
5.2 Training protocol
The training protocol described below follows many of the ideas and advances in the recent deep learning
literature.
5.2.1 Initialization
Various deep learning architectures employ a combination of unsupervised pretraining followed by supervised
training, but H2O uses a purely supervised training protocol. The default initialization scheme is the
uniform adaptive option, which is an optimized initialization based on the size of the network. Alternatively,
you may select a random initialization to be drawn from either a uniform or normal distribution, for which
a scaling parameter may be specified as well.
5.2.2 Activation and loss functions
In the introduction, we described the nonlinear activation function $f$; the choices are summarized in Table 1. Here, $x_i$ and $w_i$ denote the firing neuron's input values and their weights, respectively, and $\alpha$ denotes their weighted combination $\alpha = \sum_i w_i x_i + b$. The tanh function is a rescaled and shifted logistic function; its symmetry around 0 allows the training algorithm to converge faster. The rectified linear activation function has demonstrated high performance on image recognition tasks, and is a more biologically accurate model of neuron activations (LeCun et al, 1998). Maxout is a generalization of the Rectifier activation, where each neuron picks the larger output of $k$ separate channels, each with its own weights and bias values. The current implementation supports only $k = 2$. Maxout activation works particularly well with dropout, a regularization method discussed later in this vignette (Goodfellow et al, 2013). The Rectifier is the special case of Maxout where one channel always outputs 0. It is difficult to determine a "best" activation function to use; each may outperform the others in separate scenarios, but grid search models (also described later) can help to compare activation functions and other parameters. The default activation function is the Rectifier. Each of these activation functions can be operated with dropout regularization (see below).

Table 1: Activation functions

  Tanh: $f(\alpha) = \frac{e^{\alpha} - e^{-\alpha}}{e^{\alpha} + e^{-\alpha}}$, with range $f(\cdot) \in [-1, 1]$
  Rectified Linear: $f(\alpha) = \max(0, \alpha)$, with range $f(\cdot) \in \mathbb{R}_+$
  Maxout: $f(\alpha_1, \alpha_2) = \max(\alpha_1, \alpha_2)$, with range $f(\cdot) \in \mathbb{R}$
The user can specify the distribution function of the response variable via the distribution argument as one of the following: AUTO, Bernoulli, Multinomial, Poisson, Gamma, Tweedie, Laplace, Huber, or Gaussian. Each distribution has a primary association with a particular loss function, but some distributions allow the user to specify a non-default loss function from the group of loss functions listed in Table 2. Bernoulli and Multinomial are primarily associated with Cross Entropy (also known as log loss), Gaussian with Mean Squared Error, Laplace with Absolute loss, and Huber with Huber loss. For the Poisson, Gamma, and Tweedie distributions, the user cannot change the loss function, so loss must be set to AUTO.

The system default enforces the table's "typical use" rule based on whether regression or classification is being performed. Note here that $t^{(j)}$ and $o^{(j)}$ are the predicted (target) output and actual output, respectively, for training example $j$; further, let $y$ denote the output units and $O$ the output layer.
Table 2: Loss functions

  Mean Squared Error (Regression): $L(W,B|j) = \tfrac{1}{2} \lVert t^{(j)} - o^{(j)} \rVert_2^2$
  Absolute (Regression): $L(W,B|j) = \lVert t^{(j)} - o^{(j)} \rVert_1$
  Huber (Regression): $L(W,B|j) = \tfrac{1}{2} \lVert t^{(j)} - o^{(j)} \rVert_2^2$ for $\lVert t^{(j)} - o^{(j)} \rVert_1 \le 1$, and $L(W,B|j) = \lVert t^{(j)} - o^{(j)} \rVert_1 - \tfrac{1}{2}$ otherwise
  Cross Entropy (Classification): $L(W,B|j) = -\sum_{y \in O} \left( \ln(o_y^{(j)}) \, t_y^{(j)} + \ln(1 - o_y^{(j)}) \, (1 - t_y^{(j)}) \right)$
5.2.3 Parallel distributed and multi-threaded training
The procedure to minimize the loss function $L(W, B \mid j)$ is a parallelized version of stochastic gradient descent (SGD). Standard SGD can be summarized as follows, with the gradient $\nabla L(W, B \mid j)$ computed via backpropagation (LeCun et al, 1998). The constant $\alpha$ indicates the learning rate, which controls the step sizes during gradient descent.

Standard stochastic gradient descent:

1. Initialize $W, B$
2. Iterate until convergence criterion reached:
   a. Get training example $i$
   b. Update all weights $w_{jk} \in W$ and biases $b_{jk} \in B$:
      $w_{jk} := w_{jk} - \alpha \, \partial L(W,B|j) / \partial w_{jk}$
      $b_{jk} := b_{jk} - \alpha \, \partial L(W,B|j) / \partial b_{jk}$
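As a toy illustration (not H2O's implementation), one SGD step for a single linear unit under squared loss can be written in R:

# One SGD update for a single linear unit with squared loss,
# L(w, b | j) = 0.5 * (t - o)^2 where o = sum(w * x) + b
sgd_step <- function(w, b, x, t, alpha = 0.005) {
  o <- sum(w * x) + b               # forward pass
  grad_o <- o - t                   # dL/do
  list(w = w - alpha * grad_o * x,  # dL/dw_k = (o - t) * x_k
       b = b - alpha * grad_o)      # dL/db   = (o - t)
}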
Stochastic gradient descent is known to be fast and memory-efficient, but not easily parallelizable without becoming slow. We utilize Hogwild!, the recently developed lock-free parallelization scheme (Niu et al, 2011), to address this issue. Hogwild! follows a shared-memory model where multiple cores (each handling separate subsets, or all, of the training data) are able to make independent contributions to the gradient updates $\nabla L(W, B \mid j)$ asynchronously. In a multi-node system, this parallelization scheme works on top of H2O's distributed setup, where the training data is distributed across the cluster. Each node operates in parallel on its local data until the final parameters $W, B$ are obtained by averaging. A rough summary follows.
Parallel distributed and multi-threaded training with SGD in H2O Deep Learning:

Here, the weight and bias updates follow the asynchronous Hogwild! procedure to incrementally adjust each node's parameters $W_n, B_n$ after seeing example $i$. The $\mathrm{Avg}_n$ notation refers to the final averaging of these local parameters across all nodes to obtain the global model parameters and complete training.
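As a toy illustration of the final averaging step (not H2O code), the $\mathrm{Avg}_n$ operation is an elementwise mean over the nodes' locally trained parameters:

# Elementwise average of each node's locally trained weight vector
average_models <- function(node_weights) {
  Reduce(`+`, node_weights) / length(node_weights)
}

average_models(list(c(0.2, -0.1), c(0.4, 0.3)))  # returns c(0.3, 0.1)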
5.2.4 Specifying the number of training samples per iteration
H2O Deep Learning is scalable and can take advantage of large clusters of compute nodes. There are three operating modes. The default behavior is to let every node train on the entire (replicated) dataset, but automatically shuffle (and/or use a subset of) the training examples for each iteration locally. For datasets that don't fit into each node's memory (depending on the amount of heap memory specified by the -Xmx Java option), it might not be possible to replicate the data, and each compute node can be instructed to train only with local data. An experimental single-node mode is available for cases where final convergence is slow due to the presence of too many nodes, but this has not been necessary in our testing.

The number of training examples globally presented to the distributed SGD worker nodes between model averaging is defined by the parameter train_samples_per_iteration. If the specified value is -1, all nodes process all their local training data per iteration. If replicate_training_data is enabled, which is the default setting, this will result in training N epochs (passes over the data) per iteration on N nodes; otherwise, one epoch will be trained per iteration. Another special value is 0, which always results in one epoch per iteration, regardless of the number of compute nodes. In general, any user-specified positive number is permissible for this parameter. For large datasets, we recommend specifying a fraction of the dataset.

For example, if the training data contains 10 million rows, and we specify the number of training samples per iteration as 100,000 when running on four nodes, then each node will process 25,000 examples per iteration, and it will take 40 distributed iterations to process one epoch. If the value is too high, it might take too long between synchronizations and model convergence may be slow. If the value is too low, network communication overhead will dominate the runtime and computational performance will suffer. The default value of -2 enables auto-tuning for this parameter, based on the computational performance of the processors and the network of the system, and attempts to find a good balance between computation and communication. This parameter can affect the convergence rate during training. An illustrative configuration follows.
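The scenario above could be configured roughly as follows; predictors, response, and train are placeholders for your column specification and training frame:

# Sketch: 100,000 globally presented samples between model averaging steps
model <- h2o.deeplearning(
  x = predictors,                         # predictor columns (placeholder)
  y = response,                           # response column (placeholder)
  training_frame = train,                 # e.g., a 10-million-row H2OFrame
  train_samples_per_iteration = 100000,
  epochs = 10)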
5.3 Regularization
5.4 Advanced optimization
H2O features manual and automatic versions of advanced optimization. The manual mode features include
momentum training and learning rate annealing, while automatic mode features an adaptive learning rate.
5.4.1 Momentum training
Momentum modifies back-propagation by allowing prior iterations to influence the current update. In particular, a velocity vector $v$ is defined to modify the updates as follows, with $\theta$ representing the parameters $W, B$; $\mu$ representing the momentum coefficient; and $\alpha$ denoting the learning rate:

$v_{t+1} = \mu v_t - \alpha \nabla L(\theta_t)$
$\theta_{t+1} = \theta_t + v_{t+1}$
Using the momentum parameter can aid in avoiding local minima and the associated instability (Sutskever et al, 2014). Too much momentum can lead to instabilities, which is why it is best to ramp up the momentum slowly. The parameters that control momentum are momentum_start, momentum_ramp, and momentum_stable.

The Nesterov accelerated gradient method, triggered by the nesterov_accelerated_gradient parameter, is a recommended improvement when using momentum updates. Using this method, the updates are further modified such that

$v_{t+1} = \mu v_t - \alpha \nabla L(\theta_t + \mu v_t)$
$W_{t+1} = W_t + v_{t+1}$
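As a toy illustration (not H2O's implementation), one plain momentum step can be written in R; grad is any function returning the gradient at the current parameters:

# One momentum update: velocity accumulates a decayed history of gradients
momentum_step <- function(theta, v, grad, mu = 0.9, alpha = 0.005) {
  v_new <- mu * v - alpha * grad(theta)    # velocity update
  list(theta = theta + v_new, v = v_new)   # parameter update
}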
5.4.2 Rate annealing
Throughout training, as the model approaches a minimum, the chance of oscillation or "optimum skipping" creates the need for a slower learning rate. Instead of specifying a constant learning rate $\alpha$, learning rate annealing gradually reduces the learning rate $\alpha_t$ to "freeze" into local minima in the optimization landscape (Zeiler, 2012).

For H2O, the annealing rate (rate_annealing) is the inverse of the number of training samples it takes to cut the learning rate in half (e.g., 1e-6 means that it takes 1e6 training samples to halve the learning rate).
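A quick check of this relationship in R (the values are illustrative):

# Effective learning rate after n_samples training samples under annealing
annealed_rate <- function(rate, rate_annealing, n_samples) {
  rate / (1 + rate_annealing * n_samples)
}

annealed_rate(0.005, 1e-6, 1e6)  # returns 0.0025: half the initial rate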
5.4.3 Adaptive learning
The implemented adaptive learning rate algorithm, ADADELTA (Zeiler, 2012), automatically combines the benefits of learning rate annealing and momentum training to avoid slow convergence. Specifying only two parameters ($\rho$ and $\epsilon$) simplifies hyperparameter search. In some cases, a manually controlled (non-adaptive) learning rate and momentum specification can lead to better results, but requires a hyperparameter search over up to seven parameters. If the model is built on a topology with many local minima or long plateaus, a constant learning rate may produce sub-optimal results. In general, however, we find that the adaptive learning rate produces the best results, so this option is the default.

The first of the two hyperparameters for adaptive learning is $\rho$ (rho). It is similar to momentum and relates to the memory of prior weight updates. Typical values are between 0.9 and 0.999. The second hyperparameter, $\epsilon$ (epsilon), is similar to learning rate annealing during initial training and to momentum at later stages, where it allows forward progress. Typical values are between 1e-10 and 1e-4.
5.5 Loading data
Loading a dataset in R or Python for use with H2O is slightly different from the usual methodology: we must convert our datasets into H2OFrame objects (distributed data frames), rather than using an R data.frame or data.table, or a Python pandas.DataFrame or numpy.array.
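For example, in R:

library(h2o)
h2o.init(nthreads = -1)

# Convert an in-memory R data.frame into a distributed H2OFrame
iris_hf <- as.h2o(iris)

# Alternatively, import a file directly into the H2O cluster
# (the path below is a placeholder)
# df <- h2o.importFile("path/to/data.csv")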
5.5.1 Data standardization/normalization
Along with categorical encoding, H2O's Deep Learning preprocesses the data to be standardized for compatibility with the activation functions (recall Table 1's summary of each activation function's target space). Since the activation function does not generally map into the full spectrum of real numbers, $\mathbb{R}$, we first standardize our data to be drawn from $\mathcal{N}(0, 1)$. Standardizing again after network propagation allows us to compute more precise errors in this standardized space, rather than in the raw feature space.
For autoencoding, the data is normalized (instead of standardized) to the compact interval $\mathcal{U}(-0.5, 0.5)$ to allow bounded activation functions like tanh to better reconstruct the data.
5.6 Additional parameters
This section provided some background on the various parameter configurations in H2O's Deep Learning architecture. Since there are dozens of possible parameter arguments when creating models, H2O Deep Learning models may seem daunting. However, most parameters do not need to be modified; the default settings are recommended as safe. The majority of the parameters that support (and in some cases, require) experimentation were discussed in the previous sections, and a few more are discussed in the following sections.
There is no default setting for the hidden layer sizes, the number of layers, or the number of epochs. Experimenting with different network topologies and different datasets will build intuition for these parameters, but two general rules of thumb apply. First, choose larger network sizes, as they can perform higher-level feature extraction, and techniques like dropout may train only subsets of the network at once. Second, use more epochs for greater predictive accuracy, but only when the computational cost is affordable. Many example tests can be found in the H2O GitHub repository for pointers on specific values and results for these (and other) parameters.
For a full list of H2O Deep Learning model parameters and default values, see Appendix A.
6 Use case: MNIST digit classification

6.1 MNIST overview
The MNIST database is a well-known academic dataset used to benchmark classification performance. The data consists of 60,000 training images and 10,000 test images, each of which is a standardized 28 × 28 pixel greyscale image of a single handwritten digit. An example of the scanned handwritten digits is shown in Figure 1.
Example in R

library(h2o)
h2o.init(nthreads = -1)

train_file <- "https://h2o-public-test-data.s3.amazonaws.com/bigdata/laptop/mnist/train.csv.gz"
test_file <- "https://h2o-public-test-data.s3.amazonaws.com/bigdata/laptop/mnist/test.csv.gz"

train <- h2o.importFile(train_file)
test <- h2o.importFile(test_file)

Example in Python

import h2o
h2o.init()

train_file = "https://h2o-public-test-data.s3.amazonaws.com/bigdata/laptop/mnist/train.csv.gz"
test_file = "https://h2o-public-test-data.s3.amazonaws.com/bigdata/laptop/mnist/test.csv.gz"

train = h2o.import_file(train_file)
test = h2o.import_file(test_file)
6.2 Performing a trial run

The example below illustrates the relative simplicity underlying most H2O Deep Learning model parameter configurations, thanks to the default settings. We use the first 28² = 784 values of each row to represent the full image and the final value to denote the digit class. Rectified linear activation is popular with image processing and has previously performed well on the MNIST database, and dropout has been known to enhance performance on this dataset as well, so we train our model accordingly.
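The listing below is a minimal sketch of such a trial run. The activation function, hidden-layer sizes, and epoch count mirror values referenced elsewhere in this vignette; the input_dropout_ratio and l1 settings are illustrative choices, not prescriptions.

Example in R

# The response must be a categorical column for classification
train[,785] <- as.factor(train[,785])
test[,785] <- as.factor(test[,785])

# Train the model (a sketch; dropout and l1 values are illustrative)
model <- h2o.deeplearning(
  x = 1:784,                          # pixel columns
  y = 785,                            # digit class column
  training_frame = train,
  validation_frame = test,
  distribution = "multinomial",
  activation = "RectifierWithDropout",
  hidden = c(200, 200, 200),
  input_dropout_ratio = 0.2,
  l1 = 1e-5,
  epochs = 10)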
6.2.1 N-fold cross-validation
When the nfolds argument is specified as a positive integer, N-fold cross-validation will be performed on the training frame, and the cross-validation metrics will be computed and stored as model output. The default for nfolds is 0, for no cross-validation. Optionally, the user can save the cross-validated predicted values (generated during cross-validation) by setting the keep_cross_validation_predictions parameter to true. This enables the user to calculate custom cross-validated performance metrics for their model in R or Python. Advanced users can also specify a fold column that defines the holdout fold for each row. By default, the holdout fold assignment is random, but other schemes such as round-robin assignment via the modulo operator are also supported. An example of generic N-fold cross-validation is given below.
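A minimal sketch, reusing the trial-run parameters; the choice of five folds is illustrative:

Example in R

# 5-fold cross-validation on the training frame
model_cv <- h2o.deeplearning(
  x = 1:784,
  y = 785,
  training_frame = train,
  distribution = "multinomial",
  activation = "RectifierWithDropout",
  hidden = c(200, 200, 200),
  epochs = 10,
  nfolds = 5)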
6.2.2 Extracting and handling the results
We can extract the parameters of our model, examine the scoring process, and make predictions on new data. The h2o.performance function in R returns all pre-computed performance metrics for either the training set or the validation set, or returns the cross-validated metrics for the training set. An equivalent model_performance method is available in Python. Utility functions that return specific metrics, such as MSE or AUC, are also available. Examples are shown below using the previously trained model and model_cv objects.
Example in R

# Training set metrics
h2o.performance(model)

# Validation set metrics
h2o.performance(model, valid = TRUE)

# Cross-validated MSE
h2o.mse(model_cv, xval = TRUE)

Example in Python

# Training and validation set metrics
model.model_performance(train=True)
model.model_performance(valid=True)

# Cross-validated MSE
model_cv.mse(xval=True)
The second command returns the training and validation errors for the model. The training error value is based on the parameter score_training_samples, which specifies the number of randomly sampled training points used for scoring (the default uses 10,000 points). The validation error is based on the parameter score_validation_samples, which configures the same value on the validation set; by default, this is the entire validation set.

In general, choosing a greater number of sampled points leads to a better understanding of the model's performance on your dataset; setting either of these parameters to 0 automatically uses the entire corresponding dataset for scoring. Either way, you can control the minimum and maximum time spent on scoring with the score_interval and score_duty_cycle parameters.

These scoring parameters also affect the final model when the parameter overwrite_with_best_model is enabled. This option selects as the final model the model that achieved the lowest validation error during training (based on the sampled points used for scoring). If a dataset is not specified as the validation set, the training data is used by default; in this case, either the score_training_samples or score_validation_samples parameter will control the error computation during training and, in turn, the selected best model.
Once we have a satisfactory model, as determined by the validation or cross-validation metrics, the
h2o.predict() command can be used to compute and store predictions on new data, which can then
be used for further tasks in the interactive data science process.
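For instance, predictions on the MNIST test set can be computed and inspected along these lines (a sketch):

Example in R

# Classify the test set; the returned H2OFrame holds the predicted
# class plus the per-class probabilities
pred <- h2o.predict(model, newdata = test)

# Take a look at the first few predictions
head(pred)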
6.3 Web interface
H2O R users have access to an intuitive web interface for H2O, Flow, that mirrors the model building process in R. After loading data or training a model in R, point your browser to your IP address and port number (e.g., localhost:54321) to launch the web interface. From here, you can click on Admin > Jobs to view specific details about your model. You can also click on Data > List All Frames to view all current H2O frames.
6.3.1 Variable importances
The variable importances feature can be enabled by setting the variable_importances argument to true. This feature allows us to view the absolute and relative predictive strength of each feature in the prediction task. Each H2O algorithm class has its own methodology for computing variable importance. For H2O's Deep Learning, the Gedeon method (Gedeon, 1997) is used; because it can be slow for large networks, it is turned off by default. If variable importance is the top priority in your analysis, you may (also) consider training a Random Forest and inspecting the variable importances generated with that method.
The following code demonstrates training using the variable importances option enabled and how to
extract the variable importances from the trained model. From the web UI, you can also view a visualization
of the variable importances.
Example in R

# Train a model with the variable_importances option enabled
# (a sketch; parameters other than the two preserved below mirror the trial run)
model_vi <- h2o.deeplearning(
  x = 1:784,
  y = 785,
  training_frame = train,
  distribution = "multinomial",
  activation = "RectifierWithDropout",
  hidden = c(200, 200, 200),
  epochs = 10,
  variable_importances = TRUE)

# Extract the variable importances from the trained model
h2o.varimp(model_vi)
6.3.2 Java model
Another important feature of the web interface is the Java (POJO) model, accessible from the Preview
POJO button at the bottom of the model results. This button allows access to Java code that builds the
model when called from a main method in a Java program. Instructions for downloading and running this
Java code are available from the web interface, and example production scoring code is available as well.
6.4 Grid search for model comparison
H2O supports grid search capabilities for model tuning by allowing users to tweak certain parameters and
observe changes in model behavior. This is done by specifying sets of values for parameter arguments. At
the time of publication, the Python grid search API is still undergoing development, so we demonstrate an
R example below.
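A minimal sketch of such a grid search; the three topologies and two l1 values shown are illustrative:

Example in R

# Hyperparameter sets: three network topologies and two l1 penalties
hidden_opt <- list(c(200, 200), c(100, 300, 100), c(500, 500, 500))
l1_opt <- c(1e-5, 1e-7)
hyper_params <- list(hidden = hidden_opt, l1 = l1_opt)

# Train one model per combination of the parameters above
model_grid <- h2o.grid("deeplearning",
                       hyper_params = hyper_params,
                       x = 1:784,
                       y = 785,
                       distribution = "multinomial",
                       training_frame = train,
                       validation_frame = test)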
In this example, we specified three different network topologies and two different ℓ1 norm weights. This grid search trains six different models using all possible combinations of these parameters; other parameter combinations can be specified for a larger space of models. Inspecting and comparing the trained models after the grid search completes provides more subtle insight into the model tuning and selection process. To learn how and when to select different parameter configurations in a grid search, refer to Appendix A for parameter descriptions and configurable values.
Example in R

# print out all prediction errors and run times of the models
model_grid

# (sketch) retrieve each model in the grid and report its validation MSE
for (model_id in model_grid@model_ids) {
  mse <- h2o.mse(h2o.getModel(model_id), valid = TRUE)
  print(sprintf("Validation set MSE: %f", mse))
}
6.5 Checkpoint model
To resume model training, use checkpoint model keys (model_id) to incrementally train a particular model with more iterations, more data, different data, and so forth. To train our initial model further, we can use it (or its key) as a checkpoint argument for a new model.

In the R example below, model_grid@model_ids[[1]] represents the highest-performing model from the grid search, which is used for additional training. For checkpoint restarts, the training and validation datasets, as well as the response column, must match. In addition, any non-default model parameters, such as hidden = c(200, 200) in the example below, must match as well. For the Python example, we use the original trial-run model that we trained previously as the checkpoint model.
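A sketch of the checkpoint restart in R; aside from the checkpoint argument, the parameters repeat those of the originating grid model (hidden = c(200, 200), as noted above):

Example in R

# Continue training the best model from the grid search
model_chkp <- h2o.deeplearning(
  x = 1:784,
  y = 785,
  training_frame = train,
  validation_frame = test,
  checkpoint = model_grid@model_ids[[1]],
  distribution = "multinomial",
  activation = "RectifierWithDropout",
  hidden = c(200, 200),
  epochs = 10)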
Example in Python

# Continue training the earlier trial-run model, using it as a checkpoint;
# x and y are the predictor and response specifications used previously
model_chkp = h2o.deeplearning(
    x=x,
    y=y,
    training_frame=train,
    validation_frame=test,
    checkpoint=model,
    distribution="multinomial",
    activation="RectifierWithDropout",
    hidden=[200,200,200],
    epochs=10)
Checkpointing can also be used to reload existing models that were saved to disk in a previous session. For
example, we can save and reload a model by running the following commands.
Example in R

# Specify a model and the file path where it is to be saved
model_path <- h2o.saveModel(object = model,
                            path = "/tmp/mymodel",
                            force = TRUE)

print(model_path)
# /tmp/mymodel/DeepLearning_model_R_1441838096933
Example in Python

# Specify a model and the file path where it is to be saved
model_path = h2o.save_model(model=model, path="/tmp/mymodel", force=True)

print model_path
# /tmp/mymodel/DeepLearning_model_python_1441838096933
After restarting H2O, load the saved model by specifying the host and saved model file path. Note: The
saved model must be reloaded using a compatible H2O version (i.e., the same version used to save the
model).
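A minimal sketch of reloading the model from the path returned by the save step:

Example in R

saved_model <- h2o.loadModel(model_path)

Example in Python

saved_model = h2o.load_model(model_path)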
6.6 Achieving world-record performance
Without distortions, convolutions, or other advanced image processing techniques, the best-ever published test set error for the MNIST dataset is 0.83%, by Microsoft. After training for 2,000 epochs (about 4 hours) on 4 compute nodes, we obtain a 0.87% test set error; after training for 8,000 epochs (about 10 hours) on 10 nodes, we obtain a 0.83% test set error, matching the current world record, notably achieved using a distributed configuration and a simple one-liner from R. Details can be found in our hands-on tutorial. Test set errors of around 1% are typically achieved within 1 hour when running on 1 node.
6.7 Computational performance
There are many parameters that affect the computational performance of H2O Deep Learning, but the default values should result in good performance for most problems. An in-depth study of the computational performance characteristics of H2O Deep Learning, with complete code examples and results, can be found in our blog post, "Definitive Performance Tuning Guide for Deep Learning." The parallel scalability of H2O for the MNIST dataset on 1 to 63 compute nodes is shown in the figure below.
7 Deep Autoencoders

7.1 Nonlinear dimensionality reduction
So far, we have discussed purely supervised deep learning tasks. However, deep learning can also be used for unsupervised feature learning or, more specifically, nonlinear dimensionality reduction (Hinton et al, 2006). Consider a three-layer neural network with one hidden layer. If we treat our input data as labeled with the same input values, then the network is forced to learn the identity via a nonlinear, reduced representation of the original data. This type of algorithm is called a deep autoencoder; these models have been used extensively for unsupervised, layer-wise pre-training of supervised deep learning tasks, but here we discuss the autoencoder's ability to discover anomalies in data.
7.2 Use case: anomaly detection
Consider the deep autoencoder model described above. Given enough training data that resembles some underlying pattern, the network will train itself to easily learn the identity when confronted with that pattern. However, if an anomalous test point that does not match the learned pattern arrives, the autoencoder will likely have a high error reconstructing it, which indicates anomalous data.

We use this framework to develop an anomaly detection demonstration using a deep autoencoder. The dataset is an ECG time series of heartbeats, and the goal is to determine which heartbeats are outliers. The training data (20 "good" heartbeats) and the test data (the training data with 3 "bad" heartbeats appended, for simplicity) can be downloaded directly into the H2O cluster, as shown below. Each row represents a single heartbeat. The autoencoder is trained as follows:
Example in R

# Download and import ECG train and test data into the H2O cluster
train_ecg <- h2o.importFile(path = "http://h2o-public-test-data.s3.amazonaws.com/smalldata/anomaly/ecg_discord_train.csv",
                            header = FALSE,
                            sep = ",")
test_ecg <- h2o.importFile(path = "http://h2o-public-test-data.s3.amazonaws.com/smalldata/anomaly/ecg_discord_test.csv",
                           header = FALSE,
                           sep = ",")

# Train a deep autoencoder on the "good" heartbeats
anomaly_model <- h2o.deeplearning(
  x = names(train_ecg),
  training_frame = train_ecg,
  activation = "Tanh",
  autoencoder = TRUE,
  hidden = c(50, 20, 50),
  l1 = 1e-4,
  epochs = 100)

# Compute the per-row reconstruction error
# (MSE between the output and input layers)
recon_error <- h2o.anomaly(anomaly_model, test_ecg)
Example in Python

# Download and import ECG train and test data into the H2O cluster
train_ecg = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/anomaly/ecg_discord_train.csv")
test_ecg = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/anomaly/ecg_discord_test.csv")

# Train a deep autoencoder on the "good" heartbeats
anomaly_model = h2o.deeplearning(
    x=train_ecg.names,
    training_frame=train_ecg,
    activation="Tanh",
    autoencoder=True,
    hidden=[50,20,50],
    l1=1e-4,
    epochs=100)

# Compute the per-row reconstruction error
# (MSE between the output and input layers)
recon_error = anomaly_model.anomaly(test_ecg)
Appendix A: Complete Parameter List

epsilon: Similar to learning rate annealing during initial training and to momentum at later stages, where it allows forward progress. Only active if adaptive_rate is enabled. Default is 1e-8. Refer to the Adaptive learning section for more details.
rate: The learning rate, α. Higher values lead to less stable models, while lower values lead to slower convergence. Default is 0.005.
rate_annealing: The annealed learning rate is calculated as rate / (1 + rate_annealing * N), where N is the number of training samples. It is the inverse of the number of training samples required to cut the learning rate in half. Annealing reduces the learning rate to "freeze" into local minima in the optimization landscape. Default value is 1e-6 (when adaptive learning is disabled). Refer to the Rate annealing section for more details.
rate_decay: Learning rate decay factor between layers (for the L-th layer: rate * rate_decay^(L-1)); default is 1.0 (when adaptive learning is disabled). The learning rate decay parameter controls the change of learning rate across layers.
momentum_start: Controls the amount of momentum at the beginning of training (when adaptive learning is disabled). Default is 0. Refer to the Momentum training section for more details.

momentum_ramp: Controls the amount of learning for which momentum increases (assuming momentum_stable is larger than momentum_start). The ramp is measured in the number of training samples and can be enabled when adaptive learning is disabled. Default is 1 million (1e6). Refer to the Momentum training section for more details.

momentum_stable: Controls the final momentum value reached after momentum_ramp training samples (when adaptive learning is disabled). The momentum used for training remains the same beyond that point. Default is 0. Refer to the Momentum training section for more details.
nesterov_accelerated_gradient: Logical. The Nesterov accelerated gradient descent method is a modification to traditional gradient descent for convex functions. The method relies on gradient information at various points to build a polynomial approximation that minimizes the residuals in fewer iterations of the descent. This parameter is only active if the adaptive learning rate is disabled; in that case, the default is true. Refer to the Momentum training section for more details.
input_dropout_ratio: The fraction of the features for each training row to be omitted from training in order to improve generalization. The default is 0 (always use all features). Refer to the Regularization section for more details.

hidden_dropout_ratios: The fraction of the inputs for each hidden layer to be omitted from training in order to improve generalization. The default is 0.5 for each hidden layer. Refer to the Regularization section for more details.
l1: L1 regularization, which constrains the absolute value of the weights (can add stability and
improve generalization, causes many weights to become 0). The default is 0, for no L1 regularization.
Refer to the Regularization section for more details.
l2: L2 regularization, which constrains the sum of the squared weights (can add stability and improve
generalization, causes many weights to be small). The default is 0, for no L2 regularization. Refer to
the Regularization section for more details.
max_w2: A maximum on the sum of the squared incoming weights into any one neuron. This tuning parameter is especially useful for unbounded activation functions such as Maxout or Rectifier. The default, positive infinity, leaves this maximum unbounded.
initial_weight_distribution: The distribution from which initial weights are drawn. Select Uniform, UniformAdaptive, or Normal. Default is UniformAdaptive. Refer to the Initialization section for more details.
initial_weight_scale: The scale of the distribution function for Uniform or Normal distributions. For Uniform, the values are drawn uniformly from (-initial_weight_scale, initial_weight_scale). For Normal, the values are drawn from a Normal distribution with a standard deviation of initial_weight_scale. The default is 1. Refer to the Initialization section for more details.
loss: Specify one of the loss options: Automatic, CrossEntropy (for classification only),
MeanSquare, Absolute, or Huber. Refer to the Activation and loss functions section for
more details.
distribution: Specify the distribution function of the response: AUTO, bernoulli, multinomial,
poisson, gamma, tweedie, laplace, huber, or gaussian.
tweedie_power: Specify the Tweedie power; applicable only if distribution is set to tweedie. The value must be between 1.0 and 2.0.
score_interval: The minimum time (in seconds) to elapse between model scoring. The actual interval is determined by the number of training samples per iteration and the scoring duty cycle. To use all training set samples, specify 0. Default is 5.
score_training_samples: The number of training samples to be used for scoring. These samples will be randomly sampled. Use 0 to select the entire training dataset. Default is 10000.

score_validation_samples: The number of validation dataset points to be used for scoring. Can be randomly sampled or stratified (if balance_classes is set and score_validation_sampling is set to stratify). Use 0 to select the entire validation dataset (default).
score_duty_cycle: Maximum duty cycle fraction spent on model scoring (on both training and validation samples) and on diagnostics such as computation of feature importances (i.e., not on training). Lower values result in more training, while higher values produce more scoring. Default is 0.1.
classification_stop: The stopping criterion for classification error (1 - accuracy) on the training data scoring dataset. When the error is at or below this threshold, the training process stops. Default is 0. To disable, enter -1.

regression_stop: The stopping criterion for regression error (MSE) on the training data scoring dataset. When the error is at or below this threshold, the training process stops. Default is 1e-6. To disable, enter -1.
quiet_mode: Logical. Enable quiet mode for less output to standard output. Default is false.
max_confusion_matrix_size: For classification models, the maximum size (in terms of classes) of the confusion matrix to display. This option is meant to avoid printing extremely large confusion matrices. Default is 20.

max_hit_ratio_k: The maximum number (top K) of predictions to use for hit-ratio computation (for multi-class only; enter 0 to disable). Default is 10.
balance_classes: Logical. For imbalanced data, the training data class counts can be artificially balanced by over-sampling the minority class(es) and under-sampling the majority class(es) so that each class contains the same number of observations. This can result in improved predictive accuracy. Over-sampling is done with replacement (rather than simulating new observations), and the total number of observations after balancing is controlled by the max_after_balance_size parameter. Default is false.

class_sampling_factors: Desired over/under-sampling ratios per class (in lexicographic order). Only applies to classification when balance_classes is enabled. If not specified, the ratios will be automatically computed to obtain class balance during training.

max_after_balance_size: When classes are balanced, limit the resulting dataset size to the specified multiple of the original dataset size. This is the maximum relative size of the training data after balancing class counts (can be less than 1.0). Default is 5.

score_validation_sampling: Method used to sample the validation dataset for scoring. The possible methods are Uniform and Stratified. Default is Uniform.
diagnostics: (Deprecated) Logical. Gather diagnostics for hidden layers, such as mean and root
mean squared (RMS) values of learning rate, momentum, weights and biases. Since deprecation,
diagnostics are always enabled (set to true).
variable_importances: Logical. Compute variable importances for input features using the Gedeon method. The implementation considers the weights connecting the input features to the first two hidden layers. Default is false, since this can be slow for large networks.
fast_mode: Logical. Enable fast mode (a minor approximation in back-propagation). This should not affect results significantly. Default is true.

ignore_const_cols: Logical. Ignore constant training columns (no information can be gained anyway). Default is true.

force_load_balance: Logical. Force extra load balancing to increase training speed for small datasets and keep all cores busy. Default is true.

replicate_training_data: Logical. Replicate the entire training dataset onto every node for faster training on small datasets. Default is true.
single_node_mode: Logical. Run on a single node for fine-tuning of model parameters. Can be useful for faster convergence during checkpoint resumes after training on a very large number of nodes (for fast initial convergence). Default is false.

shuffle_training_data: Logical. Enable shuffling of training data (on each node). This option is recommended if training data is replicated on N nodes and the number of training samples per iteration is close to N times the dataset size, where all nodes train with (almost) all of the data. It is automatically enabled if the number of training samples per iteration is set to -1 (or to N times the dataset size or larger). Default is false.
sparse: (Deprecated) Logical. Enable sparse data handling. Default is false.

col_major: (Deprecated) Logical. Use a column-major weight matrix for the input layer; can speed up forward propagation, but may slow down backpropagation. Default is false.

average_activation: Specify the average activation for the sparse autoencoder (Experimental). Default is 0.

sparsity_beta: Specify the sparsity-based regularization optimization (Experimental). Default is 0.

max_categorical_features: Maximum number of categorical features allowed in a column, enforced via hashing (Experimental). Default is 2^31 - 1 (Integer.MAX_VALUE in Java).
reproducible: Logical. Force reproducibility on small data (will be slow; only uses one thread). Default is false.

export_weights_and_biases: Logical. Specify whether to export the neural network weights and biases as an H2OFrame. Default is false.
offset_column: Specify the offset column by column name. Regression only. Offsets are per-row bias values that are used during model training. For Gaussian distributions, they can be seen as simple corrections to the response (y) column. Instead of learning to predict the response value directly, the model learns to predict the (row) offset of the response column. For other distributions, the offset corrections are applied in the linearized space before applying the inverse link function to get the actual response values.

weights_column: Specify the weights column by column name. Weights are per-row observation weights. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor.
nfolds: (Optional) Number of folds for cross-validation. Default is 0 (no cross-validation is
performed).
fold_column: (Optional) The name of the column with the cross-validation fold index assignment per observation; the folds are supplied by the user.
fold_assignment: Cross-validation fold assignment scheme, if nfolds is greater than zero and fold_column is not specified. Options are AUTO, Random, or Modulo.

keep_cross_validation_predictions: Logical. Specify whether to keep the predictions of the cross-validation models. Default is false.
Appendix D: References