Deep Learning with H2O

Arno Candel
Viraj Parmar
Erin LeDell
Anisha Arora
Jessica Lanford

http://h2o.gitbooks.io/deep-learning/
August 2015: Third Edition
Contents

1 Introduction
2 What is H2O?
3 Installation
   3.1 Installation in R
   3.2 Installation in Python
   3.3 Pointing to a different H2O cluster
   3.4 Example code
4 Deep Learning Overview
5 H2O's Deep Learning Architecture
   5.1 Summary of features
   5.2 Training protocol
   5.3 Regularization
   5.4 Advanced optimization
   5.5 Loading data
   5.6 Additional parameters
6 Use case: MNIST digit classification
   6.1 MNIST overview
   6.2 Performing a trial run
   6.3 Web interface
   6.4 Grid search for model comparison
   6.5 Checkpoint model
   6.6 Achieving world-record performance
   6.7 Computational performance
7 Deep Autoencoders
   7.1 Nonlinear dimensionality reduction
   7.2 Use case: anomaly detection
Appendix A: Complete Parameter List
Appendix D: References
1 Introduction
This document introduces the reader to Deep Learning with H2O. Examples are written in R and Python. The reader is walked through installing H2O, basic deep learning concepts, building deep neural networks in H2O, interpreting model output, making predictions, and various implementation details.
2 What is H2O?
H2O is fast, scalable, open-source machine learning and deep learning for Smarter Applications. With H2O,
enterprises like PayPal, Nielsen, Cisco, and others can use all their data without sampling to get accurate
predictions faster. Advanced algorithms, like Deep Learning, Boosting, and Bagging Ensembles are built-in
to help application designers create smarter applications through elegant APIs. Some of our initial customers
have built powerful domain-specific predictive engines for Recommendations, Customer Churn, Propensity
to Buy, Dynamic Pricing, and Fraud Detection for the Insurance, Healthcare, Telecommunications, AdTech,
Retail, and Payment Systems industries.
Using in-memory compression, H2O handles billions of data rows in-memory, even with a small cluster. To make it easier for non-engineers to create complete analytic workflows, H2O's platform includes interfaces for R, Python, Scala, Java, JSON, and CoffeeScript/JavaScript, as well as a built-in web interface, Flow. H2O was built alongside (and on top of) Hadoop and Spark clusters and typically deploys within minutes.
H2O includes many common machine learning algorithms, such as generalized linear modeling (linear regression, logistic regression, etc.), Naïve Bayes, principal components analysis, time series, k-means clustering, and others. H2O also implements best-in-class algorithms at scale, such as Random Forest, Gradient Boosting, and Deep Learning. Customers can build thousands of models and compare the results to get the best predictions.
H2O is nurturing a grassroots movement of physicists, mathematicians, and computer scientists to herald the new wave of discovery with data science by collaborating closely with academic researchers and industrial data scientists. Stanford University giants Stephen Boyd, Trevor Hastie, and Rob Tibshirani advise the H2O team on building scalable machine learning algorithms. With hundreds of meetups over the past three years, H2O has become a word-of-mouth phenomenon, growing amongst the data community by a hundred-fold, and is now used by 30,000+ users and deployed using R, Python, Hadoop, and Spark in 2,000+ corporations.
Try it out
H2Os R package can be installed from CRAN at https://cran.r-project.org/web/packages/
h2o/. A Python package can be installed from PyPI at https://pypi.python.org/pypi/h2o/.
Download H2O directly from http://h2o.ai/download.
Join the community
Visit the open source community forum at https://groups.google.com/d/forum/h2ostream.
To learn about our meetups, training sessions, hackathons, and product updates, visit http://h2o.ai.
3 Installation

3.1 Installation in R

The h2o R package can be installed from CRAN:
install.packages("h2o")
Note: The version of H2O in CRAN is often one release behind the current version.
Alternatively, you can (and should for this tutorial) download the latest stable H2O-3 build from the H2O
download page:
1. Go to http://h2o.ai/download.
2. Choose the latest stable H2O-3 build.
3. Click the Install in R tab.
4. Copy and paste the commands into your R session.
After H2O is installed on your system, verify the installation:
library(h2o)

# Start H2O on your local machine using all available cores
h2o.init(nthreads = -1)

# Get help
?h2o.glm
?h2o.gbm

# Show a demo
demo(h2o.glm)
demo(h2o.gbm)
3.2 Installation in Python

The H2O Python package can be installed from PyPI (pip install h2o). After the package is installed, verify the installation:
import h2o

# Start H2O on your local machine
h2o.init()

# Get help
help(h2o.glm)
help(h2o.gbm)

# Show a demo
h2o.demo("glm")
h2o.demo("gbm")
3.3 Pointing to a different H2O cluster

The instructions in the previous sections create a one-node H2O cluster on your local machine. To connect to an established H2O cluster (in a multi-node Hadoop environment, for example), specify the IP address and port number of the established cluster using the ip and port parameters in the h2o.init() command. The syntax for this function is identical for R and Python:
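For example (the IP address and port shown are placeholders; substitute your cluster's values):

h2o.init(ip = "123.45.67.89", port = 54321)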
3.4 Example code
4 Deep Learning Overview

Unlike the neural networks of the past, modern Deep Learning has cracked the code for training stability and generalization, and scales to big data. It is often the algorithm of choice for the highest predictive accuracy, as deep learning algorithms perform well across a number of diverse problems.
First, we present a brief overview of deep neural networks for supervised learning tasks. There are several theoretical frameworks for deep learning, and here we summarize the feedforward architecture used by H2O.

The basic unit in the model is the neuron, a biologically inspired model of the human neuron. For humans, varying strengths of neurons' output signals travel along the synaptic junctions and are then aggregated as input for a connected neuron's activation. In the model, the weighted combination $\alpha = \sum_{i=1}^{n} w_i x_i + b$ of input signals is aggregated, and then an output signal $f(\alpha)$ is transmitted by the connected neuron. The function $f$ represents the nonlinear activation function used throughout the network, and the bias $b$ accounts for the neuron's activation threshold.
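As a toy illustration (not H2O code), a single neuron's computation can be written in a few lines of R:

# Toy sketch of a single neuron: weighted combination of inputs plus bias,
# passed through a nonlinear activation function (here tanh)
neuron_output <- function(x, w, b, f = tanh) {
  alpha <- sum(w * x) + b   # weighted combination of input signals
  f(alpha)                  # output signal transmitted onward
}

neuron_output(x = c(0.5, -1.2, 3.0), w = c(0.1, 0.4, -0.2), b = 0.3)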
Multi-layer, feedforward neural networks consist of many layers of interconnected neuron units: an input layer to match the feature space, followed by multiple layers of nonlinearity, and ending with a linear regression or classification layer to match the output space. The inputs and outputs of the model's units follow the basic logic of the single neuron described above. Bias units are included in each non-output layer of the network. The weights linking neurons and biases with other neurons fully determine the output of the entire network, and learning occurs when these weights are adapted to minimize the error on labeled training data. More specifically, for each training example $j$, the objective is to minimize a loss function, $L(W, B \mid j)$.

Here, $W$ is the collection $\{W_i\}_{1:N-1}$, where $W_i$ denotes the weight matrix connecting layers $i$ and $i+1$ for a network of $N$ layers. Similarly, $B$ is the collection $\{b_i\}_{1:N-1}$, where $b_i$ denotes the column vector of biases for layer $i+1$.

This basic framework of multi-layer neural networks can be used to accomplish deep learning tasks. Deep learning architectures are models of hierarchical feature extraction, typically involving multiple levels of nonlinearity. Deep learning models are able to learn useful representations of raw data and have exhibited high performance on complex data such as images, speech, and text (Bengio, 2009).
5 H2O's Deep Learning Architecture

As described above, H2O follows the model of multi-layer, feedforward neural networks for predictive modeling. This section provides a more detailed description of H2O's Deep Learning features, parameter configurations, and computational implementation.
5.1 Summary of features
5.2 Training protocol
The training protocol described below follows many of the ideas and advances in the recent deep learning
literature.
5.2.1 Initialization
Various deep learning architectures employ a combination of unsupervised pretraining followed by supervised
training, but H2O uses a purely supervised training protocol. The default initialization scheme is the
uniform adaptive option, which is an optimized initialization based on the size of the network. Alternatively,
you may select a random initialization to be drawn from either a uniform or normal distribution, for which
a scaling parameter may be specified as well.
5.2.2 Activation and loss functions
In the introduction, we described the nonlinear activation function $f$; the choices are summarized in Table 1. Here, $x_i$ and $w_i$ denote the firing neuron's input values and their weights, respectively, and $\alpha$ denotes their weighted combination $\alpha = \sum_i w_i x_i + b$. The tanh function is a rescaled and shifted logistic function; its symmetry around 0 allows the training algorithm to converge faster. The rectified linear activation function has demonstrated high performance on image recognition tasks, and is a more biologically accurate model of neuron activations (LeCun et al, 1998). Maxout is a generalization of the Rectifier activation, where each neuron picks the larger output of $k$ separate channels, each with its own weights and bias values. The current implementation supports only $k = 2$. Maxout activation works particularly well with dropout, a regularization method discussed later in this vignette (Goodfellow et al, 2013). The Rectifier is the special case of Maxout where one channel always outputs 0. It is difficult to determine a "best" activation function to use; each may outperform the others in separate scenarios, but grid search models (also described later) can help to compare activation functions and other parameters. The default activation function is the Rectifier. Each of these activation functions can be operated with dropout regularization (see below).

Table 1: Activation functions

  Tanh: $f(\alpha) = \frac{e^{\alpha} - e^{-\alpha}}{e^{\alpha} + e^{-\alpha}}$, with range $f(\cdot) \in [-1, 1]$
  Rectified Linear: $f(\alpha) = \max(0, \alpha)$, with range $f(\cdot) \in \mathbb{R}_+$
  Maxout: $f(\alpha_1, \alpha_2) = \max(\alpha_1, \alpha_2)$, with range $f(\cdot) \in \mathbb{R}$
The user can specify the distribution function of the response variable via the distribution argument as one of the following: AUTO, Bernoulli, Multinomial, Poisson, Gamma, Tweedie, Laplace, Huber, or Gaussian. Each distribution has a primary association with a particular loss function, but some distributions allow the user to specify a non-default loss function from the group of loss functions listed in Table 2. Bernoulli and Multinomial are primarily associated with Cross Entropy (also known as log loss), Gaussian with Mean Squared Error, Laplace with Absolute loss, and Huber with Huber loss. For the Poisson, Gamma, and Tweedie distributions, the user cannot change the loss function, so loss must be set to AUTO.

The system default enforces the table's "typical use" rule based on whether regression or classification is being performed. Note here that $t^{(j)}$ and $o^{(j)}$ are the predicted (target) output and actual output, respectively, for training example $j$; further, let $y$ denote the output units and $O$ the output layer.
Table 2: Loss functions

  Mean Squared Error (Regression): $L(W,B|j) = \tfrac{1}{2} \lVert t^{(j)} - o^{(j)} \rVert_2^2$
  Absolute (Regression): $L(W,B|j) = \lVert t^{(j)} - o^{(j)} \rVert_1$
  Huber (Regression): $L(W,B|j) = \tfrac{1}{2} \lVert t^{(j)} - o^{(j)} \rVert_2^2$ for $\lVert t^{(j)} - o^{(j)} \rVert_1 \le 1$, and $L(W,B|j) = \lVert t^{(j)} - o^{(j)} \rVert_1 - \tfrac{1}{2}$ otherwise
  Cross Entropy (Classification): $L(W,B|j) = -\sum_{y \in O} \left( \ln(o_y^{(j)}) \, t_y^{(j)} + \ln(1 - o_y^{(j)}) \, (1 - t_y^{(j)}) \right)$
5.2.3 Parallel distributed and multi-threaded training
The procedure to minimize the loss function $L(W, B \mid j)$ is a parallelized version of stochastic gradient descent (SGD). Standard SGD can be summarized as follows, with the gradient $\nabla L(W, B \mid j)$ computed via backpropagation (LeCun et al, 1998). The constant $\alpha$ indicates the learning rate, which controls the step sizes during gradient descent.

Standard stochastic gradient descent:

1. Initialize $W, B$
2. Iterate until convergence criterion reached:
   a. Get training example $i$
   b. Update all weights $w_{jk} \in W$ and biases $b_{jk} \in B$:
      $w_{jk} := w_{jk} - \alpha \, \partial L(W,B|j) / \partial w_{jk}$
      $b_{jk} := b_{jk} - \alpha \, \partial L(W,B|j) / \partial b_{jk}$
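As a toy illustration (not H2O's implementation), one SGD step for a single linear unit under squared loss can be written in R:

# One SGD update for a single linear unit with squared loss,
# L(w, b | j) = 0.5 * (t - o)^2 where o = sum(w * x) + b
sgd_step <- function(w, b, x, t, alpha = 0.005) {
  o <- sum(w * x) + b               # forward pass
  grad_o <- o - t                   # dL/do
  list(w = w - alpha * grad_o * x,  # dL/dw_k = (o - t) * x_k
       b = b - alpha * grad_o)      # dL/db   = (o - t)
}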
Stochastic gradient descent is known to be fast and memory-efficient, but not easily parallelizable without becoming slow. We utilize Hogwild!, the recently developed lock-free parallelization scheme (Niu et al, 2011), to address this issue. Hogwild! follows a shared-memory model where multiple cores (each handling separate subsets, or all, of the training data) are able to make independent contributions to the gradient updates $\nabla L(W, B \mid j)$ asynchronously. In a multi-node system, this parallelization scheme works on top of H2O's distributed setup, where the training data is distributed across the cluster. Each node operates in parallel on its local data until the final parameters $W, B$ are obtained by averaging. A rough summary follows.
Parallel distributed and multi-threaded training with SGD in H2O Deep Learning:

Here, the weight and bias updates follow the asynchronous Hogwild! procedure to incrementally adjust each node's parameters $W_n, B_n$ after seeing example $i$. The $\mathrm{Avg}_n$ notation refers to the final averaging of these local parameters across all nodes to obtain the global model parameters and complete training.
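As a toy illustration of the final averaging step (not H2O code), the $\mathrm{Avg}_n$ operation is an elementwise mean over the nodes' locally trained parameters:

# Elementwise average of each node's locally trained weight vector
average_models <- function(node_weights) {
  Reduce(`+`, node_weights) / length(node_weights)
}

average_models(list(c(0.2, -0.1), c(0.4, 0.3)))  # returns c(0.3, 0.1)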
5.2.4 Specifying the number of training samples per iteration
H2O Deep Learning is scalable and can take advantage of large clusters of compute nodes. There are three operating modes. The default behavior is to let every node train on the entire (replicated) dataset, but automatically shuffle (and/or use a subset of) the training examples for each iteration locally. For datasets that don't fit into each node's memory (depending on the amount of heap memory specified by the -Xmx Java option), it might not be possible to replicate the data, and each compute node can be instructed to train only with local data. An experimental single-node mode is available for cases where final convergence is slow due to the presence of too many nodes, but this has not been necessary in our testing.

The number of training examples globally presented to the distributed SGD worker nodes between model averaging is defined by the parameter train_samples_per_iteration. If the specified value is -1, all nodes process all their local training data per iteration. If replicate_training_data is enabled, which is the default setting, this will result in training N epochs (passes over the data) per iteration on N nodes; otherwise, one epoch will be trained per iteration. Another special value is 0, which always results in one epoch per iteration, regardless of the number of compute nodes. In general, any user-specified positive number is permissible for this parameter. For large datasets, we recommend specifying a fraction of the dataset.

For example, if the training data contains 10 million rows, and we specify the number of training samples per iteration as 100,000 when running on four nodes, then each node will process 25,000 examples per iteration, and it will take 40 distributed iterations to process one epoch. If the value is too high, it might take too long between synchronizations and model convergence may be slow. If the value is too low, network communication overhead will dominate the runtime and computational performance will suffer. The default value of -2 enables auto-tuning for this parameter, based on the computational performance of the processors and the network of the system, and attempts to find a good balance between computation and communication. This parameter can affect the convergence rate during training. An illustrative configuration follows.
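The scenario above could be configured roughly as follows; predictors, response, and train are placeholders for your column specification and training frame:

# Sketch: 100,000 globally presented samples between model averaging steps
model <- h2o.deeplearning(
  x = predictors,                         # predictor columns (placeholder)
  y = response,                           # response column (placeholder)
  training_frame = train,                 # e.g., a 10-million-row H2OFrame
  train_samples_per_iteration = 100000,
  epochs = 10)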
5.3 Regularization
5.4 Advanced optimization
H2O features manual and automatic versions of advanced optimization. The manual mode features include
momentum training and learning rate annealing, while automatic mode features an adaptive learning rate.
5.4.1 Momentum training
Momentum modifies back-propagation by allowing prior iterations to influence the current update. In particular, a velocity vector $v$ is defined to modify the updates as follows, with $\theta$ representing the parameters $W, B$; $\mu$ representing the momentum coefficient; and $\alpha$ denoting the learning rate:

$v_{t+1} = \mu v_t - \alpha \nabla L(\theta_t)$
$\theta_{t+1} = \theta_t + v_{t+1}$
Using the momentum parameter can aid in avoiding local minima and the associated instability (Sutskever et al, 2014). Too much momentum can lead to instabilities, which is why it is best to ramp up the momentum slowly. The parameters that control momentum are momentum_start, momentum_ramp, and momentum_stable.

The Nesterov accelerated gradient method, triggered by the nesterov_accelerated_gradient parameter, is a recommended improvement when using momentum updates. Using this method, the updates are further modified such that

$v_{t+1} = \mu v_t - \alpha \nabla L(\theta_t + \mu v_t)$
$W_{t+1} = W_t + v_{t+1}$
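As a toy illustration (not H2O's implementation), one plain momentum step can be written in R; grad is any function returning the gradient at the current parameters:

# One momentum update: velocity accumulates a decayed history of gradients
momentum_step <- function(theta, v, grad, mu = 0.9, alpha = 0.005) {
  v_new <- mu * v - alpha * grad(theta)    # velocity update
  list(theta = theta + v_new, v = v_new)   # parameter update
}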
5.4.2 Rate annealing
Throughout training, as the model approaches a minimum, the chance of oscillation or "optimum skipping" creates the need for a slower learning rate. Instead of specifying a constant learning rate $\alpha$, learning rate annealing gradually reduces the learning rate $\alpha_t$ to "freeze" into local minima in the optimization landscape (Zeiler, 2012).

For H2O, the annealing rate (rate_annealing) is the inverse of the number of training samples it takes to cut the learning rate in half (e.g., 1e-6 means that it takes 1e6 training samples to halve the learning rate).
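A quick check of this relationship in R (the values are illustrative):

# Effective learning rate after n_samples training samples under annealing
annealed_rate <- function(rate, rate_annealing, n_samples) {
  rate / (1 + rate_annealing * n_samples)
}

annealed_rate(0.005, 1e-6, 1e6)  # returns 0.0025: half the initial rate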
5.4.3 Adaptive learning
The implemented adaptive learning rate algorithm, ADADELTA (Zeiler, 2012), automatically combines the benefits of learning rate annealing and momentum training to avoid slow convergence. Specifying only two parameters ($\rho$ and $\epsilon$) simplifies hyperparameter search. In some cases, a manually controlled (non-adaptive) learning rate and momentum specification can lead to better results, but requires a hyperparameter search over up to seven parameters. If the model is built on a topology with many local minima or long plateaus, a constant learning rate may produce sub-optimal results. In general, however, we find that the adaptive learning rate produces the best results, so this option is the default.

The first of the two hyperparameters for adaptive learning is $\rho$ (rho). It is similar to momentum and relates to the memory of prior weight updates. Typical values are between 0.9 and 0.999. The second hyperparameter, $\epsilon$ (epsilon), is similar to learning rate annealing during initial training and to momentum at later stages, where it allows forward progress. Typical values are between 1e-10 and 1e-4.
5.5 Loading data
Loading a dataset in R or Python for use with H2O is slightly different from the usual methodology: we must convert our datasets into H2OFrame objects (distributed data frames), rather than using an R data.frame or data.table, or a Python pandas.DataFrame or numpy.array.
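For example, in R:

library(h2o)
h2o.init(nthreads = -1)

# Convert an in-memory R data.frame into a distributed H2OFrame
iris_hf <- as.h2o(iris)

# Alternatively, import a file directly into the H2O cluster
# (the path below is a placeholder)
# df <- h2o.importFile("path/to/data.csv")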
5.5.1 Data standardization/normalization
Along with categorical encoding, H2O's Deep Learning preprocesses the data to be standardized for compatibility with the activation functions (recall Table 1's summary of each activation function's target space). Since the activation function does not generally map into the full spectrum of real numbers, $\mathbb{R}$, we first standardize our data to be drawn from $\mathcal{N}(0, 1)$. Standardizing again after network propagation allows us to compute more precise errors in this standardized space, rather than in the raw feature space.
For autoencoding, the data is normalized (instead of standardized) to the compact interval $\mathcal{U}(-0.5, 0.5)$ to allow bounded activation functions like tanh to better reconstruct the data.
5.6 Additional parameters
This section provided some background on the various parameter configurations in H2O's Deep Learning architecture. Since there are dozens of possible parameter arguments when creating models, H2O Deep Learning models may seem daunting. However, most parameters do not need to be modified; the default settings are recommended as safe. The majority of the parameters that support (and in some cases, require) experimentation were discussed in the previous sections, and a few more are discussed in the following sections.
There is no default setting for the hidden layer sizes, the number of layers, or the number of epochs. Experimenting with different network topologies and different datasets will build intuition for these parameters, but two general rules of thumb apply. First, choose larger network sizes, as they can perform higher-level feature extraction, and techniques like dropout may train only subsets of the network at once. Second, use more epochs for greater predictive accuracy, but only when the computational cost is affordable. Many example tests can be found in the H2O GitHub repository for pointers on specific values and results for these (and other) parameters.
For a full list of H2O Deep Learning model parameters and default values, see Appendix A.
6 Use case: MNIST digit classification

6.1 MNIST overview
The MNIST database is a well-known academic dataset used to benchmark classification performance. The data consists of 60,000 training images and 10,000 test images, each of which is a standardized 28 × 28 pixel greyscale image of a single handwritten digit. An example of the scanned handwritten digits is shown in Figure 1.
Example in R

library(h2o)
h2o.init(nthreads = -1)

train_file <- "https://h2o-public-test-data.s3.amazonaws.com/bigdata/laptop/mnist/train.csv.gz"
test_file <- "https://h2o-public-test-data.s3.amazonaws.com/bigdata/laptop/mnist/test.csv.gz"

train <- h2o.importFile(train_file)
test <- h2o.importFile(test_file)

Example in Python

import h2o
h2o.init()

train_file = "https://h2o-public-test-data.s3.amazonaws.com/bigdata/laptop/mnist/train.csv.gz"
test_file = "https://h2o-public-test-data.s3.amazonaws.com/bigdata/laptop/mnist/test.csv.gz"

train = h2o.import_file(train_file)
test = h2o.import_file(test_file)
6.2 Performing a trial run

The example below illustrates the relative simplicity underlying most H2O Deep Learning model parameter configurations, thanks to the default settings. We use the first 28² = 784 values of each row to represent the full image and the final value to denote the digit class. Rectified linear activation is popular with image processing and has previously performed well on the MNIST database, and dropout has been known to enhance performance on this dataset as well, so we train our model accordingly.
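The listing below is a minimal sketch of such a trial run. The activation function, hidden-layer sizes, and epoch count mirror values referenced elsewhere in this vignette; the input_dropout_ratio and l1 settings are illustrative choices, not prescriptions.

Example in R

# The response must be a categorical column for classification
train[,785] <- as.factor(train[,785])
test[,785] <- as.factor(test[,785])

# Train the model (a sketch; dropout and l1 values are illustrative)
model <- h2o.deeplearning(
  x = 1:784,                          # pixel columns
  y = 785,                            # digit class column
  training_frame = train,
  validation_frame = test,
  distribution = "multinomial",
  activation = "RectifierWithDropout",
  hidden = c(200, 200, 200),
  input_dropout_ratio = 0.2,
  l1 = 1e-5,
  epochs = 10)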
6.2.1 N-fold cross-validation
When the nfolds argument is specified as a positive integer, N-fold cross-validation will be performed on the training frame, and the cross-validation metrics will be computed and stored as model output. The default for nfolds is 0, for no cross-validation. Optionally, the user can save the cross-validated predicted values (generated during cross-validation) by setting the keep_cross_validation_predictions parameter to true. This enables the user to calculate custom cross-validated performance metrics for their model in R or Python. Advanced users can also specify a fold column that defines the holdout fold for each row. By default, the holdout fold assignment is random, but other schemes such as round-robin assignment via the modulo operator are also supported. An example of generic N-fold cross-validation is given below.
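A minimal sketch, reusing the trial-run parameters; the choice of five folds is illustrative:

Example in R

# 5-fold cross-validation on the training frame
model_cv <- h2o.deeplearning(
  x = 1:784,
  y = 785,
  training_frame = train,
  distribution = "multinomial",
  activation = "RectifierWithDropout",
  hidden = c(200, 200, 200),
  epochs = 10,
  nfolds = 5)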
6.2.2 Extracting and handling the results
We can extract the parameters of our model, examine the scoring process, and make predictions on new data. The h2o.performance function in R returns all pre-computed performance metrics for either the training set or the validation set, or returns the cross-validated metrics for the training set. An equivalent model_performance method is available in Python. Utility functions that return specific metrics, such as MSE or AUC, are also available. Examples are shown below using the previously trained model and model_cv objects.
Example in R

# Training set metrics
h2o.performance(model)

# Validation set metrics
h2o.performance(model, valid = TRUE)

# Cross-validated MSE
h2o.mse(model_cv, xval = TRUE)

Example in Python

# Training and validation set metrics
model.model_performance(train=True)
model.model_performance(valid=True)

# Cross-validated MSE
model_cv.mse(xval=True)
The second command returns the training and validation errors for the model. The training error value is based on the parameter score_training_samples, which specifies the number of randomly sampled training points used for scoring (the default uses 10,000 points). The validation error is based on the parameter score_validation_samples, which configures the same value on the validation set; by default, this is the entire validation set.

In general, choosing a greater number of sampled points leads to a better understanding of the model's performance on your dataset; setting either of these parameters to 0 automatically uses the entire corresponding dataset for scoring. Either way, you can control the minimum and maximum time spent on scoring with the score_interval and score_duty_cycle parameters.

These scoring parameters also affect the final model when the parameter overwrite_with_best_model is enabled. This option selects as the final model the model that achieved the lowest validation error during training (based on the sampled points used for scoring). If a dataset is not specified as the validation set, the training data is used by default; in this case, either the score_training_samples or score_validation_samples parameter will control the error computation during training and, in turn, the selected best model.
Once we have a satisfactory model, as determined by the validation or cross-validation metrics, the
h2o.predict() command can be used to compute and store predictions on new data, which can then
be used for further tasks in the interactive data science process.
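For instance, predictions on the MNIST test set can be computed and inspected along these lines (a sketch):

Example in R

# Classify the test set; the returned H2OFrame holds the predicted
# class plus the per-class probabilities
pred <- h2o.predict(model, newdata = test)

# Take a look at the first few predictions
head(pred)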
6.3 Web interface
H2O R users have access to an intuitive web interface for H2O, Flow, that mirrors the model building process in R. After loading data or training a model in R, point your browser to your IP address and port number (e.g., localhost:54321) to launch the web interface. From here, you can click on Admin > Jobs to view specific details about your model. You can also click on Data > List All Frames to view all current H2O frames.
6.3.1 Variable importances
The variable importances feature can be enabled by setting the variable_importances argument to true. This feature allows us to view the absolute and relative predictive strength of each feature in the prediction task. Each H2O algorithm class has its own methodology for computing variable importance. For H2O's Deep Learning, the Gedeon method (Gedeon, 1997) is used; because it can be slow for large networks, it is turned off by default. If variable importance is the top priority in your analysis, you may (also) consider training a Random Forest and inspecting the variable importances generated with that method.
The following code demonstrates training using the variable importances option enabled and how to
extract the variable importances from the trained model. From the web UI, you can also view a visualization
of the variable importances.
Example in R

# Train a model with the variable_importances option enabled
# (a sketch; parameters other than the two preserved below mirror the trial run)
model_vi <- h2o.deeplearning(
  x = 1:784,
  y = 785,
  training_frame = train,
  distribution = "multinomial",
  activation = "RectifierWithDropout",
  hidden = c(200, 200, 200),
  epochs = 10,
  variable_importances = TRUE)

# Extract the variable importances from the trained model
h2o.varimp(model_vi)
6.3.2 Java model
Another important feature of the web interface is the Java (POJO) model, accessible from the Preview
POJO button at the bottom of the model results. This button allows access to Java code that builds the
model when called from a main method in a Java program. Instructions for downloading and running this
Java code are available from the web interface, and example production scoring code is available as well.
6.4 Grid search for model comparison
H2O supports grid search capabilities for model tuning by allowing users to tweak certain parameters and
observe changes in model behavior. This is done by specifying sets of values for parameter arguments. At
the time of publication, the Python grid search API is still undergoing development, so we demonstrate an
R example below.
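A minimal sketch of such a grid search; the three topologies and two l1 values shown are illustrative:

Example in R

# Hyperparameter sets: three network topologies and two l1 penalties
hidden_opt <- list(c(200, 200), c(100, 300, 100), c(500, 500, 500))
l1_opt <- c(1e-5, 1e-7)
hyper_params <- list(hidden = hidden_opt, l1 = l1_opt)

# Train one model per combination of the parameters above
model_grid <- h2o.grid("deeplearning",
                       hyper_params = hyper_params,
                       x = 1:784,
                       y = 785,
                       distribution = "multinomial",
                       training_frame = train,
                       validation_frame = test)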
In this example, we specified three different network topologies and two different ℓ1 norm weights. This grid search trains six different models using all possible combinations of these parameters; other parameter combinations can be specified for a larger space of models. Inspecting and comparing the trained models after the grid search completes provides more subtle insight into the model tuning and selection process. To learn how and when to select different parameter configurations in a grid search, refer to Appendix A for parameter descriptions and configurable values.
Example in R

# print out all prediction errors and run times of the models
model_grid

# (sketch) retrieve each model in the grid and report its validation MSE
for (model_id in model_grid@model_ids) {
  mse <- h2o.mse(h2o.getModel(model_id), valid = TRUE)
  print(sprintf("Validation set MSE: %f", mse))
}
6.5 Checkpoint model
To resume model training, use checkpoint model keys (model_id) to incrementally train a particular model with more iterations, more data, different data, and so forth. To train our initial model further, we can use it (or its key) as a checkpoint argument for a new model.

In the R example below, model_grid@model_ids[[1]] represents the highest-performing model from the grid search, which is used for additional training. For checkpoint restarts, the training and validation datasets, as well as the response column, must match. In addition, any non-default model parameters, such as hidden = c(200, 200) in the example below, must match as well. For the Python example, we use the original trial-run model that we trained previously as the checkpoint model.
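A sketch of the checkpoint restart in R; aside from the checkpoint argument, the parameters repeat those of the originating grid model (hidden = c(200, 200), as noted above):

Example in R

# Continue training the best model from the grid search
model_chkp <- h2o.deeplearning(
  x = 1:784,
  y = 785,
  training_frame = train,
  validation_frame = test,
  checkpoint = model_grid@model_ids[[1]],
  distribution = "multinomial",
  activation = "RectifierWithDropout",
  hidden = c(200, 200),
  epochs = 10)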
Example in Python

# Continue training the earlier trial-run model, using it as a checkpoint;
# x and y are the predictor and response specifications used previously
model_chkp = h2o.deeplearning(
    x=x,
    y=y,
    training_frame=train,
    validation_frame=test,
    checkpoint=model,
    distribution="multinomial",
    activation="RectifierWithDropout",
    hidden=[200,200,200],
    epochs=10)
Checkpointing can also be used to reload existing models that were saved to disk in a previous session. For
example, we can save and reload a model by running the following commands.
Example in R

# Specify a model and the file path where it is to be saved
model_path <- h2o.saveModel(object = model,
                            path = "/tmp/mymodel",
                            force = TRUE)

print(model_path)
# /tmp/mymodel/DeepLearning_model_R_1441838096933
Example in Python

# Specify a model and the file path where it is to be saved
model_path = h2o.save_model(model=model, path="/tmp/mymodel", force=True)

print model_path
# /tmp/mymodel/DeepLearning_model_python_1441838096933
After restarting H2O, load the saved model by specifying the host and saved model file path. Note: The
saved model must be reloaded using a compatible H2O version (i.e., the same version used to save the
model).
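A minimal sketch of reloading the model from the path returned by the save step:

Example in R

saved_model <- h2o.loadModel(model_path)

Example in Python

saved_model = h2o.load_model(model_path)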
6.6 Achieving world-record performance
Without distortions, convolutions, or other advanced image processing techniques, the best-ever published test set error for the MNIST dataset is 0.83%, by Microsoft. After training for 2,000 epochs (about 4 hours) on 4 compute nodes, we obtain a 0.87% test set error; after training for 8,000 epochs (about 10 hours) on 10 nodes, we obtain a 0.83% test set error, matching the current world record, notably achieved using a distributed configuration and a simple one-liner from R. Details can be found in our hands-on tutorial. Test set errors of around 1% are typically achieved within 1 hour when running on 1 node.
6.7 Computational performance
There are many parameters that affect the computational performance of H2O Deep Learning, but the default values should result in good performance for most problems. An in-depth study of the computational performance characteristics of H2O Deep Learning, with complete code examples and results, can be found in our blog post, "Definitive Performance Tuning Guide for Deep Learning." The parallel scalability of H2O for the MNIST dataset on 1 to 63 compute nodes is shown in the figure below.
7 Deep Autoencoders

7.1 Nonlinear dimensionality reduction
So far, we have discussed purely supervised deep learning tasks. However, deep learning can also be used for unsupervised feature learning or, more specifically, nonlinear dimensionality reduction (Hinton et al, 2006). Consider a three-layer neural network with one hidden layer. If we treat our input data as labeled with the same input values, then the network is forced to learn the identity via a nonlinear, reduced representation of the original data. This type of algorithm is called a deep autoencoder; these models have been used extensively for unsupervised, layer-wise pre-training of supervised deep learning tasks, but here we discuss the autoencoder's ability to discover anomalies in data.
7.2 Use case: anomaly detection
Consider the deep autoencoder model described above. Given enough training data that resembles some underlying pattern, the network will train itself to easily learn the identity when confronted with that pattern. However, if an anomalous test point that does not match the learned pattern arrives, the autoencoder will likely have a high error reconstructing it, which indicates anomalous data.

We use this framework to develop an anomaly detection demonstration using a deep autoencoder. The dataset is an ECG time series of heartbeats, and the goal is to determine which heartbeats are outliers. The training data (20 "good" heartbeats) and the test data (the training data with 3 "bad" heartbeats appended, for simplicity) can be downloaded directly into the H2O cluster, as shown below. Each row represents a single heartbeat. The autoencoder is trained as follows:
Example in R

# Download and import ECG train and test data into the H2O cluster
train_ecg <- h2o.importFile(path = "http://h2o-public-test-data.s3.amazonaws.com/smalldata/anomaly/ecg_discord_train.csv",
                            header = FALSE,
                            sep = ",")
test_ecg <- h2o.importFile(path = "http://h2o-public-test-data.s3.amazonaws.com/smalldata/anomaly/ecg_discord_test.csv",
                           header = FALSE,
                           sep = ",")

# Train a deep autoencoder on the "good" heartbeats
anomaly_model <- h2o.deeplearning(
  x = names(train_ecg),
  training_frame = train_ecg,
  activation = "Tanh",
  autoencoder = TRUE,
  hidden = c(50, 20, 50),
  l1 = 1e-4,
  epochs = 100)

# Compute the per-row reconstruction error
# (MSE between the output and input layers)
recon_error <- h2o.anomaly(anomaly_model, test_ecg)
Example in Python

# Download and import ECG train and test data into the H2O cluster
train_ecg = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/anomaly/ecg_discord_train.csv")
test_ecg = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/anomaly/ecg_discord_test.csv")

# Train a deep autoencoder on the "good" heartbeats
anomaly_model = h2o.deeplearning(
    x=train_ecg.names,
    training_frame=train_ecg,
    activation="Tanh",
    autoencoder=True,
    hidden=[50,20,50],
    l1=1e-4,
    epochs=100)

# Compute the per-row reconstruction error
# (MSE between the output and input layers)
recon_error = anomaly_model.anomaly(test_ecg)
Appendix A: Complete Parameter List

epsilon: Similar to learning rate annealing during initial training and to momentum at later stages, where it allows forward progress. Only active if adaptive_rate is enabled. Default is 1e-8. Refer to the Adaptive learning section for more details.
rate: The learning rate, α. Higher values lead to less stable models, while lower values lead to slower convergence. Default is 0.005.
rate_annealing: The annealed learning rate is calculated as rate / (1 + rate_annealing * N), where N is the number of training samples. It is the inverse of the number of training samples required to cut the learning rate in half. Annealing reduces the learning rate to "freeze" into local minima in the optimization landscape. Default value is 1e-6 (when adaptive learning is disabled). Refer to the Rate annealing section for more details.
rate_decay: Learning rate decay factor between layers (for the L-th layer: rate * rate_decay^(L-1)); default is 1.0 (when adaptive learning is disabled). The learning rate decay parameter controls the change of learning rate across layers.
momentum_start: Controls the amount of momentum at the beginning of training (when adaptive learning is disabled). Default is 0. Refer to the Momentum training section for more details.

momentum_ramp: Controls the amount of learning for which momentum increases (assuming momentum_stable is larger than momentum_start). The ramp is measured in the number of training samples and can be enabled when adaptive learning is disabled. Default is 1 million (1e6). Refer to the Momentum training section for more details.

momentum_stable: Controls the final momentum value reached after momentum_ramp training samples (when adaptive learning is disabled). The momentum used for training remains the same beyond that point. Default is 0. Refer to the Momentum training section for more details.
nesterov_accelerated_gradient: Logical. The Nesterov accelerated gradient descent method is a modification to traditional gradient descent for convex functions. The method relies on gradient information at various points to build a polynomial approximation that minimizes the residuals in fewer iterations of the descent. This parameter is only active if the adaptive learning rate is disabled; in that case, the default is true. Refer to the Momentum training section for more details.
input_dropout_ratio: The fraction of the features for each training row to be omitted from training in order to improve generalization. The default is 0 (always use all features). Refer to the Regularization section for more details.

hidden_dropout_ratios: The fraction of the inputs for each hidden layer to be omitted from training in order to improve generalization. The default is 0.5 for each hidden layer. Refer to the Regularization section for more details.
l1: L1 regularization, which constrains the absolute value of the weights (can add stability and
improve generalization, causes many weights to become 0). The default is 0, for no L1 regularization.
Refer to the Regularization section for more details.
l2: L2 regularization, which constrains the sum of the squared weights (can add stability and improve
generalization, causes many weights to be small). The default is 0, for no L2 regularization. Refer to
the Regularization section for more details.
max_w2: A maximum on the sum of the squared incoming weights into any one neuron. This tuning parameter is especially useful for unbounded activation functions such as Maxout or Rectifier. The default, positive infinity, leaves this maximum unbounded.
initial_weight_distribution: The distribution from which initial weights are drawn. Select Uniform, UniformAdaptive, or Normal. Default is UniformAdaptive. Refer to the Initialization section for more details.
initial_weight_scale: The scale of the distribution function for Uniform or Normal distributions. For Uniform, the values are drawn uniformly from (-initial_weight_scale, initial_weight_scale). For Normal, the values are drawn from a Normal distribution with a standard deviation of initial_weight_scale. The default is 1. Refer to the Initialization section for more details.
loss: Specify one of the loss options: Automatic, CrossEntropy (for classification only),
MeanSquare, Absolute, or Huber. Refer to the Activation and loss functions section for
more details.
distribution: Specify the distribution function of the response: AUTO, bernoulli, multinomial,
poisson, gamma, tweedie, laplace, huber, or gaussian.
tweedie_power: Specify the Tweedie power; applicable only if distribution is set to tweedie. The value must be between 1.0 and 2.0.
score_interval: The minimum time (in seconds) to elapse between model scoring. The actual interval is determined by the number of training samples per iteration and the scoring duty cycle. To use all training set samples, specify 0. Default is 5.
score_training_samples: The number of training samples to be used for scoring. These samples will be randomly sampled. Use 0 to select the entire training dataset. Default is 10000.

score_validation_samples: The number of validation dataset points to be used for scoring. Can be randomly sampled or stratified (if balance_classes is set and score_validation_sampling is set to stratify). Use 0 to select the entire validation dataset (default).
score_duty_cycle: Maximum duty cycle fraction spent on model scoring (on both training and validation samples) and on diagnostics such as computation of feature importances (i.e., not on training). Lower values result in more training, while higher values produce more scoring. Default is 0.1.
classification_stop: The stopping criterion for classification error (1 - accuracy) on the training data scoring dataset. When the error is at or below this threshold, the training process stops. Default is 0. To disable, enter -1.

regression_stop: The stopping criterion for regression error (MSE) on the training data scoring dataset. When the error is at or below this threshold, the training process stops. Default is 1e-6. To disable, enter -1.
quiet_mode: Logical. Enable quiet mode for less output to standard output. Default is false.
max_confusion_matrix_size: For classification models, the maximum size (in terms of classes) of the confusion matrix to display. This option is meant to avoid printing extremely large confusion matrices. Default is 20.

max_hit_ratio_k: The maximum number (top K) of predictions to use for hit-ratio computation (for multi-class only; enter 0 to disable). Default is 10.
balance_classes: Logical. For imbalanced data, the training data class counts can be artificially balanced by over-sampling the minority class(es) and under-sampling the majority class(es) so that each class contains the same number of observations. This can result in improved predictive accuracy. Over-sampling is done with replacement (rather than simulating new observations), and the total number of observations after balancing is controlled by the max_after_balance_size parameter. Default is false.

class_sampling_factors: Desired over/under-sampling ratios per class (in lexicographic order). Only applies to classification when balance_classes is enabled. If not specified, the ratios will be automatically computed to obtain class balance during training.

max_after_balance_size: When classes are balanced, limit the resulting dataset size to the specified multiple of the original dataset size. This is the maximum relative size of the training data after balancing class counts (can be less than 1.0). Default is 5.

score_validation_sampling: Method used to sample the validation dataset for scoring. The possible methods are Uniform and Stratified. Default is Uniform.
diagnostics: (Deprecated) Logical. Gather diagnostics for hidden layers, such as mean and root
mean squared (RMS) values of learning rate, momentum, weights and biases. Since deprecation,
diagnostics are always enabled (set to true).
variable_importances: Logical. Compute variable importances for input features using the Gedeon method. The implementation considers the weights connecting the input features to the first two hidden layers. Default is false, since this can be slow for large networks.
fast_mode: Logical. Enable fast mode (a minor approximation in back-propagation). This should not affect results significantly. Default is true.

ignore_const_cols: Logical. Ignore constant training columns (no information can be gained anyway). Default is true.

force_load_balance: Logical. Force extra load balancing to increase training speed for small datasets and keep all cores busy. Default is true.

replicate_training_data: Logical. Replicate the entire training dataset onto every node for faster training on small datasets. Default is true.
single_node_mode: Logical. Run on a single node for fine-tuning of model parameters. Can be useful for faster convergence during checkpoint resumes after training on a very large number of nodes (for fast initial convergence). Default is false.

shuffle_training_data: Logical. Enable shuffling of training data (on each node). This option is recommended if training data is replicated on N nodes and the number of training samples per iteration is close to N times the dataset size, where all nodes train with (almost) all of the data. It is automatically enabled if the number of training samples per iteration is set to -1 (or to N times the dataset size or larger). Default is false.
sparse: (Deprecated) Logical. Enable sparse data handling. Default is false.

col_major: (Deprecated) Logical. Use a column-major weight matrix for the input layer; can speed up forward propagation, but may slow down backpropagation. Default is false.

average_activation: Specify the average activation for the sparse autoencoder (Experimental). Default is 0.

sparsity_beta: Specify the sparsity-based regularization optimization (Experimental). Default is 0.

max_categorical_features: Maximum number of categorical features allowed in a column, enforced via hashing (Experimental). Default is 2^31 - 1 (Integer.MAX_VALUE in Java).
reproducible: Logical. Force reproducibility on small data (will be slow; only uses one thread). Default is false.

export_weights_and_biases: Logical. Specify whether to export the neural network weights and biases as an H2OFrame. Default is false.
offset_column: Specify the offset column by column name. Regression only. Offsets are per-row bias values that are used during model training. For Gaussian distributions, they can be seen as simple corrections to the response (y) column. Instead of learning to predict the response value directly, the model learns to predict the (row) offset of the response column. For other distributions, the offset corrections are applied in the linearized space before applying the inverse link function to get the actual response values.

weights_column: Specify the weights column by column name. Weights are per-row observation weights. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor.
nfolds: (Optional) Number of folds for cross-validation. Default is 0 (no cross-validation is
performed).
fold_column: (Optional) The name of the column with the cross-validation fold index assignment per observation; the folds are supplied by the user.
fold_assignment: Cross-validation fold assignment scheme, if nfolds is greater than zero and fold_column is not specified. Options are AUTO, Random, or Modulo.

keep_cross_validation_predictions: Logical. Specify whether to keep the predictions of the cross-validation models. Default is false.
Appendix D: References