arXiv:1403.7550v3 [cs.DB] 7 Jul 2014
DimmWitted: A Study of Main-Memory Statistical Analytics
Ce Zhang†‡
Christopher Ré
Stanford University
University of Wisconsin-Madison
{czhang, chrismre}@cs.stanford.edu
ABSTRACT
We perform the first study of the tradeoff space of access
methods and replication to support statistical analytics us-
ing first-order methods executed in the main memory of a
Non-Uniform Memory Access (NUMA) machine. Statistical
analytics systems differ from conventional SQL-analytics in
the amount and types of memory incoherence that they can
tolerate. Our goal is to understand tradeoffs in accessing the
data in row- or column-order and at what granularity one
should share the model and data for a statistical task. We
study this new tradeoff space and discover that there are
tradeoffs between hardware and statistical efficiency. We
argue that our tradeoff study may provide valuable infor-
mation for designers of analytics engines: for each system
we consider, our prototype engine can run at least one pop-
ular task at least 100× faster. We conduct our study across
five architectures using popular models, including SVMs, lo-
gistic regression, Gibbs sampling, and neural networks.
1. INTRODUCTION
Statistical data analytics is one of the hottest topics in
data-management research and practice. Today, even small
organizations have access to machines with large main memories,
whether rented via Amazon's EC2 or purchased at roughly $5/GB. As a
result, there has been a flurry of activity to support main-
memory analytics in both industry (Google Brain, Impala,
and Pivotal) and research (GraphLab, and MLlib). Each
of these systems picks one design point in a larger tradeoff
space. The goal of this paper is to define and explore this
space. We find that today’s research and industrial systems
under-utilize modern commodity hardware for analytics—
sometimes by two orders of magnitude. We hope that our
study identifies some useful design points for the next gen-
eration of such main-memory analytics systems.
Throughout, we use the term statistical analytics to refer
to those tasks that can be solved by first-order methods–a
class of iterative algorithms that use gradient information;
these methods are the core algorithm in systems such as ML-
lib, GraphLab, and Google Brain. Our study examines an-
alytics on commodity multi-socket, multi-core, non-uniform
memory access (NUMA) machines, which are the de facto
standard machine configuration and thus a natural target for
an in-depth study. Moreover, our experience with several
enterprise companies suggests that, after appropriate pre-
processing, a large class of enterprise analytics problems fit
into the main memory of a single, modern machine. While
this architecture has been recently studied for traditional
SQL-analytics systems [16], it has not been studied for sta-
tistical analytics systems.
Statistical analytics systems are different from traditional
SQL-analytics systems. In comparison to traditional SQL-
analytics, the underlying methods are intrinsically robust to
error. On the other hand, traditional statistical theory does
not consider which operations can be efficiently executed.
This leads to a fundamental tradeoff between statistical ef-
ficiency (how many steps are needed until convergence to
a given tolerance) and hardware efficiency (how efficiently
those steps can be carried out).
To describe such tradeoffs more precisely, we describe the
setup of the analytics tasks that we consider in this paper.
The input data is a matrix in RN×d and the goal is to find
a vector x ∈ Rd that minimizes some (convex) loss function,
say the logistic loss or the hinge loss (SVM). Typically, one
makes several complete passes over the data while updating
the model; we call each such pass an epoch. There may be
some communication at the end of the epoch, e.g., in bulk-
synchronous parallel systems such as Spark. We identify
three tradeoffs that have not been explored in the literature:
(1) access methods for the data, (2) model replication, and
(3) data replication. Current systems have picked one point
in this space; we explain each tradeoff and discover points
that have not been previously considered. Using these new
points, we can perform 100× faster than previously explored
points in the tradeoff space for several popular tasks.
Access Methods. Analytics systems access (and store) data
in either row-major or column-major order. For example,
systems that use stochastic gradient descent methods (SGD)
access the data row-wise; examples include MADlib [23]
in Impala and Pivotal, Google Brain [29], and MLlib in
Spark [47]; and stochastic coordinate descent methods (SCD)
access the data column-wise; examples include GraphLab [34],
Shogun [46], and Thetis [48]. These methods have essen-
tially identical statistical efficiency, but their wall-clock per-
formance can be radically different due to hardware efficiency.
[Figure 1 content: (a) DimmWitted's memory model, with data A (an N x d matrix) and model x; (b) pseudocode for SGD: given a data matrix A, stepsize α, and gradient function grad, initialize the model m and, while not converged, for each row z in A update m <- m - α·grad(z, m), then test convergence; (c) the three access methods (row-wise, column-wise, column-to-row) with their read sets of data and write sets of the model; (d) a NUMA architecture with two nodes, each with cores, per-core L1/L2 caches, a shared L3 cache, and local RAM.]
Figure 1: Illustration of (a) DimmWitted’s Memory Model, (b) Pseudocode for SGD, (c) Different Statistical
Methods in DimmWitted and Their Access Patterns, and (d) NUMA Architecture.
However, this tradeoff has not been systematically
studied. To study this tradeoff, we introduce a storage ab-
straction that captures the access patterns of popular statis-
tical analytics tasks and a prototype called DimmWitted.
In particular, we identify three access methods that are used
in popular analytics tasks, including standard supervised
machine learning models such as SVMs, logistic regression,
and least squares; and more advanced methods such as neu-
ral networks and Gibbs sampling on factor graphs. For dif-
ferent access methods for the same problem, we find that the
time to converge to a given loss can differ by up to 100×.
We also find that no access method dominates all oth-
ers, so an engine designer may want to include both access
methods. To show that it may be possible to support both
methods in a single engine, we develop a simple cost model
to choose among these access methods. We describe a sim-
ple cost model that selects a nearly optimal point in our
data sets, models, and different machine configurations.
Data and Model Replication. We study two sets of trade-
offs: the level of granularity, and the mechanism by which
mutable state and immutable data are shared in analytics
tasks. We describe the tradeoffs we explore in both (1) mu-
table state sharing, which we informally call model replica-
tion, and (2) data replication.
(1) Model Replication. During execution, there is some
state that the task mutates (typically an update to the
model). We call this state, which may be shared among
one or more processors, a model replica. We consider three
different granularities at which to share model replicas:
• The PerCore approach treats a NUMA machine as a
distributed system in which every core is treated as an
individual machine, e.g., in bulk-synchronous models
such as MLlib on Spark or event-driven systems such as
GraphLab. These approaches are the classical shared-
nothing and event-driven architectures, respectively.
In PerCore, the part of the model that is updated by
each core is only visible to that core until the end of
an epoch. This method is efficient and scalable from
a hardware perspective, but it is less statistically ef-
ficient, as there is only coarse-grained communication
between cores.
• The PerMachine approach acts as if each processor has
uniform access to memory. This approach is taken in
Hogwild! and Google Downpour [19]. In this method,
the hardware takes care of the coherence of the shared
state. The PerMachine method is statistically efficient
due to high communication rates, but it may cause
contention in the hardware, which may lead to subop-
timal running times.
• A natural hybrid is PerNode; this method uses the
fact that PerCore communication through the last-level
cache (LLC) is dramatically faster than communica-
tion through remote main memory. This method is
novel; for some models, PerNode can be an order of
magnitude faster.
Because model replicas are mutable, a key question is how
often should we synchronize model replicas? We find that it
is beneficial to synchronize the models as much as possible—
so long as we do not impede throughput to data in main
memory. A natural idea, then, is to use PerMachine sharing,
in which the hardware is responsible for synchronizing the
replicas. However, this decision can be suboptimal, as the
cache-coherence protocol may stall a processor to preserve
coherence, but this information may not be worth the cost
of a stall from a statistical efficiency perspective. We find
that the PerNode method, coupled with a simple technique
to batch writes across sockets, can dramatically reduce com-
munication and processor stalls. The PerNode method can
result in an over 10× runtime improvement. This technique
depends on the fact that we do not need to maintain the
model consistently: we are effectively delaying some updates
to reduce the total number of updates across sockets (which
lead to processor stalls).
(2) Data Replication. The data for analytics is immutable,
so there are no synchronization issues for data replication.
The classical approach is to partition the data to take ad-
vantage of higher aggregate memory bandwidth. However,
each partition may contain skewed data, which may slow
convergence. Thus, an alternate approach is to replicate the
data fully (say, per NUMA node). In this approach, each
node accesses that node’s data in a different order, which
means that the replicas provide non-redundant statistical
information; in turn, this reduces the variance of the esti-
mates based on the data in each replica. We find that for
some tasks, fully replicating the data four ways can converge
to the same loss almost 4× faster than the sharding strategy.
Summary of Contributions. We are the first to study the
three tradeoffs listed above for main-memory statistical an-
alytics systems. These tradeoffs are not intended to be an
exhaustive set of optimizations, but they demonstrate our
main conceptual point: treating NUMA machines as dis-
tributed systems or SMP is suboptimal for statistical ana-
lytics. We design a storage manager, DimmWitted, that
shows it is possible to exploit these ideas on real data sets.
Finally, we evaluate our techniques on multiple real datasets,
models, and architectures.
2. BACKGROUND
In this section, we describe the memory model for
DimmWitted, which provides a unified memory model to
implement popular analytics methods. Then, we recall some
basic properties of modern NUMA architectures.
Data for Analytics. The data for an analytics task is a
pair (A, x), which we call the data and the model, respec-
tively. For concreteness, we consider a matrix A ∈ RN×d.
In machine learning parlance, each row is called an exam-
ple. Thus, N is often the number of examples and d is often
called the dimension of the model. There is also a model,
typically a vector x ∈ Rd. The distinction is that the data
A is read-only, while the model vector, x, will be updated
during execution. From the perspective of this paper, the
important distinction we make is that data is an immutable
matrix, while the model (or portions of it) are mutable data.
First-Order Methods for Analytic Algorithms. DimmWit-
ted considers a class of popular algorithms called first-order
methods. Such algorithms make several passes over the data;
we refer to each such pass as an epoch. A popular exam-
ple algorithm is stochastic gradient descent (SGD), which
is widely used by web-companies, e.g., Google Brain [29]
and VowPal Wabbit [1], and in enterprise systems such as
Pivotal, Oracle, and Impala. Pseudocode for this method
is shown in Figure 1(b). During each epoch, SGD reads a
single example z; it uses the current value of the model and
z to estimate the derivative; and it then updates the model
vector with this estimate. It reads each example in this loop.
After each epoch, these methods test convergence (usually
by computing or estimating the norm of the gradient); this
computation requires a scan over the complete dataset.
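To make this loop concrete, the following C++ sketch runs one SGD epoch in the sense used above; the dense row-major layout and the grad callback are illustrative stand-ins, not DimmWitted's actual interface (which is described in Section 3.1).

    #include <cstddef>
    #include <vector>

    // One SGD epoch over a row-major data matrix A (N rows, d columns) with
    // stepsize alpha. grad(row, model, g) is assumed to write the gradient of
    // the loss on one example into g; layout and interface are illustrative.
    void sgd_epoch(const std::vector<std::vector<double>>& A,
                   std::vector<double>& model, double alpha,
                   void (*grad)(const std::vector<double>& row,
                                const std::vector<double>& model,
                                std::vector<double>& g)) {
      std::vector<double> g(model.size(), 0.0);
      for (const auto& row : A) {                  // read one example z
        grad(row, model, g);                       // gradient estimate at the current model
        for (std::size_t j = 0; j < model.size(); ++j)
          model[j] -= alpha * g[j];                // update the model with this estimate
      }
      // After the epoch, convergence is typically tested by scanning the data
      // to compute or estimate the norm of the gradient (or the loss).
    }

The rest of this paper is concerned with where the data A and the model live and how this loop shares and traverses them.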
2.1 Memory Models for Analytics
We design DimmWitted’s memory model to capture the
trend in recent high-performance sampling and statistical
methods. There are two aspects to this memory model: the
coherence level and the storage layout.
Coherence Level. Classically, memory systems are coher-
ent: reads and writes are executed atomically. For analytics
systems, we say that a memory model is coherent if reads
and writes of the entire model vector are atomic. That is,
access to the model is enforced by a critical section. How-
ever, many modern analytics algorithms are designed for an
incoherent memory model. The Hogwild! method showed
that one can run such a method in parallel without locking
but still provably converge. The Hogwild! memory model
relies on the fact that writes of individual components are
atomic, but it does not require that the entire vector be
updated atomically. However, atomicity at the level of the
cacheline is provided by essentially all modern processors.
Empirically, these results allow one to forgo costly locking
(and coherence) protocols. Similar algorithms have been
Algorithm                       Access Method                Implementation
Stochastic Gradient Descent     Row-wise                     MADlib, Spark, Hogwild!
Stochastic Coordinate Descent   Column-wise, Column-to-row   GraphLab, Shogun, Thetis
Figure 2: Algorithms and Their Access Methods.
proposed for other popular methods, including Gibbs sam-
pling [25,45], stochastic coordinate descent (SCD) [42,46],
and linear systems solvers [48]. This technique was applied
by Dean et al. [19] to solve convex optimization problems
with billions of elements in a model. This memory model is
distinct from the classical, fully coherent database execution.
The DimmWitted prototype allows us to specify that a
region of memory is coherent or not. This region of memory
may be shared by one or more processors. If the memory
is only shared per thread, then we can simulate a shared-
nothing execution. If the memory is shared per machine, we
can simulate Hogwild!.
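The distinction between the two coherence levels can be made concrete with a small sketch; the types and names below are illustrative, and the Hogwild!-style variant uses relaxed atomics to keep the example well-defined C++, whereas real implementations often rely directly on the hardware's atomicity of aligned per-component writes.

    #include <atomic>
    #include <cstddef>
    #include <mutex>
    #include <utility>
    #include <vector>

    // Coherent execution: access to the whole model vector is a critical section.
    void coherent_update(std::vector<double>& model, std::mutex& model_lock,
                         const std::vector<std::pair<std::size_t, double>>& delta) {
      std::lock_guard<std::mutex> guard(model_lock);
      for (const auto& [j, v] : delta) model[j] += v;
    }

    // Hogwild!-style incoherent execution: no lock is taken. Each component is
    // written atomically on its own, but other threads may observe the model in
    // a partially updated state, which these algorithms tolerate.
    void hogwild_update(std::vector<std::atomic<double>>& model,
                        const std::vector<std::pair<std::size_t, double>>& delta) {
      for (const auto& [j, v] : delta) {
        double old_val = model[j].load(std::memory_order_relaxed);
        model[j].store(old_val + v, std::memory_order_relaxed);
      }
    }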
Access Methods. We identify three distinct access paths
used by modern analytics systems, which we call row-wise,
column-wise, and column-to-row. They are graphically il-
lustrated in Figure 1(c). Our prototype supports all three
access methods. All of our methods perform several epochs,
that is, passes over the data. However, the algorithm may
iterate over the data row-wise or column-wise.
• In row-wise access, the system scans each row of the
table, applies a function to that row, and then updates the model.
This method may write to all components of the
model. Popular methods that use this access method
include stochastic gradient descent, gradient descent,
and higher-order methods (such as l-BFGS).
• In column-wise access, the system scans each column j
of the table. This method reads just the j component
of the model. The write set of the method is typically
a single component of the model. This method is used
by stochastic coordinate descent.
• In column-to-row access, the system iterates conceptu-
ally over the columns. This method is typically applied
to sparse matrices. When iterating on column j, it
will read all rows in which column j is non-zero. This
method also updates a single component of the model.
This method is used by non-linear support vector ma-
chines in GraphLab and is the de facto approach for
Gibbs sampling.
DimmWitted is free to iterate over rows or columns in es-
sentially any order (although typically some randomness in
the ordering is desired). Figure 2 classifies popular imple-
mentations by their access method.
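As a concrete illustration of the least familiar of the three, the sketch below walks a sparse matrix in column-to-row order; the CSR/CSC field names and the update callback are illustrative assumptions rather than DimmWitted's storage format.

    #include <cstddef>
    #include <vector>

    // Sparse matrix stored both column-major (CSC, to find the rows that touch a
    // column) and row-major (CSR, to read those rows); names are illustrative.
    struct SparseMatrix {
      std::size_t n_rows = 0, n_cols = 0;
      std::vector<std::size_t> csc_ptr, csc_row;   // rows with a non-zero in column j:
                                                   //   csc_row[csc_ptr[j] .. csc_ptr[j+1])
      std::vector<std::size_t> csr_ptr, csr_col;   // non-zeros of row i:
      std::vector<double>      csr_val;            //   indices csr_ptr[i] .. csr_ptr[i+1)
    };

    // One column-to-row epoch: for each column j, visit every row i with a_ij != 0
    // and update only model[j]. update() stands in for the model-specific logic
    // (e.g., a coordinate step for an SVM).
    template <typename UpdateFn>
    void column_to_row_epoch(const SparseMatrix& A, std::vector<double>& model,
                             UpdateFn update) {
      for (std::size_t j = 0; j < A.n_cols; ++j) {
        for (std::size_t k = A.csc_ptr[j]; k < A.csc_ptr[j + 1]; ++k) {
          std::size_t i = A.csc_row[k];    // a row in S(j)
          update(j, i, A, model);          // reads row i via CSR, writes model[j]
        }
      }
    }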
2.2 Architecture of NUMA Machines
We briefly describe the architecture of a modern NUMA
machine. As illustrated in Figure 1(d), a NUMA machine
contains multiple NUMA nodes. Each node has multiple
cores and processor caches, including the L3 cache. Each
Name (abbrv.)   #Node   #Cores/Node   RAM/Node (GB)   CPU Clock (GHz)   LLC (MB)
local2 (l2)     2       6             32              2.6               12
local4 (l4)     4       10            64              2.0               24
local8 (l8)     8       8             128             2.6               24
ec2.1 (e1)      2       8             122             2.6               20
ec2.2 (e2)      2       8             30              2.6               20
(Measured on local2: roughly 6GB/s from a worker to its local RAM and 11GB/s across the QPI.)
Figure 3: Summary of Machines and Memory Band-
width on local2 Tested with STREAM [9].
[Figure 4 diagram: the optimizer maps the data A and the model x onto an execution plan of data replicas, model replicas, and workers; each worker reads from a data replica and updates a model replica.]
Figure 4: Illustration of DimmWitted’s Engine.
node is directly connected to a region of DRAM. NUMA
nodes are connected to each other by buses on the main
board; in our case, this connection is the Intel Quick Path
Interconnects (QPIs), which has a bandwidth as high as
25.6GB/s.1 To access DRAM regions of other NUMA nodes,
data is transferred across NUMA nodes using the QPI. These
NUMA architectures are cache coherent, and the coherency
actions use the QPI. Figure 3 describes the configuration
of each machine that we use in this paper. Machines con-
trolled by us have names with the prefix “local”; the other
machines are Amazon EC2 configurations.
3. THE DIMMWITTED ENGINE
We describe the tradeoff space that DimmWitted’s op-
timizer considers, namely (1) access method selection, (2)
model replication, and (3) data replication. To help un-
derstand the statistical-versus-hardware tradeoff space, we
present some experimental results in a Tradeoffs paragraph
within each subsection. We describe implementation details
for DimmWitted in the full version of this paper.
3.1 System Overview
We describe analytics tasks in DimmWitted and the ex-
ecution model of DimmWitted given an analytics task.
System Input. For each analytics task that we study, we
assume that the user provides data A ∈ RN×d and an initial
model that is a vector of length d. In addition, for each
access method listed above, there is a function of an ap-
propriate type that solves the same underlying model. For
example, we provide both a row- and column-wise way of
solving a support vector machine. Each method takes two
arguments; the first is a pointer to a model.
1www.intel.com/content/www/us/
en/io/quickpath-technology/
quick-path-interconnect-introduction-paper.html
Tradeoff            Strategies         Existing Systems
Access Methods      Row-wise           SP, HW
                    Column-wise        GL
                    Column-to-row
Model Replication   Per Core           GL, SP
                    Per Node
                    Per Machine        HW
Data Replication    Sharding           GL, SP, HW
                    Full Replication
Figure 5: A Summary of DimmWitted’s Tradeoffs
and Existing Systems (GraphLab (GL), Hogwild!
(HW), Spark (SP)).
• frow captures the row-wise access method, and its
second argument is the index of a single row.
• fcol captures the column-wise access method, and its
second argument is the index of a single column.
• fctr captures the column-to-row access method, and
its second argument is a pair of one column index and
a set of row indexes. These rows correspond to the
non-zero entries in a data matrix for a single column.2
Each of these functions modifies, in place, the model to which
it receives a pointer. However, in our study, frow can
modify the whole model, while fcol and fctr only modify a
single variable of the model. We call the above tuple of func-
tions a model specification. Note that a model specification
contains either fcol or fctr but typically not both.
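One plausible way to package such a model specification is sketched below; the types and the f_row/f_col/f_ctr names mirror the functions just described but are otherwise illustrative, not DimmWitted's exact signatures.

    #include <cstddef>
    #include <functional>
    #include <utility>
    #include <vector>

    // A model specification bundles the per-access-method update functions.
    // Each function mutates, in place, the model it receives a pointer to.
    struct ModelSpec {
      // Row-wise: the second argument is the index of a single row; the function
      // may modify the whole model.
      std::function<void(std::vector<double>* model, std::size_t row)> f_row;
      // Column-wise: the second argument is the index of a single column; the
      // function modifies only that component of the model.
      std::function<void(std::vector<double>* model, std::size_t col)> f_col;
      // Column-to-row: the second argument is a pair (j, S(j)), a column index
      // and the rows that are non-zero in that column; modifies only component j.
      std::function<void(std::vector<double>* model,
                         const std::pair<std::size_t, std::vector<std::size_t>>& col_rows)>
          f_ctr;
    };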
Execution. Given a model specification, our goal is to gen-
erate an execution plan. An execution plan, schematically
illustrated in Figure 4, specifies three things for each CPU
core in the machine: (1) a subset of the data matrix to op-
erate on, (2) a replica of the model to update, and (3) the
access method used to update the model. We call the set of
replicas of data and models locality groups, as the replicas
are described physically; i.e., they correspond to regions of
memory that are local to particular NUMA nodes, and one
or more workers may be mapped to each locality group. The
data assigned to distinct locality groups may overlap. We
use DimmWitted’s engine to explore three tradeoffs:
(1) Access methods, in which we can select between
either the row or column method to access the data.
(2) Model replication, in which we choose how to create
and assign replicas of the model to each worker. When
a worker needs to read or write the model, it will read
or write the model replica that it is assigned.
(3) Data replication, in which we choose a subset of data
tuples for each worker. The replicas may be overlap-
ping, disjoint, or some combination.
Figure 5 summarizes the tradeoff space. In each section,
we illustrate the tradeoff along two axes, namely (1) the
statistical efficiency, i.e., the number of epochs it takes to
converge, and (2) hardware efficiency, the time that each
method takes to finish a single epoch.
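For illustration, the per-core portion of an execution plan described above might be represented as follows; the field names are hypothetical, not DimmWitted's actual data structures.

    #include <cstddef>
    #include <vector>

    // A per-core slice of an execution plan (sketch). The optimizer fills in one
    // of these for every core: which data to touch, which model replica to
    // update, and which access method to use.
    enum class AccessMethod { RowWise, ColumnWise, ColumnToRow };

    struct CorePlan {
      int core_id;                            // worker placement
      int numa_node;                          // locality group of this worker
      std::vector<std::size_t> data_indices;  // rows (or columns) assigned to the worker
      double* model_replica;                  // replica the worker reads and writes
      AccessMethod method;                    // traversal used to update the model
    };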
2Define S(j) = {i : aij ≠ 0}. For a column j, the input to
fctr is a pair (j, S(j)).
Algorithm        Read        Write (Dense)   Write (Sparse)
Row-wise         ∑i ni       dN              ∑i ni
Column-wise      ∑i ni       d               d
Column-to-row    ∑i ni²      d               d
Figure 6: Per Epoch Execution Cost of Row- and
Column-wise Access. The Write column is for a single
model replica. Given a dataset A ∈ RN×d, let ni be the
number of non-zero elements of the ith row ai.
[Figure 7 plots: (a) number of epochs to converge for row-wise vs. column-wise access on four models/datasets (SVM1, SVM2, LP1, LP2); (b) time per epoch (seconds) vs. the cost ratio for row-wise and column-wise access.]
Figure 7: Illustration of the Method Selection
Tradeoff. (a) These four datasets are RCV1, Reuters,
Amazon, and Google, respectively. (b) The "cost ratio"
is defined as the ratio of costs estimated for the row-wise and
column-wise methods: (1 + α) ∑i ni / (∑i ni² + αd), where ni
is the number of non-zero elements of the ith row of A and
α is the cost ratio between writes and reads. We set
α = 10 to plot this graph.
3.2 Access Method Selection
In this section, we examine each access method: row-wise,
column-wise, and column-to-row. We find that the execu-
tion time of an access method depends more on hardware
efficiency than on statistical efficiency.
Tradeoffs. We consider the two tradeoffs that we use for
a simple cost model (Figure 6). Let ni be the number of
non-zeros in row i; when we store the data as sparse vec-
tors/matrices in CSR format, the number of reads in a row-
wise access method is ∑N
i=1 ni. Since each example is likely
to be written back in a dense write, we perform dN writes
per epoch. Our cost model combines these two costs lin-
early with a factor α that accounts for writes being more
expensive, on average, because of contention. The factor α
is estimated at installation time by measuring on a small set
of datasets. The parameter α is in 4 to 12 and grows with
the number of sockets; e.g., for local2, α ≈ 4, and for local8,
α ≈ 12. Thus, α may increase in the future.
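Under these assumptions, the per-epoch cost estimates and the resulting choice can be written down directly; this sketch follows the cost ratio given in the caption of Figure 7 and assumes sparse (per-coordinate) writes for the row-wise method.

    #include <cstddef>
    #include <vector>

    // Per-epoch cost estimates used to pick an access method (sketch). n[i] is
    // the number of non-zeros in row i, d is the model dimension, and alpha is
    // the measured write/read cost ratio.
    double row_wise_cost(const std::vector<std::size_t>& n, double alpha) {
      double sum_ni = 0;
      for (std::size_t ni : n) sum_ni += ni;
      return (1.0 + alpha) * sum_ni;                       // reads + alpha * sparse writes
    }

    double column_wise_cost(const std::vector<std::size_t>& n, std::size_t d,
                            double alpha) {
      double sum_ni_sq = 0;
      for (std::size_t ni : n) sum_ni_sq += static_cast<double>(ni) * ni;
      return sum_ni_sq + alpha * static_cast<double>(d);   // reads + alpha * writes
    }

    // Choose the cheaper access method for this dataset and machine.
    bool prefer_row_wise(const std::vector<std::size_t>& n, std::size_t d,
                         double alpha) {
      return row_wise_cost(n, alpha) < column_wise_cost(n, d, alpha);
    }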
Statistical Efficiency. We observe that each access
method has comparable statistical efficiency. To illustrate
this, we run all methods on all of our datasets and report
the number of epochs each method needs to converge to a given
error relative to the optimal loss; Figure 7(a) shows the result
on four datasets with 10% error. We see that the gap in the
number of epochs across different methods is small (always
within 50% of each other).
Hardware Efficiency. Different access methods can
change the time per epoch by up to a factor of 10×, and
there is a cross-over point. To see this, we run both meth-
ods on a series of synthetic datasets where we control the
number of non-zero elements per row by subsampling each
row on the Music dataset (see Section 4 for more details).
For each subsampled dataset, we plot the cost ratio on the
x-axis, and we plot their actual running time per epoch in
Figure 7(b). We see a cross-over point on the time used
per epoch: when the cost ratio is small, row-wise outper-
forms column-wise by 6×, as the column-wise method reads
more data; on the other hand, when the ratio is large, the
column-wise method outperforms the row-wise method by
3×, as the column-wise method has lower write contention.
We observe similar cross-over points on our other datasets.
Cost-based Optimizer. DimmWitted estimates the exe-
cution time of different access methods using the number of
bytes that each method reads and writes in one epoch, as
shown in Figure 6. For writes, it is slightly more complex:
for models such as SVM, each gradient step in row-wise ac-
cess only updates the coordinates where the input vector
contains non-zero elements. We call this scenario a sparse
update; otherwise, it is a dense update.
DimmWitted needs to estimate the ratio of the cost of
reads to writes. To do this, it runs a simple benchmark at
installation time. We find that, for all eight datasets, five statis-
tical models, and five machines that we use in the experi-
ments, the cost model is robust to this parameter: as long
as writes are 4× to 100× more expensive than reading, the
cost model makes the correct decision between row-wise and
column-wise access.
3.3 Model Replication
In DimmWitted, we consider three model replication
strategies. The first two strategies, namely PerCore and
PerMachine, are similar to traditional shared-nothing and
shared-memory architecture, respectively. We also consider
a hybrid strategy, PerNode, designed for NUMA machines.
3.3.1 Granularity of Model Replication
The difference between the three model replication strate-
gies is the granularity of replicating a model. We first de-
scribe PerCore and PerMachine and their relationship with
other existing systems (Figure 5). We then describe PerN-
ode, a simple, novel hybrid strategy that we designed to
leverage the structure of NUMA machines.
PerCore. In the PerCore strategy, each core maintains a mu-
table state, and these states are combined to form a new
version of the model (typically at the end of each epoch).
This is essentially a shared-nothing architecture; it is imple-
mented in Impala, Pivotal, and Hadoop-based frameworks.
PerCore is popularly implemented by state-of-the-art sta-
tistical analytics frameworks such as Bismarck, Spark, and
GraphLab. There are subtle variations to this approach: in
Bismarck’s implementation, each worker processes a parti-
tion of the data, and its model is averaged at the end of
each epoch; Spark implements a minibatch-based approach
in which parallel workers calculate the gradient based on
examples, and then gradients are aggregated by a single
thread to update the final model; GraphLab implements
an event-based approach where each different task is dy-
namically scheduled to satisfy the given consistency require-
ment. In DimmWitted, we implement PerCore in a way
that is similar to Bismarck, where each worker has its own
model replica, and each worker is responsible for updating
[Figure 8 plots: (a) number of epochs to converge vs. error to the optimal loss (1%-100%) for PerCore, PerNode, and PerMachine; (b) time per epoch (seconds) for the three model replication strategies.]
Figure 8: Illustration of Model Replication.
its replica.3
As we will show in the experiment section,
DimmWitted's implementation is 3-100× faster than either
GraphLab or Spark. Both systems have additional sources
of overhead that DimmWitted does not, e.g., for fault tol-
erance in Spark and a distributed environment in both. We
are not making an argument about the relative merits of
these features in applications, only that they would obscure
the tradeoffs that we study in this paper.
PerMachine. In the PerMachine strategy, there is a single
model replica that all workers update during execution. Per-
Machine is implemented in Hogwild! and Google’s Down-
pour. Hogwild! implements a lock-free protocol, which
forces the hardware to deal with coherence. Although differ-
ent writers may overwrite each other and readers may have
dirty reads, Niu et al. [38] prove that Hogwild! converges.
PerNode. The PerNode strategy is a hybrid of PerCore and
PerMachine. In PerNode, each NUMA node has a single
model replica that is shared among all cores on that node.
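On Linux, per-node replicas of this kind can be placed in node-local DRAM with libnuma; the sketch below (link with -lnuma) is a minimal illustration under that assumption, not DimmWitted's actual allocator.

    #include <cstddef>
    #include <cstring>
    #include <numa.h>     // Linux libnuma; link with -lnuma
    #include <vector>

    // Allocate one model replica of dimension d in the local DRAM of each NUMA
    // node (the PerNode strategy). Workers on a node are then pointed at their
    // node's replica.
    std::vector<double*> allocate_per_node_replicas(std::size_t d) {
      std::vector<double*> replicas;
      if (numa_available() < 0) return replicas;          // no NUMA support
      int nodes = numa_num_configured_nodes();
      for (int node = 0; node < nodes; ++node) {
        void* mem = numa_alloc_onnode(d * sizeof(double), node);
        std::memset(mem, 0, d * sizeof(double));
        replicas.push_back(static_cast<double*>(mem));    // free later with numa_free()
      }
      return replicas;
    }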
Model Synchronization. Deciding how often the replicas
synchronize is key to the design. In Hadoop-based and
Bismarck-based models, they synchronize at the end of each
epoch. This is a shared-nothing approach that works well in
user-defined aggregations. However, we consider finer gran-
ularities of sharing. In DimmWitted, we chose to have one
thread that periodically reads models on all other cores, av-
erages their results, and updates each replica.
One key question for model synchronization is how frequently
the model should be synchronized. Intuitively, we
might expect that more frequent synchronization will lower
the throughput; on the other hand, the more frequently we
synchronize, the fewer iterations we might need
to converge. However, in DimmWitted, we find that the
optimal choice is to communicate as frequently as possi-
ble. The intuition is that the QPI has staggering band-
width (25GB/s) compared to the small amount of data we
are shipping (megabytes). As a result, in DimmWitted, we
implement an asynchronous version of the model averaging
protocol: a separate thread averages models, with the effect
of batching many writes together across the cores into one
write, reducing the number of stalls.
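A minimal sketch of such an asynchronous averaging thread is shown below; it assumes PerNode replicas stored as plain vectors and deliberately takes no locks, since the delayed (and possibly torn) reads it performs are exactly the incoherence these methods tolerate.

    #include <algorithm>
    #include <atomic>
    #include <chrono>
    #include <cstddef>
    #include <thread>
    #include <vector>

    // Background model-averaging loop for PerNode replicas (sketch). It
    // repeatedly reads every replica, averages them, and writes the average
    // back, batching many cross-socket writes into one. Workers keep updating
    // their replicas while this runs.
    void averaging_loop(std::vector<std::vector<double>>& replicas,
                        std::atomic<bool>& stop) {
      const std::size_t d = replicas[0].size();
      std::vector<double> avg(d);
      while (!stop.load()) {
        std::fill(avg.begin(), avg.end(), 0.0);
        for (const auto& r : replicas)
          for (std::size_t j = 0; j < d; ++j) avg[j] += r[j];
        for (std::size_t j = 0; j < d; ++j) avg[j] /= replicas.size();
        for (auto& r : replicas)
          for (std::size_t j = 0; j < d; ++j) r[j] = avg[j];
        // "As frequently as possible": only a tiny pause between rounds.
        std::this_thread::sleep_for(std::chrono::microseconds(100));
      }
    }

Typical use would be to launch it once per machine, e.g., std::thread t(averaging_loop, std::ref(replicas), std::ref(stop)), and set stop when training finishes.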
3We implemented MLlib's minibatch in DimmWitted. We find that
the Hogwild!-like implementation always dominates the minibatch
implementation. DimmWitted's column-wise implementation for
PerMachine is similar to GraphLab, with the only difference that
DimmWitted does not schedule the task in an event-driven way.
[Figure 9 plots: (a) number of epochs to converge vs. error to the optimal loss (1%-100%) for Sharding and FullReplication; (b) time per epoch (seconds) for the two strategies on local2, local4, and local8.]
Figure 9: Illustration of Data Replication.
Tradeoffs. We observe that PerNode is more hardware ef-
ficient, as it takes less time to execute an epoch than Per-
Machine; PerMachine might use fewer epochs to
converge than PerNode.
Statistical Efficiency. We observe that PerMachine usu-
ally takes fewer epochs to converge to the same loss com-
pared to PerNode, and PerNode uses fewer epochs
than PerCore. To illustrate this observation, Figure 8(a)
shows the number of epochs that each strategy requires to
converge to a given loss for SVM (RCV1). We see that
PerMachine always uses the least number of epochs to con-
verge to a given loss: intuitively, the single model replica
has more information at each step, which means that there
is less redundant work. We observe similar phenomena when
comparing PerCore and PerNode.
Hardware Efficiency. We observe that PerNode uses
much less time to execute an epoch than PerMachine. To
illustrate the difference in the time that each model replica-
tion strategy uses to finish one epoch, we show in Figure 8(b)
the execution time of three strategies on SVM (RCV1). We
see that PerNode is 23× faster than PerMachine and that Per-
Core is 1.5× faster than PerNode. PerNode takes advantage
of the locality provided by the NUMA architecture. Using
PMUs, we find that PerMachine incurs 11× more cross-node
DRAM requests than PerNode.
Rule of Thumb. For SGD-based models, PerNode usually
gives optimal results, while for SCD-based models, PerMa-
chine does. Intuitively, this is because SGD has a denser
update pattern than SCD, so PerMachine suffers in hardware
efficiency.
3.4 Data Replication
In DimmWitted, each worker processes a subset of data
and then updates its model replica. To assign a subset of
data to each worker, we consider two strategies.
Sharding. Sharding is a popular strategy implemented in sys-
tems such as Hogwild!, Spark, and Bismarck, in which the
dataset is partitioned, and each worker only works on its par-
tition of data. When there is a single model replica, Sharding
avoids wasted computation, as each tuple is processed once
per epoch. However, when there are multiple model repli-
cas, Sharding might increase the variance of the estimate
we form on each node, lowering the statistical efficiency. In
DimmWitted, we implement Sharding by randomly parti-
tioning the rows (resp. columns) of a data matrix for the
row-wise (resp. column-wise) access method. In column-to-
row access, we also replicate other rows that are needed.
FullReplication. A simple alternative to Sharding is FullRepli-
cation, in which we replicate the whole dataset many times
Model      Dataset   #Row   #Col.   NNZ    Size (Sparse)   Size (Dense)
SVM/LR/LS  RCV1      781K   47K     60M    914MB           275GB
           Reuters   8K     18K     93K    1.4MB           1.2GB
           Music     515K   91      46M    701MB           0.4GB
           Forest    581K   54      30M    490MB           0.2GB
LP         Amazon    926K   335K    2M     28MB            >1TB
           Google    2M     2M      3M     25MB            >1TB
QP         Amazon    1M     1M      7M     104MB           >1TB
           Google    2M     2M      10M    152MB           >1TB
Gibbs      Paleo     69M    30M     108M   2GB             >1TB
NN         MNIST     120M   800K    120M   2GB             >1TB
Figure 10: Dataset Statistics. NNZ refers to the
Number of Non-zero elements. The # columns is
equal to the number of variables in the model.
(PerCore or PerNode). In PerNode, each NUMA node will
have a full copy of the data. Each node accesses its data
in a different order, which means that the replicas provide
non-redundant statistical information. Statistically, there
are two benefits of FullReplication: (1) averaging different
estimates from each node has a lower variance, and (2) the
estimate at each node has lower variance than in the Shard-
ing case, as each node’s estimate is based on the whole data.
From a hardware efficiency perspective, reads are more fre-
quent from local NUMA memory in PerNode than in Per-
Machine. The PerNode approach dominates the PerCore ap-
proach, as reads from the same node go to the same NUMA
memory. Thus, we do not consider PerCore replication from
this point on.
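The two data-replication strategies amount to a different row assignment per NUMA node; the small sketch below illustrates this under stated assumptions (rows are assigned by index, and the node id seeds the node-specific order).

    #include <algorithm>
    #include <cstddef>
    #include <numeric>
    #include <random>
    #include <vector>

    // Row assignment for one NUMA node (sketch). Under Sharding, node `node` of
    // `num_nodes` sees only its partition; under FullReplication, every node
    // sees all N rows, shuffled in a node-specific order so that the replicas
    // contribute non-redundant statistical information.
    std::vector<std::size_t> rows_for_node(std::size_t N, std::size_t node,
                                           std::size_t num_nodes,
                                           bool full_replication) {
      std::vector<std::size_t> rows;
      if (full_replication) {
        rows.resize(N);
        std::iota(rows.begin(), rows.end(), std::size_t{0});
        std::mt19937 rng(static_cast<unsigned>(node));    // node-specific ordering
        std::shuffle(rows.begin(), rows.end(), rng);
      } else {
        std::size_t begin = (N * node) / num_nodes;
        std::size_t end = (N * (node + 1)) / num_nodes;
        for (std::size_t i = begin; i < end; ++i) rows.push_back(i);
      }
      return rows;
    }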
Tradeoffs. Not surprisingly, we observe that FullReplication
takes more time for each epoch than Sharding. However,
we also observe that FullReplication uses fewer epochs than
Sharding, especially to achieve low error. We illustrate these
two observations by showing the result of running SVM on
Reuters using PerNode in Figure 9.
Statistical Efficiency. FullReplication uses fewer epochs,
especially at low error tolerances. Figure 9(a) shows the
number of epochs that each strategy takes to converge to a
given loss. We see that, for within 1% of the loss, FullRepli-
cation uses 10× fewer epochs on a two-node machine. This
is because each model replica sees more data than Shard-
ing, and therefore has a better estimate. Because of this
difference in the number of epochs, FullReplication is 5×
faster in wall-clock time than Sharding to converge to 1%
loss. However, we also observe that, in high-error regions,
FullReplication uses more epochs than Sharding, resulting in a
comparable execution time to reach a given loss.
Hardware Efficiency. Figure 9(b) shows the time for
each epoch across different machines with different numbers
of nodes. Because we are using the PerNode strategy, which
is the optimal choice for this dataset, the more nodes a ma-
chine has, the slower FullReplication is for each epoch. The
slow-down is roughly consistent with the number of nodes
on each machine. This is not surprising because each epoch
of FullReplication processes more data than Sharding.
4. EXPERIMENTS
We validate that exploiting the tradeoff space that we de-
scribed enables DimmWitted’s orders of magnitude speedup
over state-of-the-art competitor systems. We also validate
that each tradeoff discussed in this paper affects the perfor-
mance of DimmWitted.
4.1 Experiment Setup
We describe the details of our experimental setting.
Datasets and Statistical Models. We validate the per-
formance and quality of DimmWitted on a diverse set of
statistical models and datasets. For statistical models, we
choose five models that are among the most popular models
used in statistical analytics: (1) Support Vector Machine
(SVM), (2) Logistic Regression (LR), (3) Least Squares
Regression (LS), (4) Linear Programming (LP), and (5)
Quadratic Programming (QP). For each model, we choose
datasets with different characteristics, including size, spar-
sity, and under- or over-determination. For SVM, LR, and
LS, we choose four datasets: Reuters4, RCV15, Music6, and
Forest.7 Reuters and RCV1 are datasets for text classifica-
tion that are sparse and underdetermined. Music and Forest
are standard benchmark datasets that are dense and overde-
termined. For LP and QP, we consider a social-network ap-
plication, i.e., network analysis, and use two datasets from
Amazon’s customer data and Google’s Google+ social net-
works.8 Figure 10 shows the dataset statistics.
Metrics. We measure the quality and performance of
DimmWitted and other competitors. To measure the qual-
ity, we follow prior art and use the loss function for all
models. For end-to-end performance, we measure the wall-
clock time it takes for each system to converge to a loss that
is within 100%, 50%, 10%, and 1% of the optimal loss.9
When measuring the wall-clock time, we do not count the
time used for data loading and result outputting for all sys-
tems. We also use other measurements to understand the
details of the tradeoff space, including (1) local LLC request,
(2) remote LLC request, and (3) local DRAM request. We
use Intel Performance Monitoring Units (PMUs) and follow
the manual10 to conduct these experiments.
Experiment Setting. We compare DimmWitted with four
competitor systems: GraphLab [34], GraphChi [28], ML-
lib [47] over Spark [55], and Hogwild! [38]. GraphLab is
a distributed graph processing system that supports a large
range of statistical models. GraphChi is similar to GraphLab
but with a focus on multi-core machines with secondary stor-
age. MLlib is a package of machine learning algorithms im-
plemented over Spark, an in-memory implementation of the
MapReduce framework. Hogwild! is an in-memory lock-
free framework for statistical analytics. We find that all four
systems pick some points in the tradeoff space that we con-
sidered in DimmWitted. In GraphLab and GraphChi, all
models are implemented using stochastic coordinate descent
4archive.ics.uci.edu/ml/datasets/Reuters-21578+
Text+Categorization+Collection
5about.reuters.com/researchandstandards/corpus/
6archive.ics.uci.edu/ml/datasets/YearPredictionMSD
7archive.ics.uci.edu/ml/datasets/Covertype
8snap.stanford.edu/data/
9We obtain the optimal loss by running all systems for one
hour and choosing the lowest.
10software.intel.com/en-us/articles/
performance-monitoring-unit-guidelines
                 Within 1% of the Optimal Loss                       Within 50% of the Optimal Loss
     Dataset     GraphLab  GraphChi  MLlib    Hogwild!  DW           GraphLab  GraphChi  MLlib    Hogwild!  DW
SVM  Reuters     58.9      56.7      15.5     0.1       0.1          13.6      11.2      0.6      0.01      0.01
     RCV1        > 300.0   > 300.0   > 300    61.4      26.8         > 300.0   > 300.0   58.0     0.71      0.17
     Music       > 300.0   > 300.0   156      33.32     23.7         31.2      27.1      7.7      0.17      0.14
     Forest      16.2      15.8      2.70     0.23      0.01         1.9       1.4       0.15     0.03      0.01
LR   Reuters     36.3      34.2      19.2     0.1       0.1          13.2      12.5      1.2      0.03      0.03
     RCV1        > 300.0   > 300.0   > 300.0  38.7      19.8         > 300.0   > 300.0   68.0     0.82      0.20
     Music       > 300.0   > 300.0   > 300.0  35.7      28.6         30.2      28.9      8.9      0.56      0.34
     Forest      29.2      28.7      3.74     0.29      0.03         2.3       2.5       0.17     0.02      0.01
LS   Reuters     132.9     121.2     92.5     4.1       3.2          16.3      16.7      1.9      0.17      0.09
     RCV1        > 300.0   > 300.0   > 300    27.5      10.5         > 300.0   > 300.0   32.0     1.30      0.40
     Music       > 300.0   > 300.0   221      40.1      25.8         > 300.0   > 300.0   11.2     0.78      0.52
     Forest      25.5      26.5      1.01     0.33      0.02         2.7       2.9       0.15     0.04      0.01
LP   Amazon      2.7       2.4       > 120.0  > 120.0   0.94         2.7       2.1       120.0    1.86      0.94
     Google      13.4      11.9      > 120.0  > 120.0   12.56        2.3       2.0       120.0    3.04      2.02
QP   Amazon      6.8       5.7       > 120.0  > 120.0   1.8          6.8       5.7       > 120.0  > 120.00  1.50
     Google      12.4      10.1      > 120.0  > 120.0   4.3          9.9       8.3       > 120.0  > 120.00  3.70
Figure 11: End-to-End Comparison (time in seconds). The column DW refers to DimmWitted. We take 5
runs on local2 and report the average (standard deviation for all numbers < 5% of the mean). Entries with
> indicate a timeout.
[Figure 12 plots, one column per workload (SVM on RCV1, SVM on Music, LP on Amazon, LP on Google), showing time in seconds vs. error to the optimal loss (1%-100%): row (a) compares access methods (Row-wise vs. Column-wise); row (b) compares model replication strategies (PerNode, PerMachine, PerCore).]
Figure 12: Tradeoffs in DimmWitted. Missing points timeout in 120 seconds.
(column-wise access); in MLlib and Hogwild!, SVM and LR
are implemented using stochastic gradient descent (row-wise
access). We use implementations that are provided by the
original developers whenever possible. For models with-
out code provided by the developers, we only change the
corresponding gradient function.11
For GraphChi, if the
corresponding model is implemented in GraphLab but not
GraphChi, we follow GraphLab’s implementation.
We run experiments on a variety of architectures. These
machines differ in a range of configurations, including the
number of NUMA nodes, the size of last-level cache (LLC),
and memory bandwidth. See Figure 3 for a summary of
these machines. DimmWitted, Hogwild!, GraphLab, and
GraphChi are implemented using C++, and MLlib/Spark
is implemented using Scala. We tune both GraphLab and
MLlib according to their best practice guidelines.12 For both
11For sparse models, we change the dense vector data struc-
ture in MLlib to a sparse vector, which only improves its
performance.
12MLlib:spark.incubator.apache.org/docs/0.6.0/
tuning.html; GraphLab:
graphlab.org/tutorials-2/
fine-tuning-graphlab-performance/.
For GraphChi,
we tune the memory buffer size to ensure all data fit in
memory and that there are no disk I/Os. We describe more
detailed tuning for MLlib in the full version of this paper.
GraphLab, GraphChi, and MLlib, we try different ways of
increasing locality on NUMA machines, including trying to
use numactl and implementing our own RDD for MLlib;
there is more detail in the full version of this paper. Systems
are compiled with g++ 4.7.2 (-O3), Java 1.7, or Scala 2.9.
4.2 End-to-End Comparison
We validate that DimmWitted outperforms competitor
systems in terms of end-to-end performance and quality.
Note that both MLlib and GraphLab have extra overhead
for fault tolerance, distributing work, and task scheduling.
Our comparison between DimmWitted and these competi-
tors is intended only to demonstrate that existing work for
statistical analytics has not obviated the tradeoffs that we
study here.
Protocol. For each system, we grid search their statistical
parameters, including step size ({100.0,10.0,...,0.0001}) and
mini-batch size for MLlib ({1%, 10%, 50%, 100%}); we al-
ways report the best configuration, which is essentially the
same for each system. We measure the time it takes for each
system to find a solution that is within 1%, 10%, and 50%
of the optimal loss. Figure 11 shows the results for 1% and
50%; the results for 10% are similar. We report end-to-end
numbers from local2, which has two nodes and 24 logical
cores, as GraphLab does not run on machines with more
than 64 logical cores. Figure 14 shows the DimmWitted’s
choice of point in the tradeoff space on local2.
As shown in Figure 11, DimmWitted always converges
to the given loss in less time than the other competitors.
On SVM and LR, DimmWitted could be up to 10× faster
than Hogwild!, and more than two orders of magnitude
faster than GraphLab and Spark. The difference between
DimmWitted and Hogwild! is greater for LP and QP, where
DimmWitted outperforms Hogwild! by more than two or-
ders of magnitude. On LP and QP, DimmWitted is also up
to 3× faster than GraphLab and GraphChi, and two orders
of magnitude faster than MLlib.
Tradeoff Choices. We dive more deeply into these numbers
to substantiate our claim that there are some points in the
tradeoff space that are not used by GraphLab, GraphChi,
Hogwild!, and MLlib. Each tradeoff selected by our sys-
tem is shown in Figure 14. For example, GraphLab and
GraphChi use column-wise access for all models, while ML-
lib and Hogwild! use row-wise access for all models and al-
low only PerMachine model replication. These special points
work well for some but not all models. For example, for LP
and QP, GraphLab and GraphChi are only 3× slower than
DimmWitted, which chooses column-wise and PerMachine.
This factor of 3 is to be expected, as GraphLab also allows
distributed access and so has additional overhead. However,
there are other points in the tradeoff space: for SVM and LR, DimmWitted
outperforms GraphLab and GraphChi, because the column-
wise algorithm implemented by GraphLab and GraphChi is
not as efficient as row-wise on the same dataset. DimmWit-
ted outperforms Hogwild! because DimmWitted takes ad-
vantage of model replication, while Hogwild! incurs 11×
more cross-node DRAM requests than DimmWitted; in
contrast, DimmWitted incurs 11× more local DRAM re-
quests than Hogwild! does.
For SVM, LR, and LS, we find that DimmWitted out-
performs MLlib, primarily due to a different point in the
tradeoff space. In particular, MLlib uses batch-gradient-
descent with a PerCore implementation, while DimmWitted
uses stochastic gradient and PerNode. We find that, for the
Forest dataset, DimmWitted takes 60× fewer
epochs to converge to 1% loss than MLlib. For each epoch,
DimmWitted is 4× faster. These two factors contribute to
the 240× speed-up of DimmWitted over MLlib on the For-
est dataset (1% loss). MLlib has overhead for scheduling, so
we break down the time that MLlib uses for scheduling and
computation. We find that, for Forest, out of the total 2.7
seconds of execution, MLlib uses 1.8 seconds for computa-
tion and 0.9 seconds for scheduling. We also implemented
a batch-gradient-descent and PerCore implementation inside
DimmWitted to remove these overheads and the C++-versus-Scala
differences. The 60× difference in the number of epochs until
convergence still holds, and our implementation is only 3×
faster than MLlib. This implies that the main difference be-
tween DimmWitted and MLlib is the point in the tradeoff
space—not low-level implementation differences.
For LP and QP, DimmWitted outperforms MLlib and
Hogwild! because the row-wise access method implemented
by these systems is not as efficient as column-wise access on
the same data set. GraphLab does have column-wise access,
so DimmWitted outperforms GraphLab and GraphChi be-
cause DimmWitted finishes each epoch up to 3× faster,
             SVM      LR       LS       LP        QP        Parallel
             (RCV1)   (RCV1)   (RCV1)   (Google)  (Google)  Sum
GraphLab     0.2      0.2      0.2      0.2       0.1       0.9
GraphChi     0.3      0.3      0.2      0.2       0.2       1.0
MLlib        0.2      0.2      0.2      0.1       0.02      0.3
Hogwild!     1.3      1.4      1.3      0.3       0.2       13
DimmWitted   5.1      5.2      5.2      0.7       1.3       21
Figure 13: Comparison of Throughput (GB/seconds) of Different Systems on local2.
Model        Dataset                 Access Method   Model Replication   Data Replication
SVM, LR, LS  Reuters, RCV1, Music    Row-wise        PerNode             FullReplication
LP, QP       Amazon, Google          Column-wise     PerMachine          FullReplication
Figure 14: Plans that DimmWitted Chooses in the
Tradeoff Space for Each Dataset on Machine local2.
primarily due to low-level issues. This supports our claims
that the tradeoff space is interesting for analytic engines and
that no one system has implemented all of them.
Throughput. We compare the throughput of different sys-
tems for an extremely simple task: parallel sums. Our im-
plementation of parallel sum follows our implementation of
other statistical models (with a trivial update function),
and uses all cores on a single machine. Figure 13 shows
the throughput on all systems on different models on one
dataset. We see from Figure 13 that DimmWitted achieves
the highest throughput of all the systems. For parallel sum,
DimmWitted is 1.6× faster than Hogwild!, and we find that
DimmWitted incurs 8× fewer LLC cache misses than Hog-
wild!. Compared with Hogwild!, in which all threads write
to a single copy of the sum result, DimmWitted maintains
one single copy of the sum result per NUMA node, so the
workers on one NUMA node do not invalidate the cache
on another NUMA node. When running on only a single
thread, DimmWitted has the same implementation as Hog-
wild!. Compared with GraphLab and GraphChi, DimmWit-
ted is 20× faster, likely due to the overhead of GraphLab
and GraphChi dynamically scheduling tasks and/or main-
taining the graph structure. To compare DimmWitted with
MLlib, which is written in Scala, we implemented a Scala
version, which is 3× slower than C++; this suggests that
the overhead is not just due to the language. If we do not
count the time that MLlib uses for scheduling and only count
the time of computation, we find that DimmWitted is 15×
faster than MLlib.
4.3 Tradeoffs of DimmWitted
We validate that all the tradeoffs described in this paper
have an impact on the efficiency of DimmWitted. We re-
port on a more modern architecture, local4 with four NUMA
sockets, in this section. We describe how the results change
with different architectures.
4.3.1 Access Method Selection
We validate that different access methods have different
performance, and that no single access method dominates
the others. We run DimmWitted on all statistical models
and compare two strategies, row-wise and column-wise. In
[Figure 15 plots: ratio of execution time per epoch (row-wise/column-wise) for SVM (RCV1) and LP (Amazon) across the five machines, labeled by #cores/socket x #sockets: [e1] 8x2, [e2] 8x2, [l2] 6x2, [l4] 10x4, [l8] 8x8.]
Figure 15: Ratio of Execution Time per Epoch (row-
wise/column-wise) on Different Architectures. A
number larger than 1 means that row-wise is slower.
l2 means local2, e1 means ec2.1, etc.
each experiment, we force DimmWitted to use the corre-
sponding access method, but report the best point for the
other tradeoffs. Figure 12(a) shows the results as we mea-
sure the time it takes to achieve each loss. The more strin-
gent loss requirements (1%) are on the left-hand side. The
horizontal line segments in the graph indicate that a model
may reach, say, 50% as quickly (in epochs) as it reaches
100%.
We see from Figure 12(a) that the difference between row-
wise and column-to-row access could be more than 100×
for different models. For SVM on RCV1, row-wise access
converges at least 4× faster to 10% loss and at least 10×
faster to 100% loss. We observe similar phenomena for
Music; compared with RCV1, column-to-row access con-
verges to 50% loss and 100% loss at a 10× slower rate.
With such datasets, the column-to-row access simply re-
quires more reads and writes. This supports the folk wis-
dom that gradient methods are preferable to coordinate de-
scent methods. On the other hand, for LP, column-wise
access dominates: row-wise access does not converge to 1%
loss within the timeout period for either Amazon or Google.
Column-wise access converges at least 10-100× faster than
row-wise access to 1% loss. We observe that LR is similar
to SVM and QP is similar to LP. Thus, no access method
dominates all the others.
The costs of writing and reading differ; the difference is cap-
tured by the parameter that we called α in Section 3.2. We de-
scribe the impact of this factor on the relative performance
of row- and column-wise strategies. Figure 15 shows the
ratio of the time that each strategy uses (row-wise/column-
wise) for SVM (RCV1) and LP (Amazon). We see that,
as the number of sockets on a machine increases, the ratio
of execution time increases, which means that row-wise be-
comes slower relative to column-wise, i.e., with increasing α.
As the write cost captures the cost of a hardware-resolved
conflict, we see that this constant is likely to grow. Thus,
if next-generation architectures increase in the number of
sockets, the cost parameter α and consequently the impor-
tance of this tradeoff are likely to grow.
Cost-based Optimizer. We observed that, for all datasets,
our cost-based optimizer selects row-wise access for SVM,
LR, and LS, and column-wise access for LP and QP. These
choices are consistent with what we observed in Figure 12.
4.3.2 Model Replication
We validate that there is no single strategy for model repli-
cation that dominates the others. We force DimmWitted
to run strategies in PerMachine, PerNode, and PerCore and
choose other tradeoffs by choosing the plan that achieves
the best result. Figure 12(b) shows the results.
We see from Figure 12(b) that the gap between PerMa-
chine and PerNode could be up to 100×. We first observe
that PerNode dominates PerCore on all datasets. For SVM
on RCV1, PerNode converges 10× faster than PerCore to
50% loss, and for other models and datasets, we observe a
similar phenomenon. This is due to the low statistical effi-
ciency of PerCore, as we discussed in Section 3.3. Although
PerCore eliminates write contention inside one NUMA node,
this write contention is less critical. For large models and
machines with small caches, we also observe that PerCore
could spill the cache.
These graphs show that neither PerMachine nor PerNode
dominates the other across all datasets and statistical mod-
els. For SVM on RCV1, PerNode converges 12× faster than
PerMachine to 50% loss. However, for LP on Amazon, Per-
Machine is at least 14× faster than PerNode to converge to
1% loss. For SVM, PerNode converges faster because it has
5× higher throughput than PerMachine, and for LP, PerN-
ode is slower because PerMachine takes at least 10× fewer
epochs to converge to a small loss. One interesting obser-
vation is that, for LP on Amazon, PerMachine and PerNode
do have comparable performance to converge to 10% loss.
Compared with the 1% loss case, this implies that PerN-
ode’s statistical efficiency decreases as the algorithm tries to
achieve a smaller loss. This is not surprising, as one must
reconcile the PerNode estimates.
We observe that the relative performance of PerMachine
and PerNode depends on (1) the number of sockets used on
each machine and (2) the sparsity of the update.
To validate (1), we measure the time that PerNode and
PerMachine take on SVM (RCV1) to converge to 50%
loss on various architectures, and we report the ratio
(PerMachine/PerNode) in Figure 16. We see that PerNode’s
relative performance improves with the number of sockets.
We attribute this to the increased cost of write contention
in PerMachine.
To validate (2), we generate a series of synthetic datasets,
each of which subsamples the elements in each row of the
Music dataset; Figure 16(b) shows the results. When the
sparsity is 1%, PerMachine outperforms PerNode, as each
update touches only one element of the model; thus, the
write contention in PerMachine is not a bottleneck. As the
sparsity increases (i.e., the update becomes denser), we ob-
serve that PerNode outperforms PerMachine.
4.3.3 Data Replication
We validate the impact of different data replication strate-
gies. We run DimmWitted by fixing data replication strate-
gies to FullReplication or Sharding and choosing the best plan
for each other tradeoff. We measure the execution time for
each strategy to converge to a given loss for SVM on the
same dataset, RCV1. We report the ratio of these two
strategies as FullReplication/Sharding in Figure 17(a). We
see that, for the low-error region (e.g., 0.1%), FullReplication
is 1.8-2.5× faster than Sharding. This is because FullReplica-
tion decreases the skew of data assignment to each worker,
and hence each individual model replica can form a more ac-
curate estimate. For the high-error region (e.g., 100%), we
[Figure 16 plots: ratio of execution time (PerMachine/PerNode) as a function of (a) architecture, labeled by #cores/socket x #sockets ([e1] 8x2, [e2] 8x2, [l2] 6x2, [l4] 10x4, [l8] 8x8), and (b) the sparsity of synthetic datasets derived from Music; annotations mark the regions where PerMachine or PerNode is better.]
Figure 16: The Impact of Different Architectures
and Sparsity on Model Replication. A ratio larger
than 1 means that PerNode converges faster than
PerMachine to 50% loss.
[Figure 17 plots: (a) ratio of execution time (FullReplication/Sharding) as a function of error to the optimal loss (0.1%-100%), annotated with the regions where FullReplication or Sharding is better; (b) throughput in millions of variables per second for Gibbs sampling and neural networks under the classic choice vs. DimmWitted.]
Figure 17: (a) Tradeoffs of Data Replication. A ra-
tio smaller than 1 means that FullReplication is faster.
(b) Performance of Gibbs Sampling and Neural Net-
works Implemented in DimmWitted.
For the high-error region (e.g., 100%), we observe that FullReplication appears to be 2-5× slower than
Sharding. We find that, for 100% loss, both FullReplication
and Sharding converge in a single epoch, and Sharding may
therefore be preferred, as it examines less data to complete
that single epoch. In all of our experiments, FullReplication
is never substantially worse and can be dramatically better.
Thus, if there is available memory, FullReplication seems to be the preferable data replication strategy.
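As a concrete illustration of the two data replication strategies, the sketch below (a hypothetical helper, not DimmWitted's API) builds the per-worker row assignments: Sharding gives each worker a disjoint partition of the rows, while FullReplication gives every worker all rows at the cost of replicating the data.

#include <cstddef>
#include <vector>

// Sharding: each row goes to exactly one worker (round-robin).
std::vector<std::vector<size_t>> shard(size_t n_rows, size_t n_workers) {
  std::vector<std::vector<size_t>> assignment(n_workers);
  for (size_t r = 0; r < n_rows; ++r) assignment[r % n_workers].push_back(r);
  return assignment;
}

// FullReplication: every worker sees every row each epoch, at the cost of
// n_workers times the memory for the data.
std::vector<std::vector<size_t>> replicate(size_t n_rows, size_t n_workers) {
  std::vector<size_t> all(n_rows);
  for (size_t r = 0; r < n_rows; ++r) all[r] = r;
  return std::vector<std::vector<size_t>>(n_workers, all);
}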
5. EXTENSIONS
We briefly describe how to run Gibbs sampling (which
uses a column-to-row access method) and deep neural networks (which use a row access method). Using the same
tradeoffs, we achieve a significant increase in speed over the
classical implementation choices of these algorithms. A more
detailed description is in the full version of this paper.
5.1 Gibbs Sampling
Gibbs sampling is one of the most popular algorithms
to solve statistical inference and learning over probabilis-
tic graphical models [43]. We briefly describe Gibbs sam-
pling over factor graphs and observe that its main step is a
column-to-row access. A factor graph can be thought of as
a bipartite graph of a set of variables and a set of factors.
To run Gibbs sampling, the main operation is to select a single variable and calculate the conditional probability of this variable, which requires fetching all factors that contain this variable and all assignments of the variables connected to these factors. This operation corresponds to the column-to-row access method. Similar to first-order methods, a Hogwild! algorithm for Gibbs sampling was recently established [25]. As shown in Figure 17(b), applying the tech-
nique in DimmWitted to Gibbs sampling achieves 4× the
throughput of samples as the PerMachine strategy.
5.2 Deep Neural Networks
Neural networks are one of the most classical machine learning models [35]; recently, these models have been intensively revisited by adding more layers [19,29]. A deep neural network contains multiple layers, and each layer contains a set of neurons (variables). Neurons are connected only by links across consecutive layers. The value of
one neuron is a function of all the other neurons in the pre-
vious layer and a set of weights. Variables in the last layer
have human labels as training data; the goal of deep neural
network learning is to find the set of weights that maximizes
the likelihood of the human labels. Back-propagation with
stochastic gradient descent is the de facto method of opti-
mizing a deep neural network.
Following LeCun et al. [30], we implement SGD over a
seven-layer neural network with 0.12 billion neurons and 0.8
million parameters using a standard handwriting-recognition
benchmark dataset called MNIST13. Figure 17(b) shows the
number of variables that are processed by DimmWitted
per second. For this application, DimmWitted uses PerN-
ode and FullReplication, and the classical choice made by Le-
Cun is PerMachine and Sharding. As shown in Figure 17(b),
DimmWitted achieves more than an order of magnitude
higher throughput than this classical baseline (while achieving the same quality as reported in that paper).
6. RELATED WORK
We review work in four main areas: statistical analytics,
data mining algorithms, shared-memory multiprocessor op-
timization, and main-memory databases. We include more
extensive related work in the full version of this paper.
Statistical Analytics. There is a trend to integrate sta-
tistical analytics into data processing systems. Database
vendors have recently put out new products in this space, in-
cluding Oracle, Pivotal’s MADlib [23], IBM’s SystemML [21],
and SAP’s HANA. These systems support statistical analyt-
ics in existing data management systems. A key challenge
for statistical analytics is performance.
A handful of data processing frameworks have been de-
veloped in the last few years to support statistical ana-
lytics, including Mahout for Hadoop, MLI for Spark [47],
GraphLab [34], and MADlib for PostgreSQL or Greenplum [23]. Although these systems increase the perfor-
mance of corresponding statistical analytics tasks signifi-
cantly, we observe that each of them implements one point
in DimmWitted’s tradeoff space. DimmWitted is not a
system; our goal is to study this tradeoff space.
Data Mining Algorithms. There is a large body of
data mining literature regarding how to optimize various al-
gorithms to be more architecturally aware [39,56,57]. Zaki
et al. [39,57] study the performance of a range of different al-
gorithms, including association rule mining and decision trees, on shared-memory machines by improving memory locality and data placement at the granularity of cachelines and by decreasing the cost of coherence maintenance between multiple CPU caches. Ghoting et al. [20] optimize the cache behavior
of frequent pattern mining using novel cache-conscious tech-
niques, including spatial and temporal locality, prefetching,
13yann.lecun.com/exdb/mnist/
and tiling. Jin et al. [24] discuss tradeoffs in replication and
locking schemes for K-means, association rule mining, and
neural nets. This work considers the hardware efficiency of
the algorithm, but not statistical efficiency, which is the fo-
cus of DimmWitted. In addition, Jin et al. do not consider
lock-free execution, a key aspect of this paper.
Shared-memory Multiprocessor Optimization. Per-
formance optimization on shared-memory multiprocessor machines is a classical topic. The seminal work of Anderson and Lam [4] and Carr et al. [14] used compiler techniques to improve locality on shared-memory multiprocessor ma-
chines. DimmWitted’s locality group is inspired by Ander-
son and Lam’s discussion of computation decomposition and
data decomposition. These locality groups are the center-
piece of the Legion project [6]. In recent years, there have
been a variety of domain specific languages (DSLs) to help
the user extract parallelism; two examples of these DSLs in-
clude Galois [36,37] and OptiML [49] for Delite [15]. Our
goals are orthogonal: these DSLs require knowledge about
the trade-offs of the hardware, such as those provided by
our study.
Main-memory Databases. The database community
has recognized that multi-socket, large-memory machines
have changed the data processing landscape, and there has
been a flurry of recent work about how to build in-memory
analytics systems [3,5,16,27,31,40,41,52]. Classical tradeoffs
have been revisited on modern architectures to gain signif-
icant improvement: Balkesen et al. [5], Albutiu et al. [3], and Kim et al. [27] study the tradeoff for joins, and Li et al. [31] for data shuffling. This work takes advantage of modern
architectures, e.g., NUMA and SIMD, to increase memory
bandwidth. We study a new tradeoff space for statistical
analytics in which the performance of the system is affected
by both hardware efficiency and statistical efficiency.
7. CONCLUSION
For statistical analytics on main-memory, NUMA-aware
machines, we studied tradeoffs in access methods, model
replication, and data replication. We found that using novel
points in this tradeoff space can have a substantial bene-
fit: our DimmWitted prototype engine can run at least
one popular task at least 100× faster than other competitor
systems. This comparison demonstrates that this tradeoff
space may be interesting for current and next-generation
statistical analytics systems.
Acknowledgments We would like to thank Arun Kumar, Victor
Bittorf, the Delite team, the Advanced Analytics at Oracle, Green-
plum/Pivotal, and Impala’s Cloudera team for sharing their experi-
ences in building analytics systems. We gratefully acknowledge the
support of the Defense Advanced Research Projects Agency (DARPA)
XDATA Program under No. FA8750-12-2-0335 and the DEFT Pro-
gram under No. FA8750-13-2-0039, the National Science Founda-
tion (NSF) CAREER Award under No. IIS-1353606, the Office of
Naval Research (ONR) under awards No. N000141210041 and No.
N000141310129, the Sloan Research Fellowship, American Family In-
surance, Google, and Toshiba. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the au-
thors and do not necessarily reflect the view of DARPA, NSF, ONR,
or the US government.
8. REFERENCES
[1] A. Agarwal, O. Chapelle, M. Dudík, and J. Langford. A reliable
effective terascale linear learning system. ArXiv e-prints, 2011.
[2] A. Ailamaki, D. J. DeWitt, M. D. Hill, and M. Skounakis.
Weaving relations for cache performance. In VLDB, 2001.
[3] M.-C. Albutiu, A. Kemper, and T. Neumann. Massively
parallel sort-merge joins in main memory multi-core database
systems. PVLDB, pages 1064–1075, 2012.
[4] J. M. Anderson and M. S. Lam. Global optimizations for
parallelism and locality on scalable parallel machines. In PLDI,
pages 112–125, 1993.
[5] C. Balkesen and et al. Multi-core, main-memory joins: Sort vs.
hash revisited. PVLDB, pages 85–96, 2013.
[6] M. Bauer, S. Treichler, E. Slaughter, and A. Aiken. Legion:
expressing locality and independence with logical regions. In
SC, page 66, 2012.
[7] N. Bell and M. Garland. Efficient sparse matrix-vector
multiplication on CUDA. Technical report, NVIDIA
Corporation, 2008.
[8] N. Bell and M. Garland. Implementing sparse matrix-vector
multiplication on throughput-oriented processors. In SC, pages
18:1–18:11, 2009.
[9] L. Bergstrom. Measuring NUMA effects with the STREAM
benchmark. ArXiv e-prints, 2011.
[10] C. Boutsidis and et al. Near-optimal coresets for least-squares
regression. IEEE Transactions on Information Theory, 2013.
[11] J. K. Bradley, A. Kyrola, D. Bickson, and C. Guestrin. Parallel
coordinate descent for l1-regularized loss minimization. In
ICML, pages 321–328, 2011.
[12] G. Buehrer and et al. Toward terabyte pattern mining: An
architecture-conscious solution. In PPoPP, pages 2–12, 2007.
[13] G. Buehrer, S. Parthasarathy, and Y.-K. Chen. Adaptive
parallel graph mining for cmp architectures. In ICDM, pages
97–106, 2006.
[14] S. Carr, K. S. McKinley, and C.-W. Tseng. Compiler
optimizations for improving data locality. In ASPLOS, 1994.
[15] H. Chafi, A. K. Sujeeth, K. J. Brown, H. Lee, A. R. Atreya,
and K. Olukotun. A domain-specific approach to heterogeneous
parallelism. In PPOPP, pages 35–46, 2011.
[16] C. Chasseur and J. M. Patel. Design and evaluation of storage
organizations for read-optimized main memory databases.
PVLDB, pages 1474–1485, 2013.
[17] C. T. Chu and et al. Map-reduce for machine learning on
multicore. In NIPS, pages 281–288, 2006.
[18] E. F. D’Azevedo, M. R. Fahey, and R. T. Mills. Vectorized
sparse matrix multiply for compressed row storage format. In
ICCS, pages 99–106, 2005.
[19] J. Dean and et al. Large scale distributed deep networks. In
NIPS, pages 1232–1240, 2012.
[20] A. Ghoting and et al. Cache-conscious frequent pattern mining
on modern and emerging processors. VLDBJ, 2007.
[21] A. Ghoting and et al. SystemML: Declarative machine learning
on MapReduce. In ICDE, pages 231–242, 2011.
[22] Y. He and et al. Rcfile: A fast and space-efficient data
placement structure in mapreduce-based warehouse systems. In
ICDE, pages 1199–1208, 2011.
[23] J. M. Hellerstein and et al. The MADlib analytics library: Or
MAD skills, the SQL. PVLDB, pages 1700–1711, 2012.
[24] R. Jin, G. Yang, and G. Agrawal. Shared memory
parallelization of data mining algorithms: Techniques,
programming interface, and performance. TKDE, 2005.
[25] M. J. Johnson, J. Saunderson, and A. S. Willsky. Analyzing
Hogwild parallel Gaussian Gibbs sampling. In NIPS, 2013.
[26] M.-Y. Kan and H. O. N. Thi. Fast webpage classification using
url features. In CIKM, pages 325–326, 2005.
[27] C. Kim and et al. Sort vs. hash revisited: Fast join
implementation on modern multi-core CPUs. PVLDB, 2009.
[28] A. Kyrola, G. Blelloch, and C. Guestrin. Graphchi: Large-scale
graph computation on just a pc. In OSDI, pages 31–46, 2012.
[29] Q. V. Le and et al. Building high-level features using large
scale unsupervised learning. In ICML, pages 8595–8598, 2012.
[30] Y. LeCun and et al. Gradient-based learning applied to
document recognition. IEEE, pages 2278–2324, 1998.
[31] Y. Li and et al. NUMA-aware algorithms: the case of data
shuffling. In CIDR, 2013.
[32] J. Liu and et al. An asynchronous parallel stochastic coordinate
descent algorithm. ICML, 2014.
[33] Y. Low and et al. Graphlab: A new framework for parallel
machine learning. In UAI, pages 340–349, 2010.
[34] Y. Low and et al. Distributed GraphLab: A framework for
machine learning in the cloud. PVLDB, pages 716–727, 2012.
[35] T. M. Mitchell. Machine Learning. McGraw-Hill, USA, 1997.
[36] D. Nguyen, A. Lenharth, and K. Pingali. A lightweight
infrastructure for graph analytics. In SOSP, 2013.
[37] D. Nguyen, A. Lenharth, and K. Pingali. Deterministic Galois:
On-demand, portable and parameterless. In ASPLOS, 2014.
[38] F. Niu and et al. Hogwild: A lock-free approach to parallelizing
stochastic gradient descent. In NIPS, pages 693–701, 2011.
[39] S. Parthasarathy, M. J. Zaki, M. Ogihara, and W. Li. Parallel
data mining for association rules on shared memory systems.
Knowl. Inf. Syst., pages 1–29, 2001.
[40] L. Qiao and et al. Main-memory scan sharing for multi-core
CPUs. PVLDB, pages 610–621, 2008.
[41] V. Raman and et al. DB2 with BLU acceleration: So much
more than just a column store. PVLDB, pages 1080–1091, 2013.
[42] P. Richtárik and M. Takáč. Parallel coordinate descent
methods for big data optimization. ArXiv e-prints, 2012.
[43] C. P. Robert and G. Casella. Monte Carlo Statistical Methods
(Springer Texts in Statistics). Springer, USA, 2005.
[44] A. Silberschatz, J. L. Peterson, and P. B. Galvin. Operating
System Concepts (3rd Ed.). Addison-Wesley Longman
Publishing Co., Inc., Boston, MA, USA, 1991.
[45] A. Smola and S. Narayanamurthy. An architecture for parallel
topic models. PVLDB, pages 703–710, 2010.
[46] S. Sonnenburg and et al. The SHOGUN machine learning
toolbox. J. Mach. Learn. Res., pages 1799–1802, 2010.
[47] E. Sparks and et al. MLI: An API for distributed machine
learning. In ICDM, pages 1187–1192, 2013.
[48] S. Sridhar and et al. An approximate, efficient LP solver for LP
rounding. In NIPS, pages 2895–2903, 2013.
[49] A. K. Sujeeth and et al. OptiML: An Implicitly Parallel
Domain-Specific Language for Machine Learning. In ICML,
pages 609–616, 2011.
[50] S. Tatikonda and S. Parthasarathy. Mining tree-structured data
on multicore systems. PVLDB, pages 694–705, 2009.
[51] J. Tsitsiklis, D. Bertsekas, and M. Athans. Distributed
asynchronous deterministic and stochastic gradient
optimization algorithms. IEEE Transactions on Automatic
Control, pages 803–812, 1986.
[52] S. Tu and et al. Speedy transactions in multicore in-memory
databases. In SOSP, pages 18–32, 2013.
[53] S. Williams and et al. Optimization of sparse matrix-vector
multiplication on emerging multicore platforms. In SC, pages
38:1–38:12, 2007.
[54] X. Yang, S. Parthasarathy, and P. Sadayappan. Fast sparse
matrix-vector multiplication on gpus: Implications for graph
mining. PVLDB, pages 231–242, 2011.
[55] M. Zaharia and et al. Resilient distributed datasets: A
fault-tolerant abstraction for in-memory cluster computing. In
NSDI, 2012.
[56] M. Zaki, C.-T. Ho, and R. Agrawal. Parallel classification for
data mining on shared-memory multiprocessors. In ICDE,
pages 198–205, 1999.
[57] M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New
algorithms for fast discovery of association rules. In KDD,
pages 283–286, 1997.
[58] M. Zinkevich and et al. Parallelized stochastic gradient descent.
In NIPS, pages 2595–2603, 2010.
APPENDIX
A. IMPLEMENTATION DETAILS
In DimmWitted, we implement optimizations that are
part of scientific computation and analytics systems. While
these optimizations are not new, they are not universally
implemented in analytics systems. We briefly describe each optimization and its impact.
Data and Worker Collocation. We observe that different
strategies of locating data and workers affect the perfor-
mance of DimmWitted. One standard technique is to col-
locate the worker and the data on the same NUMA node.
In this way, the workers on each node pull data from their own DRAM region and do not consume the DRAM bandwidth of other nodes. In DimmWitted, we tried two different placement strategies for data and workers. The first protocol, called OS, relies on the operating system to allocate data and threads for workers. The operating system will usually locate the data on a single NUMA node and assign worker threads to different NUMA nodes using heuristics that are not exposed to the user. The second protocol, called NUMA, evenly distributes worker threads across NUMA nodes and, for each worker, replicates the data on the same NUMA node. We find that, for SVM on RCV1, the NUMA strategy can be up to 2× faster than OS. There are two reasons for this improvement. First, by locating data on the same NUMA node as its workers, we achieve a 1.24× improvement in data-reading throughput. Second, by not delegating worker allocation to the operating system, we obtain a more balanced placement of workers across NUMA nodes.
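A minimal sketch of the NUMA protocol follows, assuming the standard libnuma interface (numa_run_on_node, numa_alloc_onnode); it is our illustration of the idea rather than DimmWitted's actual placement code, and it runs the per-node workers sequentially for brevity.

// Compile with: g++ -O2 collocate.cc -lnuma -lpthread
#include <numa.h>
#include <cstring>
#include <thread>
#include <vector>

// Copy the data partition onto a given node and run the worker there, so the
// worker reads only from its local DRAM region.
void run_on_node(int node, const std::vector<double>& partition) {
  std::thread worker([node, &partition]() {
    numa_run_on_node(node);                       // pin this thread to the node
    size_t bytes = partition.size() * sizeof(double);
    double* local = static_cast<double*>(numa_alloc_onnode(bytes, node));
    std::memcpy(local, partition.data(), bytes);  // node-local replica of the data
    double sum = 0.0;                             // stand-in for the real worker loop
    for (size_t i = 0; i < partition.size(); ++i) sum += local[i];
    (void)sum;
    numa_free(local, bytes);
  });
  worker.join();
}

int main() {
  if (numa_available() < 0) return 1;             // fall back to the OS protocol
  std::vector<double> part(1 << 20, 1.0);
  for (int n = 0; n <= numa_max_node(); ++n) run_on_node(n, part);
}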
Dense and Sparse. For statistical analytics workloads, it
is not uncommon for the data matrix A to be sparse, es-
pecially for applications such as information extraction and
text mining. In DimmWitted, we implement two protocols,
Dense and Sparse, which store the data matrix A as a dense
or sparse matrix, respectively. A Dense storage format has two advantages: (1) when storing a fully dense vector, it requires half the space of a sparse representation, and (2) Dense is able
to leverage hardware SIMD instructions, which allows mul-
tiple floating point operations to be performed in parallel. A
Sparse storage format can use a BLAS-style scatter-gather
to incorporate SIMD, which can improve cache performance
and memory throughput; this approach has the additional overhead of the gather operation. We find that, on a synthetic dataset in which we vary the sparsity from 0.01 to 1.0, Dense can be up to 2× faster than Sparse (for sparsity=1.0), while Sparse can be up to 4× faster than Dense (for sparsity=0.01).
The dense vs. sparse tradeoff might change on newer CPUs with the VGATHERDPD intrinsic, which is designed specifically to speed up the gather operation. However, our current machines do not support this intrinsic, and how to optimize sparse and dense computation kernels is orthogonal to the main goals of this paper.
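The sketch below illustrates the two storage protocols for a single row of A (our simplification; DimmWitted's internal layout may differ): the Dense dot product streams through contiguous memory, while the Sparse one touches only the non-zeros but pays a gather on every access to x.

#include <cstddef>
#include <vector>

struct SparseRow { std::vector<size_t> idx; std::vector<double> val; };

// Dense dot product: contiguous, SIMD-friendly sequential access.
double dot_dense(const std::vector<double>& row, const std::vector<double>& x) {
  double s = 0.0;
  for (size_t j = 0; j < row.size(); ++j) s += row[j] * x[j];
  return s;
}

// Sparse dot product: touches only the non-zeros, but gathers x[idx[k]].
double dot_sparse(const SparseRow& row, const std::vector<double>& x) {
  double s = 0.0;
  for (size_t k = 0; k < row.idx.size(); ++k) s += row.val[k] * x[row.idx[k]];
  return s;
}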
Row-major and Column-major Storage. There are two
well-studied strategies to store a data matrix A: Row-major
and Column-major storage. Not surprisingly, we observed
that choosing an incorrect data storage strategy can cause
a large slowdown. We conduct a simple experiment where
we multiply a matrix and a vector using the row-access method,
where the matrix is stored in column- and row-major order.
We find that Column-major storage can result in 9× more L1 data load misses than Row-major storage, for two reasons: (1) our architectures fetch four doubles in a cacheline, only one of which is useful for the current operation; the prefetcher in Intel machines does not prefetch across page boundaries, so it is unable to pick up significant portions of the strided access; (2) on the first access, the Data Cache Unit (DCU) prefetcher also fetches the next cacheline, compounding the problem, so the computation runs 8× slower.14 Therefore, DimmWitted always stores the dataset in a way that is consistent with the access method, no matter how the input data is stored.
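The following sketch (ours, for illustration) shows why the storage order matters for a row-access method: with row-major storage the inner loop walks memory sequentially, whereas with column-major storage it strides by n_rows doubles per step, which is exactly the access pattern that defeats the cacheline and DCU prefetchers described above.

#include <cstddef>
#include <vector>

// y must be pre-sized to n_rows by the caller in both variants.
void matvec_row_major(const std::vector<double>& A, size_t n_rows, size_t n_cols,
                      const std::vector<double>& x, std::vector<double>& y) {
  for (size_t i = 0; i < n_rows; ++i) {
    double s = 0.0;
    for (size_t j = 0; j < n_cols; ++j) s += A[i * n_cols + j] * x[j];  // sequential
    y[i] = s;
  }
}

void matvec_col_major(const std::vector<double>& A, size_t n_rows, size_t n_cols,
                      const std::vector<double>& x, std::vector<double>& y) {
  for (size_t i = 0; i < n_rows; ++i) {
    double s = 0.0;
    for (size_t j = 0; j < n_cols; ++j) s += A[j * n_rows + i] * x[j];  // strided
    y[i] = s;
  }
}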
B. EXTENDED RELATED WORK
We extend the discussion of related work. We summarize
in Figure 18 a range of related data mining work. A key
difference is that DimmWitted considers both hardware
efficiency and statistical efficiency for statistical analytics
solved by first-order methods.
Data Mining Algorithms. Probably the most related work
is by Jin et al. [24], who consider how to take advantage of
replication and different locking-based schemes with differ-
ent caching behavior and locking granularity to increase the
performance (i.e., hardware efficiency) for a range of data mining tasks, including K-means, frequent pattern mining, and neural networks. Ghoting et al. [20] optimize the cache behavior of frequent pattern mining using novel cache-conscious techniques, including spatial and temporal locality, prefetching, and tiling. Tatikonda et al. [50] consider improving the performance of mining tree-structured data on multicore systems by improving spatial and temporal locality; their technique is a careful study of different granularities and types of task and data chunking. Chu et al. [17] apply MapReduce to a large range of statistical analytics tasks that fit the statistical query model, implement it on a multicore system, and show almost linear speed-up with the number of cores. Zaki et al. [56] study how to speed up classification tasks using decision trees on SMP machines; their technique takes advantage of data parallelism and task parallelism with locking. Buehrer et al. [13] study how to build a distributed system for frequent pattern mining with terabytes of data; their focus is to minimize I/O and communication costs by optimizing the data placement and the number of passes over the dataset. Buehrer et al. [12] study how to implement efficient graph mining algorithms on CMP and SMP machines, with a focus on load balance, memory usage (i.e., size), spatial locality, and the tradeoff between pre-computing and re-computing. Zaki et al. [39,57] study how to implement parallel association rule mining algorithms on shared-memory systems by optimizing memory reference locality and data placement at the granularity of cachelines; this work also considers how to minimize the cost of coherence maintenance between multiple CPU caches. All of these techniques are related and relevant to our work, but none consider optimizing first-order methods or the effect of these optimizations on their efficiency.
14 www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
High Performance Computation. The techniques that we consider in DimmWitted for efficient implementation (Section A) are not new; they are borrowed from a wide range of literature on high performance computing, databases, and systems. Locality is a classical technique: worker and data collocation has been advocated since at least the 1990s [4,14] and is a common systems design principle [44]. The role of dense and sparse computation is well studied by the HPC community, for example, efficient computation kernels for matrix-vector and matrix-matrix multiplication [7,8,18,53]. In this work, we only require dense-
dense and dense-sparse matrix-vector multiplies. There is
recent work on mapping sparse-sparse multiplies to GPUs
and SIMD [54], which is useful for other data mining models
beyond what we consider here.
Row- vs. column-major storage has been intensively studied by the database community for traditional relational databases [2] and for Hadoop [22]. DimmWitted implements these techniques to make sure that our study of hardware efficiency and statistical efficiency reflects the state of modern hardware, and we hope that future developments on these topics can be applied to DimmWitted.
Domain Specific Languages. Domain specific languages
(DSLs) are intended to make it easy for a user to write par-
allel programs by exposing domain-specific patterns. Exam-
ples of such DSLs include Galois [36,37] and OptiML [49]
for Delite [15]. To be effective, DSLs require knowledge about the tradeoffs of the target domain to apply their compilation optimizations, and we hope the insights from DimmWitted can be applied to these DSLs.
Mathematical Optimization. Many statistical analytics tasks
are mathematical optimization problems. Recently, the math-
ematical optimization community has been looking at how
to parallelize optimization problems [32,38,58], for example, Niu et al. [38] for SGD and Shotgun [11] for SCD. A lock-free asynchronous variant was recently established by Liu et al. [32].
C. ADDITIONAL EXPERIMENTS
C.1 More Detailed Tuning Information for Spark
We report details of how we tune our Spark installation
for fair comparison. Figure 19 shows the list of parame-
ters that we used to tune Spark. For each combination of these parameters, we run one experiment that measures the throughput using a parallel sum, and we use the best-performing combination for all other experiments. For each task, we try all combinations of step size and batch size.
Statistical Efficiency: Step Size and Batch Size. We observe that the step size and the batch size of the gradient together have a significant impact on the time that Spark needs to converge. As shown in Figure 19, for each experiment we try 28 different combinations of these settings (7 step sizes and 4 batch sizes). We see that these parameters can change the time to converge to the same loss on the same dataset by more than 100×! Therefore, as shown in Figure 19, we try a large range of these two parameters and pick the best combination to report.
[Figure 18: A Taxonomy of Related Work. The taxonomy classifies Jin et al. [24], Ghoting et al. [20], Tatikonda et al. [50], Chu et al. [17], Zaki et al. [56], Buehrer et al. [13], Buehrer et al. [12], Zaki et al. [39,57], Tsitsiklis et al. [51], Niu et al. [38], Bradley et al. [11], GraphChi [28], GraphLab [33,34], MLlib [47], and DimmWitted by target architecture (multicore, NUMA (SMP), distributed), target application (data mining, graph mining, gradient-based), and target efficiency (hardware, statistical).]
Figure 19: The Set of Parameters We Tried for Tuning Spark.
Statistical Efficiency: Step size (100, 10, 1, 0.1, 0.01, 0.001, 0.0001); Batch size (100%, 50%, 10%, 1%).
Hardware Efficiency: Data Replication (1, 2, 3); Serialization (True, False); Storage Level (MEMORY_ONLY); Compression (True, False); locality.wait (1, 100, 1000, 3000, 10000); SPARK_MEM (48g, 24g, 1g); numactl (localalloc, interleave, NA).
Sources of Overhead in Spark. Spark has overhead for scheduling tasks and providing fault tolerance, both of which are features that DimmWitted does not support. To
make our comparison as fair as possible, we conduct the fol-
lowing experiments to understand how scheduling and fault
tolerance impact our claims.
We implement our own version of the batch-gradient descent algorithm in DimmWitted by strictly following MLlib's algorithm in C++. On Forest, we first observe that our own batch-gradient implementation uses a similar number of epochs (within 5%) to converge to 1% loss as MLlib, given the same step size and batch size. Second, for each epoch, our batch-gradient implementation is 3-7× faster across different architectures; this implies that MLlib does have overhead compared with DimmWitted's framework. However, our own batch-gradient implementation is still 20-39× slower than DimmWitted across different architectures.
We break down the execution time into the number of
epochs that each system needs to converge and the time
that MLlib used for scheduling and computation. In par-
ticular, we use the Forest dataset as an example. On this
dataset, DimmWitted uses 1 epoch to converge to 1% loss,
while both MLlib and our own C++ implementation use 63
and 64 epochs, respectively. MLlib uses 2.7 seconds for these 64 epochs, of which 0.9 seconds are spent on scheduling and the other 1.8 seconds on enumerating each example and calculating the gradient.15 The difference in the number of epochs to converge implies that the difference between MLlib and DimmWitted is not caused by low-level implementations, but rather by the fact that MLlib implements only a subset of points in DimmWitted's tradeoff space.
15 We observe similar breakdowns on other datasets except the smallest dataset, Reuters. On this dataset, the time used for scheduling is up to 25× the computation time.
[Figure 20: Comparison with Delite using LR (Music) on local2: speedup vs. number of threads (1-11) for PerCore, PerNode, PerMachine, and Delite, with a linear-speedup reference line.]
Hardware Efficiency. We summarize the impact of the parameters on the throughput of MLlib. For each of the 540 combinations of the seven parameters related to hardware efficiency, we run the parallel sum to measure the throughput. We find, not surprisingly, that the parameter SPARK_MEM has a significant impact on the throughput: on Music, when this parameter is set to 48GB, Spark achieves a 7× speed-up over 1GB. This is not surprising because this parameter sets the amount of RAM that Spark can use. We also find that, with SPARK_MEM set to 48GB, all other parameters make less than a 50% difference relative to each other. Therefore, in our experiments we always set SPARK_MEM to 48GB and set the other parameters to the setting that achieves the highest throughput on the corresponding dataset.
C.2 Comparison with Delite
Recently, there has been a trend of using domain specific languages (DSLs) to help users write parallel programs more easily. We conduct a simple experiment with one popular DSL, namely Delite [15], to illustrate that the tradeoffs we study in this paper have the potential to help these DSLs achieve higher performance and quality.
We use the official implementation of logistic regression
in Delite [15] and run both DimmWitted and Delite on the
Music dataset using local2.
[Figure 21: Scalability of DimmWitted using ClueWeb 2009 on local2: time per epoch (seconds) vs. scale (1x = 0.5B rows, 4B NNZs, 49GB).]
We make a best effort to tune the locality of Delite by trying different settings for numactl.
We vary the number of threads that each program can use
and plot the speed-up curve as shown in Figure 20.
First, we see from Figure 20 that the different model replication strategies in DimmWitted have different speed-up behaviors. Not surprisingly, PerCore speeds up more linearly than PerNode and PerMachine. These observations are consistent with the hardware efficiency tradeoffs that we discussed in this paper. More interestingly, we see that Delite does not speed up beyond a single socket (i.e., 6 cores). Therefore, by applying the PerNode strategy of DimmWitted to Delite, we hope to improve the speed-up behavior of Delite, as illustrated in Figure 20.
C.3 Scalability Experiments
We validate the scalability of DimmWitted by testing it on a larger dataset.
Dataset. We follow Kan et al. [26] to create a dataset that contains 500 million examples, 100K features per example, and 4 billion non-zero elements by using a Web-scale dataset called ClueWeb.16 ClueWeb contains 500 million Web pages, and the approach of Kan et al. tries to predict the PageRank score of each Web page from features of its URL using a least squares model.
Result. To validate the scalability of DimmWitted, we
randomly subsample 1%, 10%, and 50% of the examples to create smaller datasets. We run DimmWitted using the rules of thumb in Figure 14 and measure the time that DimmWitted takes for each epoch. Figure 21 shows the result. We see that, on this dataset, the time that DimmWitted needs to finish a single epoch grows almost linearly with the number of examples. We believe that this is because, for all sub-sampled datasets and the whole dataset, the model (100K weights) fits in the last-level cache (LLC).
C.4 Importance Sampling as a Data Replica-
tion Strategy
The Sharding and FullReplication data replication schemes that we discussed in Section 3 assume that all data tuples are equally important. However, in statistical analytics, it is not uncommon that some data tuples are more important than others. One example is the linear leverage score.
16http://lemurproject.org/clueweb09/
[Figure 22: Importance Sampling on Music (local2): time (seconds) to reach a given error to the optimal loss for Importance0.1, Importance0.01, Sharding, and FullReplication.]
Example C.1 (Linear Leverage Score [10]). For $A \in \mathbb{R}^{N \times d}$ and $b \in \mathbb{R}^N$, define $s(i) = a_i^T (A^T A)^{-1} a_i$, where $a_i$ is the $i$-th row of $A$. Let $\tilde{A}$ and $\tilde{b}$ be the result of sampling $m$ rows, where row $i$ is selected with probability proportional to $s(i)$. Then, for all $x \in \mathbb{R}^d$, we have
\[
\Pr\left[\,\left|\,\|Ax - b\|_2^2 - \tfrac{N}{m}\|\tilde{A}x - \tilde{b}\|_2^2\,\right| < \varepsilon \|Ax - b\|_2^2\,\right] > \tfrac{1}{2},
\]
so long as $m > 2\varepsilon^{-2} d \log d$.
For general loss functions (e.g., logistic loss), the linear leverage score calculated in the same way as above does not necessarily satisfy the property of approximating the loss. However, we can still use this score as a heuristic to decide the relative importance of data examples. In DimmWitted, we consider the following protocol, which we call Importance. Given a dataset $A$, we calculate the leverage score $s(i)$ of the $i$-th row as $a_i^T (A^T A)^{-1} a_i$. The user specifies an acceptable error tolerance $\varepsilon$, and for each epoch, DimmWitted samples $2\varepsilon^{-2} d \log d$ examples for each worker with probability proportional to the leverage scores. This procedure is implemented in DimmWitted as one data replication strategy.
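A minimal sketch of the Importance protocol follows, assuming the Eigen library for the dense linear algebra; it is our illustration under those assumptions, not DimmWitted's actual implementation.

#include <Eigen/Dense>
#include <cmath>
#include <random>
#include <vector>

// Draw roughly 2*eps^-2*d*log(d) row indices with probability proportional to
// the leverage scores s(i) = a_i^T (A^T A)^{-1} a_i.
std::vector<int> importance_sample(const Eigen::MatrixXd& A, double eps,
                                   std::mt19937& gen) {
  const int N = A.rows(), d = A.cols();
  Eigen::MatrixXd inv = (A.transpose() * A).inverse();
  std::vector<double> score(N);
  for (int i = 0; i < N; ++i) {
    Eigen::VectorXd ai = A.row(i).transpose();
    score[i] = ai.dot(inv * ai);                    // leverage score of row i
  }
  std::discrete_distribution<int> pick(score.begin(), score.end());
  int m = static_cast<int>(2.0 / (eps * eps) * d * std::log(d));
  std::vector<int> rows(m);
  for (int k = 0; k < m; ++k) rows[k] = pick(gen);  // sample with replacement
  return rows;
}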
Experimental Results. We run the above importance sam-
pling on the same data set as Section 4, and validate that
on some datasets the importance sampling scheme can im-
prove the time that DimmWitted needs to converge to a
given loss. Figure 22 shows the results of comparing differ-
ent data replication strategies on Music running on local2,
where Importance0.1 and Importance0.01 use 0.1 and 0.01 as the error tolerance ε, respectively.
We see that, on Music, Importance0.1 is 3× faster than FullReplication to reach 10% loss. This is caused by the fact that Importance0.1 processes only 10% of the data compared with FullReplication. However, Importance0.01 is slower than FullReplication. This is because, when the error tolerance is lower, the number of samples one needs to draw for each epoch increases. For Music, Importance0.01 processes the same number of tuples as FullReplication.
D. DETAILED DESCRIPTION OF EXTEN-
SIONS
We describe each extension mentioned in Section 5 in more detail.
D.1 Gibbs Sampling
Figure 23(a) illustrates a factor graph, which is a bipartite
graph that contains a set of variables, a set of factors, and a set of links between variables and factors.
[Figure 23: Illustration of a Factor Graph and Deep Neural Networks in DimmWitted. (a) and (b) show a factor graph (variables, factors, and a current variable) and how DimmWitted represents it as a column-to-row access. (c) shows a deep neural network with layers 1 through n; the de facto approach to solving it is to run SGD for each layer in a round-robin fashion.]
To run Gibbs sampling over a factor graph, one processes one variable at a time to calculate the conditional probability for the different assignments of this variable. This involves fetching all connected factors and all current assignments of the variables that are connected to these factors. Gibbs sampling then updates the current variable's assignment by randomly sampling a value according to the conditional probability and proceeds to the next random variable. Similar to first-order methods, recent theory establishes a lock-free protocol that samples multiple variables at the same time [25]. We also know from classical statistical theory [43] that one can maintain multiple copies of the same factor graph and aggregate the samples produced on each factor graph at the end of execution.
Figure 23(b) illustrates how DimmWitted models Gibbs
sampling as column-to-row access. We see that each row corresponds to one factor, each column corresponds to one variable, and the non-zero elements in the matrix correspond to the links in the factor graph. To process one variable, DimmWitted fetches one column of the matrix to get the set of factors, and other columns to get the set of variables that connect to the same factors.
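The sketch below shows one Gibbs step under this column-to-row access (our simplification; the two-state variables and the toy "all-equal" factors are illustrative assumptions, not DimmWitted's data structures): the column of the current variable yields its factors, and the rows of those factors yield the neighboring assignments needed for the conditional.

#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

struct FactorGraph {
  // Column v: the factors that contain variable v.
  std::vector<std::vector<size_t>> factors_of_var;
  // Row f: the variables connected by factor f, plus its weight.
  std::vector<std::vector<size_t>> vars_of_factor;
  std::vector<double> weight;
  std::vector<int> assignment;   // current 0/1 assignment of every variable
};

// Resample variable v from its conditional distribution.
void gibbs_step(FactorGraph& g, size_t v, std::mt19937& gen) {
  double energy[2] = {0.0, 0.0};
  for (size_t f : g.factors_of_var[v]) {          // column access: factors of v
    for (int value = 0; value < 2; ++value) {
      int agree = 1;
      for (size_t u : g.vars_of_factor[f])        // row access: vars of factor f
        if (u != v && g.assignment[u] != value) agree = 0;
      energy[value] += g.weight[f] * agree;       // toy "all-equal" factor
    }
  }
  double p1 = std::exp(energy[1]) / (std::exp(energy[0]) + std::exp(energy[1]));
  std::bernoulli_distribution coin(p1);
  g.assignment[v] = coin(gen) ? 1 : 0;
}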
In DimmWitted, we implement the PerNode strategy for
Gibbs sampling by running one independent chain for each
NUMA node. At the end of sampling, we can use all samples
generated from each NUMA node for estimation. Therefore,
we use throughput, i.e., number of samples generated per
second as the measurement for performance in Section 5.17
In DimmWitted, we implement Gibbs sampling for general factor graphs and compare it with a hand-coded implementation of topic modeling in GraphLab. We run
all systems on local2 with 100K documents and 20 topics.
We find that on local2, DimmWitted’s implementation is
3.7× faster than GraphLab’s implementation without any
application-specific optimization.
D.2 Deep Neural Networks
Figure 23(c) illustrates a Deep Neural Network as we de-
scribed in Section 5. Stochastic gradient descent is the de facto algorithm for training a neural network [30], with one twist that we discuss below. As shown in Figure 23(c),
17There has been a long historical discussion about the trade-
off between a single deep chain and multiple independent
chains in statistics. This tradeoff is outside the scope of this paper.
a deep neural network usually contains multiple layers; the SGD algorithm is run within each layer, and all layers are processed in a round-robin fashion. Therefore, in DimmWitted, we use the same SGD code path inside each layer, one layer at a time, and invoke this code path multiple times to process the different layers.
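A minimal sketch of this round-robin reuse of the SGD code path is shown below (our illustration; the Layer type and the way its gradient is produced are placeholders, not DimmWitted's actual interfaces).

#include <cstddef>
#include <vector>

struct Layer {
  std::vector<double> weights;
  std::vector<double> grad;     // filled by back-propagation (omitted here)
  // The same SGD code path, reused unchanged for every layer.
  void sgd_pass(double step) {
    for (size_t j = 0; j < weights.size(); ++j) weights[j] -= step * grad[j];
  }
};

// Process all layers in a round-robin fashion, one layer at a time per epoch.
void train(std::vector<Layer>& net, double step, int n_epochs) {
  for (int epoch = 0; epoch < n_epochs; ++epoch)
    for (Layer& layer : net)
      layer.sgd_pass(step);
}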