arXiv:1403.7550v3 [cs.DB] 7 Jul 2014
DimmWitted: A Study of Main-Memory Statistical Analytics
Ce Zhang†‡
Christopher Ré
Stanford University
University of Wisconsin-Madison
{czhang, chrismre}@cs.stanford.edu
ABSTRACT
We perform the first study of the tradeoff space of access
methods and replication to support statistical analytics us-
ing first-order methods executed in the main memory of a
Non-Uniform Memory Access (NUMA) machine. Statistical
analytics systems differ from conventional SQL-analytics in
the amount and types of memory incoherence that they can
tolerate. Our goal is to understand tradeoffs in accessing the
data in row- or column-order and at what granularity one
should share the model and data for a statistical task. We
study this new tradeoff space and discover that there are
tradeoffs between hardware and statistical efficiency. We
argue that our tradeoff study may provide valuable infor-
mation for designers of analytics engines: for each system
we consider, our prototype engine can run at least one pop-
ular task at least 100× faster. We conduct our study across
five architectures using popular models, including SVMs, lo-
gistic regression, Gibbs sampling, and neural networks.
1. INTRODUCTION
Statistical data analytics is one of the hottest topics in
data-management research and practice. Today, even small
organizations have access to machines with large main memories,
whether rented via Amazon's EC2 or purchased at roughly $5/GB. As a
result, there has been a flurry of activity to support main-
memory analytics in both industry (Google Brain, Impala,
and Pivotal) and research (GraphLab, and MLlib). Each
of these systems picks one design point in a larger tradeoff
space. The goal of this paper is to define and explore this
space. We find that today’s research and industrial systems
under-utilize modern commodity hardware for analytics—
sometimes by two orders of magnitude. We hope that our
study identifies some useful design points for the next gen-
eration of such main-memory analytics systems.
Throughout, we use the term statistical analytics to refer
to those tasks that can be solved by first-order methods–a
class of iterative algorithms that use gradient information;
these methods are the core algorithm in systems such as ML-
lib, GraphLab, and Google Brain. Our study examines an-
alytics on commodity multi-socket, multi-core, non-uniform
memory access (NUMA) machines, which are the de facto
standard machine configuration and thus a natural target for
an in-depth study. Moreover, our experience with several
enterprise companies suggests that, after appropriate pre-
processing, a large class of enterprise analytics problems fit
into the main memory of a single, modern machine. While
this architecture has been recently studied for traditional
SQL-analytics systems [16], it has not been studied for sta-
tistical analytics systems.
Statistical analytics systems are different from traditional
SQL-analytics systems. In comparison to traditional SQL-
analytics, the underlying methods are intrinsically robust to
error. On the other hand, traditional statistical theory does
not consider which operations can be efficiently executed.
This leads to a fundamental tradeoff between statistical ef-
ficiency (how many steps are needed until convergence to
a given tolerance) and hardware efficiency (how efficiently
those steps can be carried out).
To describe such tradeoffs more precisely, we describe the
setup of the analytics tasks that we consider in this paper.
The input data is a matrix in RN×d and the goal is to find
a vector x ∈ Rd that minimizes some (convex) loss function,
say the logistic loss or the hinge loss (SVM). Typically, one
makes several complete passes over the data while updating
the model; we call each such pass an epoch. There may be
some communication at the end of the epoch, e.g., in bulk-
synchronous parallel systems such as Spark. We identify
three tradeoffs that have not been explored in the literature:
(1) access methods for the data, (2) model replication, and
(3) data replication. Current systems have picked one point
in this space; we explain each tradeoff and discover points
that have not been previously considered. Using these new
points, we can perform 100× faster than previously explored
points in the tradeoff space for several popular tasks.
Access Methods. Analytics systems access (and store) data
in either row-major or column-major order. For example,
systems that use stochastic gradient descent methods (SGD)
access the data row-wise; examples include MADlib [23]
in Impala and Pivotal, Google Brain [29], and MLlib in
Spark [47]; and stochastic coordinate descent methods (SCD)
access the data column-wise; examples include GraphLab [34],
Shogun [46], and Thetis [48]. These methods have essen-
tially identical statistical efficiency, but their wall-clock per-
formance can be radically different due to hardware efficiency.
[Figure 1 content: (a) DimmWitted's memory model, with data A (an N x d matrix) and model x; (b) pseudocode for SGD: given a data matrix A, stepsize α, and gradient function grad, initialize the model m and, while not converged, for each row z in A update m <- m - α·grad(z, m), then test convergence; (c) the three access methods (row-wise, column-wise, column-to-row) with their read sets of data and write sets of the model; (d) a NUMA architecture with two nodes, each with cores, per-core L1/L2 caches, a shared L3 cache, and local RAM.]
Figure 1: Illustration of (a) DimmWitted’s Memory Model, (b) Pseudocode for SGD, (c) Different Statistical
Methods in DimmWitted and Their Access Patterns, and (d) NUMA Architecture.
However, this tradeoff has not been systematically
studied. To study this tradeoff, we introduce a storage ab-
straction that captures the access patterns of popular statis-
tical analytics tasks and a prototype called DimmWitted.
In particular, we identify three access methods that are used
in popular analytics tasks, including standard supervised
machine learning models such as SVMs, logistic regression,
and least squares; and more advanced methods such as neu-
ral networks and Gibbs sampling on factor graphs. For dif-
ferent access methods for the same problem, we find that the
time to converge to a given loss can differ by up to 100×.
We also find that no access method dominates all oth-
ers, so an engine designer may want to include both access
methods. To show that it may be possible to support both
methods in a single engine, we develop a simple cost model
to choose among these access methods. We describe a sim-
ple cost model that selects a nearly optimal point in our
data sets, models, and different machine configurations.
Data and Model Replication. We study two sets of trade-
offs: the level of granularity, and the mechanism by which
mutable state and immutable data are shared in analytics
tasks. We describe the tradeoffs we explore in both (1) mu-
table state sharing, which we informally call model replica-
tion, and (2) data replication.
(1) Model Replication. During execution, there is some
state that the task mutates (typically an update to the
model). We call this state, which may be shared among
one or more processors, a model replica. We consider three
different granularities at which to share model replicas:
• The PerCore approach treats a NUMA machine as a
distributed system in which every core is treated as an
individual machine, e.g., in bulk-synchronous models
such as MLlib on Spark or event-driven systems such as
GraphLab. These approaches are the classical shared-
nothing and event-driven architectures, respectively.
In PerCore, the part of the model that is updated by
each core is only visible to that core until the end of
an epoch. This method is efficient and scalable from
a hardware perspective, but it is less statistically ef-
ficient, as there is only coarse-grained communication
between cores.
• The PerMachine approach acts as if each processor has
uniform access to memory. This approach is taken in
Hogwild! and Google Downpour [19]. In this method,
the hardware takes care of the coherence of the shared
state. The PerMachine method is statistically efficient
due to high communication rates, but it may cause
contention in the hardware, which may lead to subop-
timal running times.
• A natural hybrid is PerNode; this method uses the
fact that PerCore communication through the last-level
cache (LLC) is dramatically faster than communica-
tion through remote main memory. This method is
novel; for some models, PerNode can be an order of
magnitude faster.
Because model replicas are mutable, a key question is how
often should we synchronize model replicas? We find that it
is beneficial to synchronize the models as much as possible—
so long as we do not impede throughput to data in main
memory. A natural idea, then, is to use PerMachine sharing,
in which the hardware is responsible for synchronizing the
replicas. However, this decision can be suboptimal, as the
cache-coherence protocol may stall a processor to preserve
coherence, but this information may not be worth the cost
of a stall from a statistical efficiency perspective. We find
that the PerNode method, coupled with a simple technique
to batch writes across sockets, can dramatically reduce com-
munication and processor stalls. The PerNode method can
result in an over 10× runtime improvement. This technique
depends on the fact that we do not need to maintain the
model consistently: we are effectively delaying some updates
to reduce the total number of updates across sockets (which
lead to processor stalls).
(2) Data Replication. The data for analytics is immutable,
so there are no synchronization issues for data replication.
The classical approach is to partition the data to take ad-
vantage of higher aggregate memory bandwidth. However,
each partition may contain skewed data, which may slow
convergence. Thus, an alternate approach is to replicate the
data fully (say, per NUMA node). In this approach, each
node accesses that node’s data in a different order, which
means that the replicas provide non-redundant statistical
information; in turn, this reduces the variance of the esti-
mates based on the data in each replica. We find that for
some tasks, fully replicating the data four ways can converge
to the same loss almost 4× faster than the sharding strategy.
Summary of Contributions. We are the first to study the
three tradeoffs listed above for main-memory statistical an-
alytics systems. These tradeoffs are not intended to be an
exhaustive set of optimizations, but they demonstrate our
main conceptual point: treating NUMA machines as dis-
tributed systems or SMP is suboptimal for statistical ana-
lytics. We design a storage manager, DimmWitted, that
shows it is possible to exploit these ideas on real data sets.
Finally, we evaluate our techniques on multiple real datasets,
models, and architectures.
2. BACKGROUND
In this section, we describe the memory model for
DimmWitted, which provides a unified memory model to
implement popular analytics methods. Then, we recall some
basic properties of modern NUMA architectures.
Data for Analytics. The data for an analytics task is a
pair (A, x), which we call the data and the model, respec-
tively. For concreteness, we consider a matrix A ∈ RN×d.
In machine learning parlance, each row is called an exam-
ple. Thus, N is often the number of examples and d is often
called the dimension of the model. There is also a model,
typically a vector x ∈ Rd. The distinction is that the data
A is read-only, while the model vector, x, will be updated
during execution. From the perspective of this paper, the
important distinction we make is that data is an immutable
matrix, while the model (or portions of it) are mutable data.
First-Order Methods for Analytic Algorithms. DimmWit-
ted considers a class of popular algorithms called first-order
methods. Such algorithms make several passes over the data;
we refer to each such pass as an epoch. A popular exam-
ple algorithm is stochastic gradient descent (SGD), which
is widely used by web-companies, e.g., Google Brain [29]
and VowPal Wabbit [1], and in enterprise systems such as
Pivotal, Oracle, and Impala. Pseudocode for this method
is shown in Figure 1(b). During each epoch, SGD reads a
single example z; it uses the current value of the model and
z to estimate the derivative; and it then updates the model
vector with this estimate. It reads each example in this loop.
After each epoch, these methods test convergence (usually
by computing or estimating the norm of the gradient); this
computation requires a scan over the complete dataset.
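To make this loop concrete, the following C++ sketch runs one SGD epoch in the sense used above; the dense row-major layout and the grad callback are illustrative stand-ins, not DimmWitted's actual interface (which is described in Section 3.1).

    #include <cstddef>
    #include <vector>

    // One SGD epoch over a row-major data matrix A (N rows, d columns) with
    // stepsize alpha. grad(row, model, g) is assumed to write the gradient of
    // the loss on one example into g; layout and interface are illustrative.
    void sgd_epoch(const std::vector<std::vector<double>>& A,
                   std::vector<double>& model, double alpha,
                   void (*grad)(const std::vector<double>& row,
                                const std::vector<double>& model,
                                std::vector<double>& g)) {
      std::vector<double> g(model.size(), 0.0);
      for (const auto& row : A) {                  // read one example z
        grad(row, model, g);                       // gradient estimate at the current model
        for (std::size_t j = 0; j < model.size(); ++j)
          model[j] -= alpha * g[j];                // update the model with this estimate
      }
      // After the epoch, convergence is typically tested by scanning the data
      // to compute or estimate the norm of the gradient (or the loss).
    }

The rest of this paper is concerned with where the data A and the model live and how this loop shares and traverses them.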
2.1 Memory Models for Analytics
We design DimmWitted’s memory model to capture the
trend in recent high-performance sampling and statistical
methods. There are two aspects to this memory model: the
coherence level and the storage layout.
Coherence Level. Classically, memory systems are coher-
ent: reads and writes are executed atomically. For analytics
systems, we say that a memory model is coherent if reads
and writes of the entire model vector are atomic. That is,
access to the model is enforced by a critical section. How-
ever, many modern analytics algorithms are designed for an
incoherent memory model. The Hogwild! method showed
that one can run such a method in parallel without locking
but still provably converge. The Hogwild! memory model
relies on the fact that writes of individual components are
atomic, but it does not require that the entire vector be
updated atomically. However, atomicity at the level of the
cacheline is provided by essentially all modern processors.
Empirically, these results allow one to forgo costly locking
(and coherence) protocols. Similar algorithms have been
Algorithm                       Access Method                Implementation
Stochastic Gradient Descent     Row-wise                     MADlib, Spark, Hogwild!
Stochastic Coordinate Descent   Column-wise, Column-to-row   GraphLab, Shogun, Thetis
Figure 2: Algorithms and Their Access Methods.
proposed for other popular methods, including Gibbs sam-
pling [25,45], stochastic coordinate descent (SCD) [42,46],
and linear systems solvers [48]. This technique was applied
by Dean et al. [19] to solve convex optimization problems
with billions of elements in a model. This memory model is
distinct from the classical, fully coherent database execution.
The DimmWitted prototype allows us to specify that a
region of memory is coherent or not. This region of memory
may be shared by one or more processors. If the memory
is only shared per thread, then we can simulate a shared-
nothing execution. If the memory is shared per machine, we
can simulate Hogwild!.
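The distinction between the two coherence levels can be made concrete with a small sketch; the types and names below are illustrative, and the Hogwild!-style variant uses relaxed atomics to keep the example well-defined C++, whereas real implementations often rely directly on the hardware's atomicity of aligned per-component writes.

    #include <atomic>
    #include <cstddef>
    #include <mutex>
    #include <utility>
    #include <vector>

    // Coherent execution: access to the whole model vector is a critical section.
    void coherent_update(std::vector<double>& model, std::mutex& model_lock,
                         const std::vector<std::pair<std::size_t, double>>& delta) {
      std::lock_guard<std::mutex> guard(model_lock);
      for (const auto& [j, v] : delta) model[j] += v;
    }

    // Hogwild!-style incoherent execution: no lock is taken. Each component is
    // written atomically on its own, but other threads may observe the model in
    // a partially updated state, which these algorithms tolerate.
    void hogwild_update(std::vector<std::atomic<double>>& model,
                        const std::vector<std::pair<std::size_t, double>>& delta) {
      for (const auto& [j, v] : delta) {
        double old_val = model[j].load(std::memory_order_relaxed);
        model[j].store(old_val + v, std::memory_order_relaxed);
      }
    }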
Access Methods. We identify three distinct access paths
used by modern analytics systems, which we call row-wise,
column-wise, and column-to-row. They are graphically il-
lustrated in Figure 1(c). Our prototype supports all three
access methods. All of our methods perform several epochs,
that is, passes over the data. However, the algorithm may
iterate over the data row-wise or column-wise.
• In row-wise access, the system scans each row of the
table, applies a function to that row, and then updates the model.
This method may write to all components of the
model. Popular methods that use this access method
include stochastic gradient descent, gradient descent,
and higher-order methods (such as l-BFGS).
• In column-wise access, the system scans each column j
of the table. This method reads just the j component
of the model. The write set of the method is typically
a single component of the model. This method is used
by stochastic coordinate descent.
• In column-to-row access, the system iterates conceptu-
ally over the columns. This method is typically applied
to sparse matrices. When iterating on column j, it
will read all rows in which column j is non-zero. This
method also updates a single component of the model.
This method is used by non-linear support vector ma-
chines in GraphLab and is the de facto approach for
Gibbs sampling.
DimmWitted is free to iterate over rows or columns in es-
sentially any order (although typically some randomness in
the ordering is desired). Figure 2 classifies popular imple-
mentations by their access method.
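As a concrete illustration of the least familiar of the three, the sketch below walks a sparse matrix in column-to-row order; the CSR/CSC field names and the update callback are illustrative assumptions rather than DimmWitted's storage format.

    #include <cstddef>
    #include <vector>

    // Sparse matrix stored both column-major (CSC, to find the rows that touch a
    // column) and row-major (CSR, to read those rows); names are illustrative.
    struct SparseMatrix {
      std::size_t n_rows = 0, n_cols = 0;
      std::vector<std::size_t> csc_ptr, csc_row;   // rows with a non-zero in column j:
                                                   //   csc_row[csc_ptr[j] .. csc_ptr[j+1])
      std::vector<std::size_t> csr_ptr, csr_col;   // non-zeros of row i:
      std::vector<double>      csr_val;            //   indices csr_ptr[i] .. csr_ptr[i+1)
    };

    // One column-to-row epoch: for each column j, visit every row i with a_ij != 0
    // and update only model[j]. update() stands in for the model-specific logic
    // (e.g., a coordinate step for an SVM).
    template <typename UpdateFn>
    void column_to_row_epoch(const SparseMatrix& A, std::vector<double>& model,
                             UpdateFn update) {
      for (std::size_t j = 0; j < A.n_cols; ++j) {
        for (std::size_t k = A.csc_ptr[j]; k < A.csc_ptr[j + 1]; ++k) {
          std::size_t i = A.csc_row[k];    // a row in S(j)
          update(j, i, A, model);          // reads row i via CSR, writes model[j]
        }
      }
    }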
2.2 Architecture of NUMA Machines
We briefly describe the architecture of a modern NUMA
machine. As illustrated in Figure 1(d), a NUMA machine
contains multiple NUMA nodes. Each node has multiple
cores and processor caches, including the L3 cache. Each
Name (abbrv.)   #Node   #Cores/Node   RAM/Node (GB)   CPU Clock (GHz)   LLC (MB)
local2 (l2)     2       6             32              2.6               12
local4 (l4)     4       10            64              2.0               24
local8 (l8)     8       8             128             2.6               24
ec2.1 (e1)      2       8             122             2.6               20
ec2.2 (e2)      2       8             30              2.6               20
(Measured on local2: roughly 6GB/s from a worker to its local RAM and 11GB/s across the QPI.)
Figure 3: Summary of Machines and Memory Band-
width on local2 Tested with STREAM [9].
[Figure 4 diagram: the optimizer maps the data A and the model x onto an execution plan of data replicas, model replicas, and workers; each worker reads from a data replica and updates a model replica.]
Figure 4: Illustration of DimmWitted’s Engine.
node is directly connected to a region of DRAM. NUMA
nodes are connected to each other by buses on the main
board; in our case, this connection is the Intel Quick Path
Interconnects (QPIs), which has a bandwidth as high as
25.6GB/s.1 To access DRAM regions of other NUMA nodes,
data is transferred across NUMA nodes using the QPI. These
NUMA architectures are cache coherent, and the coherency
actions use the QPI. Figure 3 describes the configuration
of each machine that we use in this paper. Machines con-
trolled by us have names with the prefix “local”; the other
machines are Amazon EC2 configurations.
3. THE DIMMWITTED ENGINE
We describe the tradeoff space that DimmWitted’s op-
timizer considers, namely (1) access method selection, (2)
model replication, and (3) data replication. To help un-
derstand the statistical-versus-hardware tradeoff space, we
present some experimental results in a Tradeoffs paragraph
within each subsection. We describe implementation details
for DimmWitted in the full version of this paper.
3.1 System Overview
We describe analytics tasks in DimmWitted and the ex-
ecution model of DimmWitted given an analytics task.
System Input. For each analytics task that we study, we
assume that the user provides data A ∈ RN×d and an initial
model that is a vector of length d. In addition, for each
access method listed above, there is a function of an ap-
propriate type that solves the same underlying model. For
example, we provide both a row- and column-wise way of
solving a support vector machine. Each method takes two
arguments; the first is a pointer to a model.
1www.intel.com/content/www/us/
en/io/quickpath-technology/
quick-path-interconnect-introduction-paper.html
Tradeoff            Strategies         Existing Systems
Access Methods      Row-wise           SP, HW
                    Column-wise        GL
                    Column-to-row
Model Replication   Per Core           GL, SP
                    Per Node
                    Per Machine        HW
Data Replication    Sharding           GL, SP, HW
                    Full Replication
Figure 5: A Summary of DimmWitted’s Tradeoffs
and Existing Systems (GraphLab (GL), Hogwild!
(HW), Spark (SP)).
• frow captures the row-wise access method, and its
second argument is the index of a single row.
• fcol captures the column-wise access method, and its
second argument is the index of a single column.
• fctr captures the column-to-row access method, and
its second argument is a pair of one column index and
a set of row indexes. These rows correspond to the
non-zero entries in a data matrix for a single column.2
Each of these functions modifies, in place, the model to which
it receives a pointer. However, in our study, frow can
modify the whole model, while fcol and fctr only modify a
single variable of the model. We call the above tuple of func-
tions a model specification. Note that a model specification
contains either fcol or fctr but typically not both.
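One plausible way to package such a model specification is sketched below; the types and the f_row/f_col/f_ctr names mirror the functions just described but are otherwise illustrative, not DimmWitted's exact signatures.

    #include <cstddef>
    #include <functional>
    #include <utility>
    #include <vector>

    // A model specification bundles the per-access-method update functions.
    // Each function mutates, in place, the model it receives a pointer to.
    struct ModelSpec {
      // Row-wise: the second argument is the index of a single row; the function
      // may modify the whole model.
      std::function<void(std::vector<double>* model, std::size_t row)> f_row;
      // Column-wise: the second argument is the index of a single column; the
      // function modifies only that component of the model.
      std::function<void(std::vector<double>* model, std::size_t col)> f_col;
      // Column-to-row: the second argument is a pair (j, S(j)), a column index
      // and the rows that are non-zero in that column; modifies only component j.
      std::function<void(std::vector<double>* model,
                         const std::pair<std::size_t, std::vector<std::size_t>>& col_rows)>
          f_ctr;
    };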
Execution. Given a model specification, our goal is to gen-
erate an execution plan. An execution plan, schematically
illustrated in Figure 4, specifies three things for each CPU
core in the machine: (1) a subset of the data matrix to op-
erate on, (2) a replica of the model to update, and (3) the
access method used to update the model. We call the set of
replicas of data and models locality groups, as the replicas
are described physically; i.e., they correspond to regions of
memory that are local to particular NUMA nodes, and one
or more workers may be mapped to each locality group. The
data assigned to distinct locality groups may overlap. We
use DimmWitted’s engine to explore three tradeoffs:
(1) Access methods, in which we can select between
either the row or column method to access the data.
(2) Model replication, in which we choose how to create
and assign replicas of the model to each worker. When
a worker needs to read or write the model, it will read
or write the model replica that it is assigned.
(3) Data replication, in which we choose a subset of data
tuples for each worker. The replicas may be overlap-
ping, disjoint, or some combination.
Figure 5 summarizes the tradeoff space. In each section,
we illustrate the tradeoff along two axes, namely (1) the
statistical efficiency, i.e., the number of epochs it takes to
converge, and (2) hardware efficiency, the time that each
method takes to finish a single epoch.
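For illustration, the per-core portion of an execution plan described above might be represented as follows; the field names are hypothetical, not DimmWitted's actual data structures.

    #include <cstddef>
    #include <vector>

    // A per-core slice of an execution plan (sketch). The optimizer fills in one
    // of these for every core: which data to touch, which model replica to
    // update, and which access method to use.
    enum class AccessMethod { RowWise, ColumnWise, ColumnToRow };

    struct CorePlan {
      int core_id;                            // worker placement
      int numa_node;                          // locality group of this worker
      std::vector<std::size_t> data_indices;  // rows (or columns) assigned to the worker
      double* model_replica;                  // replica the worker reads and writes
      AccessMethod method;                    // traversal used to update the model
    };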
2Define S(j) = {i : aij ≠ 0}. For a column j, the input to
fctr is a pair (j, S(j)).
Algorithm        Read        Write (Dense)   Write (Sparse)
Row-wise         ∑i ni       dN              ∑i ni
Column-wise      ∑i ni       d               d
Column-to-row    ∑i ni²      d               d
Figure 6: Per Epoch Execution Cost of Row- and
Column-wise Access. The Write column is for a single
model replica. Given a dataset A ∈ RN×d, let ni be the
number of non-zero elements of the ith row ai.
[Figure 7 plots: (a) number of epochs to converge for row-wise vs. column-wise access on four models/datasets (SVM1, SVM2, LP1, LP2); (b) time per epoch (seconds) vs. the cost ratio for row-wise and column-wise access.]
Figure 7: Illustration of the Method Selection
Tradeoff. (a) These four datasets are RCV1, Reuters,
Amazon, and Google, respectively. (b) The "cost ratio"
is defined as the ratio of costs estimated for the row-wise and
column-wise methods: (1 + α) ∑i ni / (∑i ni² + αd), where ni
is the number of non-zero elements of the ith row of A and
α is the cost ratio between writes and reads. We set
α = 10 to plot this graph.
3.2 Access Method Selection
In this section, we examine each access method: row-wise,
column-wise, and column-to-row. We find that the execu-
tion time of an access method depends more on hardware
efficiency than on statistical efficiency.
Tradeoffs. We consider the two tradeoffs that we use for
a simple cost model (Figure 6). Let ni be the number of
non-zeros in row i; when we store the data as sparse vec-
tors/matrices in CSR format, the number of reads in a row-
wise access method is ∑N
i=1 ni. Since each example is likely
to be written back in a dense write, we perform dN writes
per epoch. Our cost model combines these two costs lin-
early with a factor α that accounts for writes being more
expensive, on average, because of contention. The factor α
is estimated at installation time by measuring on a small set
of datasets. The parameter α is in 4 to 12 and grows with
the number of sockets; e.g., for local2, α ≈ 4, and for local8,
α ≈ 12. Thus, α may increase in the future.
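Under these assumptions, the per-epoch cost estimates and the resulting choice can be written down directly; this sketch follows the cost ratio given in the caption of Figure 7 and assumes sparse (per-coordinate) writes for the row-wise method.

    #include <cstddef>
    #include <vector>

    // Per-epoch cost estimates used to pick an access method (sketch). n[i] is
    // the number of non-zeros in row i, d is the model dimension, and alpha is
    // the measured write/read cost ratio.
    double row_wise_cost(const std::vector<std::size_t>& n, double alpha) {
      double sum_ni = 0;
      for (std::size_t ni : n) sum_ni += ni;
      return (1.0 + alpha) * sum_ni;                       // reads + alpha * sparse writes
    }

    double column_wise_cost(const std::vector<std::size_t>& n, std::size_t d,
                            double alpha) {
      double sum_ni_sq = 0;
      for (std::size_t ni : n) sum_ni_sq += static_cast<double>(ni) * ni;
      return sum_ni_sq + alpha * static_cast<double>(d);   // reads + alpha * writes
    }

    // Choose the cheaper access method for this dataset and machine.
    bool prefer_row_wise(const std::vector<std::size_t>& n, std::size_t d,
                         double alpha) {
      return row_wise_cost(n, alpha) < column_wise_cost(n, d, alpha);
    }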
Statistical Efficiency. We observe that each access
method has comparable statistical efficiency. To illustrate
this, we run all methods on all of our datasets and report
the number of epochs each method needs to converge to a given
error relative to the optimal loss; Figure 7(a) shows the result
on four datasets with 10% error. We see that the gap in the
number of epochs across different methods is small (always
within 50% of each other).
Hardware Efficiency. Different access methods can
change the time per epoch by up to a factor of 10×, and
there is a cross-over point. To see this, we run both meth-
ods on a series of synthetic datasets where we control the
number of non-zero elements per row by subsampling each
row on the Music dataset (see Section 4 for more details).
For each subsampled dataset, we plot the cost ratio on the
x-axis, and we plot their actual running time per epoch in
Figure 7(b). We see a cross-over point on the time used
per epoch: when the cost ratio is small, row-wise outper-
forms column-wise by 6×, as the column-wise method reads
more data; on the other hand, when the ratio is large, the
column-wise method outperforms the row-wise method by
3×, as the column-wise method has lower write contention.
We observe similar cross-over points on our other datasets.
Cost-based Optimizer. DimmWitted estimates the exe-
cution time of different access methods using the number of
bytes that each method reads and writes in one epoch, as
shown in Figure 6. For writes, it is slightly more complex:
for models such as SVM, each gradient step in row-wise ac-
cess only updates the coordinates where the input vector
contains non-zero elements. We call this scenario a sparse
update; otherwise, it is a dense update.
DimmWitted needs to estimate the ratio of the cost of
reads to writes. To do this, it runs a simple benchmark at
installation time. We find that, for all eight datasets, five statis-
tical models, and five machines that we use in the experi-
ments, the cost model is robust to this parameter: as long
as writes are 4× to 100× more expensive than reading, the
cost model makes the correct decision between row-wise and
column-wise access.
3.3 Model Replication
In DimmWitted, we consider three model replication
strategies. The first two strategies, namely PerCore and
PerMachine, are similar to traditional shared-nothing and
shared-memory architecture, respectively. We also consider
a hybrid strategy, PerNode, designed for NUMA machines.
3.3.1 Granularity of Model Replication
The difference between the three model replication strate-
gies is the granularity of replicating a model. We first de-
scribe PerCore and PerMachine and their relationship with
other existing systems (Figure 5). We then describe PerN-
ode, a simple, novel hybrid strategy that we designed to
leverage the structure of NUMA machines.
PerCore. In the PerCore strategy, each core maintains a mu-
table state, and these states are combined to form a new
version of the model (typically at the end of each epoch).
This is essentially a shared-nothing architecture; it is imple-
mented in Impala, Pivotal, and Hadoop-based frameworks.
PerCore is popularly implemented by state-of-the-art sta-
tistical analytics frameworks such as Bismarck, Spark, and
GraphLab. There are subtle variations to this approach: in
Bismarck’s implementation, each worker processes a parti-
tion of the data, and its model is averaged at the end of
each epoch; Spark implements a minibatch-based approach
in which parallel workers calculate the gradient based on
examples, and then gradients are aggregated by a single
thread to update the final model; GraphLab implements
an event-based approach where each different task is dy-
namically scheduled to satisfy the given consistency require-
ment. In DimmWitted, we implement PerCore in a way
that is similar to Bismarck, where each worker has its own
model replica, and each worker is responsible for updating
[Figure 8 plots: (a) number of epochs to converge vs. error to the optimal loss (1%-100%) for PerCore, PerNode, and PerMachine; (b) time per epoch (seconds) for the three model replication strategies.]
Figure 8: Illustration of Model Replication.
its replica.3
As we will show in the experiment section,
DimmWitted's implementation is 3-100× faster than either
GraphLab or Spark. Both systems have additional sources
of overhead that DimmWitted does not, e.g., for fault tol-
erance in Spark and a distributed environment in both. We
are not making an argument about the relative merits of
these features in applications, only that they would obscure
the tradeoffs that we study in this paper.
PerMachine. In the PerMachine strategy, there is a single
model replica that all workers update during execution. Per-
Machine is implemented in Hogwild! and Google’s Down-
pour. Hogwild! implements a lock-free protocol, which
forces the hardware to deal with coherence. Although differ-
ent writers may overwrite each other and readers may have
dirty reads, Niu et al. [38] prove that Hogwild! converges.
PerNode. The PerNode strategy is a hybrid of PerCore and
PerMachine. In PerNode, each NUMA node has a single
model replica that is shared among all cores on that node.
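On Linux, per-node replicas of this kind can be placed in node-local DRAM with libnuma; the sketch below (link with -lnuma) is a minimal illustration under that assumption, not DimmWitted's actual allocator.

    #include <cstddef>
    #include <cstring>
    #include <numa.h>     // Linux libnuma; link with -lnuma
    #include <vector>

    // Allocate one model replica of dimension d in the local DRAM of each NUMA
    // node (the PerNode strategy). Workers on a node are then pointed at their
    // node's replica.
    std::vector<double*> allocate_per_node_replicas(std::size_t d) {
      std::vector<double*> replicas;
      if (numa_available() < 0) return replicas;          // no NUMA support
      int nodes = numa_num_configured_nodes();
      for (int node = 0; node < nodes; ++node) {
        void* mem = numa_alloc_onnode(d * sizeof(double), node);
        std::memset(mem, 0, d * sizeof(double));
        replicas.push_back(static_cast<double*>(mem));    // free later with numa_free()
      }
      return replicas;
    }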
Model Synchronization. Deciding how often the replicas
synchronize is key to the design. In Hadoop-based and
Bismarck-based models, they synchronize at the end of each
epoch. This is a shared-nothing approach that works well in
user-defined aggregations. However, we consider finer gran-
ularities of sharing. In DimmWitted, we chose to have one
thread that periodically reads models on all other cores, av-
erages their results, and updates each replica.
One key question for model synchronization is how frequently
the model should be synchronized. Intuitively, we
might expect that more frequent synchronization will lower
the throughput; on the other hand, the more frequently we
synchronize, the fewer iterations we might need
to converge. However, in DimmWitted, we find that the
optimal choice is to communicate as frequently as possi-
ble. The intuition is that the QPI has staggering band-
width (25GB/s) compared to the small amount of data we
are shipping (megabytes). As a result, in DimmWitted, we
implement an asynchronous version of the model averaging
protocol: a separate thread averages models, with the effect
of batching many writes together across the cores into one
write, reducing the number of stalls.
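A minimal sketch of such an asynchronous averaging thread is shown below; it assumes PerNode replicas stored as plain vectors and deliberately takes no locks, since the delayed (and possibly torn) reads it performs are exactly the incoherence these methods tolerate.

    #include <algorithm>
    #include <atomic>
    #include <chrono>
    #include <cstddef>
    #include <thread>
    #include <vector>

    // Background model-averaging loop for PerNode replicas (sketch). It
    // repeatedly reads every replica, averages them, and writes the average
    // back, batching many cross-socket writes into one. Workers keep updating
    // their replicas while this runs.
    void averaging_loop(std::vector<std::vector<double>>& replicas,
                        std::atomic<bool>& stop) {
      const std::size_t d = replicas[0].size();
      std::vector<double> avg(d);
      while (!stop.load()) {
        std::fill(avg.begin(), avg.end(), 0.0);
        for (const auto& r : replicas)
          for (std::size_t j = 0; j < d; ++j) avg[j] += r[j];
        for (std::size_t j = 0; j < d; ++j) avg[j] /= replicas.size();
        for (auto& r : replicas)
          for (std::size_t j = 0; j < d; ++j) r[j] = avg[j];
        // "As frequently as possible": only a tiny pause between rounds.
        std::this_thread::sleep_for(std::chrono::microseconds(100));
      }
    }

Typical use would be to launch it once per machine, e.g., std::thread t(averaging_loop, std::ref(replicas), std::ref(stop)), and set stop when training finishes.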
3We implemented MLlib's minibatch in DimmWitted. We find that
the Hogwild!-like implementation always dominates the minibatch
implementation. DimmWitted's column-wise implementation for
PerMachine is similar to GraphLab, with the only difference that
DimmWitted does not schedule the task in an event-driven way.
[Figure 9 plots: (a) number of epochs to converge vs. error to the optimal loss (1%-100%) for Sharding and FullReplication; (b) time per epoch (seconds) for the two strategies on local2, local4, and local8.]
Figure 9: Illustration of Data Replication.
Tradeoffs. We observe that PerNode is more hardware ef-
ficient, as it takes less time to execute an epoch than Per-
Machine; PerMachine might use fewer epochs to
converge than PerNode.
Statistical Efficiency. We observe that PerMachine usu-
ally takes fewer epochs to converge to the same loss com-
pared to PerNode, and PerNode uses fewer epochs
than PerCore. To illustrate this observation, Figure 8(a)
shows the number of epochs that each strategy requires to
converge to a given loss for SVM (RCV1). We see that
PerMachine always uses the least number of epochs to con-
verge to a given loss: intuitively, the single model replica
has more information at each step, which means that there
is less redundant work. We observe similar phenomena when
comparing PerCore and PerNode.
Hardware Efficiency. We observe that PerNode uses
much less time to execute an epoch than PerMachine. To
illustrate the difference in the time that each model replica-
tion strategy uses to finish one epoch, we show in Figure 8(b)
the execution time of three strategies on SVM (RCV1). We
see that PerNode is 23× faster than PerMachine and that Per-
Core is 1.5× faster than PerNode. PerNode takes advantage
of the locality provided by the NUMA architecture. Using
PMUs, we find that PerMachine incurs 11× more cross-node
DRAM requests than PerNode.
Rule of Thumb. For SGD-based models, PerNode usually
gives optimal results, while for SCD-based models, PerMa-
chine does. Intuitively, this is because SGD has a denser
update pattern than SCD, so PerMachine suffers in hardware
efficiency.
3.4 Data Replication
In DimmWitted, each worker processes a subset of data
and then updates its model replica. To assign a subset of
data to each worker, we consider two strategies.
Sharding. Sharding is a popular strategy implemented in sys-
tems such as Hogwild!, Spark, and Bismarck, in which the
dataset is partitioned, and each worker only works on its par-
tition of data. When there is a single model replica, Sharding
avoids wasted computation, as each tuple is processed once
per epoch. However, when there are multiple model repli-
cas, Sharding might increase the variance of the estimate
we form on each node, lowering the statistical efficiency. In
DimmWitted, we implement Sharding by randomly parti-
tioning the rows (resp. columns) of a data matrix for the
row-wise (resp. column-wise) access method. In column-to-
row access, we also replicate other rows that are needed.
FullReplication. A simple alternative to Sharding is FullRepli-
cation, in which we replicate the whole dataset many times
Model      Dataset   #Row   #Col.   NNZ    Size (Sparse)   Size (Dense)
SVM/LR/LS  RCV1      781K   47K     60M    914MB           275GB
           Reuters   8K     18K     93K    1.4MB           1.2GB
           Music     515K   91      46M    701MB           0.4GB
           Forest    581K   54      30M    490MB           0.2GB
LP         Amazon    926K   335K    2M     28MB            >1TB
           Google    2M     2M      3M     25MB            >1TB
QP         Amazon    1M     1M      7M     104MB           >1TB
           Google    2M     2M      10M    152MB           >1TB
Gibbs      Paleo     69M    30M     108M   2GB             >1TB
NN         MNIST     120M   800K    120M   2GB             >1TB
Figure 10: Dataset Statistics. NNZ refers to the
Number of Non-zero elements. The # columns is
equal to the number of variables in the model.
(PerCore or PerNode). In PerNode, each NUMA node will
have a full copy of the data. Each node accesses its data
in a different order, which means that the replicas provide
non-redundant statistical information. Statistically, there
are two benefits of FullReplication: (1) averaging different
estimates from each node has a lower variance, and (2) the
estimate at each node has lower variance than in the Shard-
ing case, as each node’s estimate is based on the whole data.
From a hardware efficiency perspective, reads are more fre-
quent from local NUMA memory in PerNode than in Per-
Machine. The PerNode approach dominates the PerCore ap-
proach, as reads from the same node go to the same NUMA
memory. Thus, we do not consider PerCore replication from
this point on.
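The two data-replication strategies amount to a different row assignment per NUMA node; the small sketch below illustrates this under stated assumptions (rows are assigned by index, and the node id seeds the node-specific order).

    #include <algorithm>
    #include <cstddef>
    #include <numeric>
    #include <random>
    #include <vector>

    // Row assignment for one NUMA node (sketch). Under Sharding, node `node` of
    // `num_nodes` sees only its partition; under FullReplication, every node
    // sees all N rows, shuffled in a node-specific order so that the replicas
    // contribute non-redundant statistical information.
    std::vector<std::size_t> rows_for_node(std::size_t N, std::size_t node,
                                           std::size_t num_nodes,
                                           bool full_replication) {
      std::vector<std::size_t> rows;
      if (full_replication) {
        rows.resize(N);
        std::iota(rows.begin(), rows.end(), std::size_t{0});
        std::mt19937 rng(static_cast<unsigned>(node));    // node-specific ordering
        std::shuffle(rows.begin(), rows.end(), rng);
      } else {
        std::size_t begin = (N * node) / num_nodes;
        std::size_t end = (N * (node + 1)) / num_nodes;
        for (std::size_t i = begin; i < end; ++i) rows.push_back(i);
      }
      return rows;
    }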
Tradeoffs. Not surprisingly, we observe that FullReplication
takes more time for each epoch than Sharding. However,
we also observe that FullReplication uses fewer epochs than
Sharding, especially to achieve low error. We illustrate these
two observations by showing the result of running SVM on
Reuters using PerNode in Figure 9.
Statistical Efficiency. FullReplication uses fewer epochs,
especially at low error tolerances. Figure 9(a) shows the
number of epochs that each strategy takes to converge to a
given loss. We see that, for within 1% of the loss, FullRepli-
cation uses 10× fewer epochs on a two-node machine. This
is because each model replica sees more data than Shard-
ing, and therefore has a better estimate. Because of this
difference in the number of epochs, FullReplication is 5×
faster in wall-clock time than Sharding to converge to 1%
loss. However, we also observe that, in high-error regions,
FullReplication uses more epochs than Sharding, resulting in a
comparable execution time to reach a given loss.
Hardware Efficiency. Figure 9(b) shows the time for
each epoch across different machines with different numbers
of nodes. Because we are using the PerNode strategy, which
is the optimal choice for this dataset, the more nodes a ma-
chine has, the slower FullReplication is for each epoch. The
slow-down is roughly consistent with the number of nodes
on each machine. This is not surprising because each epoch
of FullReplication processes more data than Sharding.
4. EXPERIMENTS
We validate that exploiting the tradeoff space that we de-
scribed enables DimmWitted’s orders of magnitude speedup
over state-of-the-art competitor systems. We also validate
that each tradeoff discussed in this paper affects the perfor-
mance of DimmWitted.
4.1 Experiment Setup
We describe the details of our experimental setting.
Datasets and Statistical Models. We validate the per-
formance and quality of DimmWitted on a diverse set of
statistical models and datasets. For statistical models, we
choose five models that are among the most popular models
used in statistical analytics: (1) Support Vector Machine
(SVM), (2) Logistic Regression (LR), (3) Least Squares
Regression (LS), (4) Linear Programming (LP), and (5)
Quadratic Programming (QP). For each model, we choose
datasets with different characteristics, including size, spar-
sity, and under- or over-determination. For SVM, LR, and
LS, we choose four datasets: Reuters4, RCV15, Music6, and
Forest.7 Reuters and RCV1 are datasets for text classifica-
tion that are sparse and underdetermined. Music and Forest
are standard benchmark datasets that are dense and overde-
termined. For LP and QP, we consider a social-network ap-
plication, i.e., network analysis, and use two datasets from
Amazon’s customer data and Google’s Google+ social net-
works.8 Figure 10 shows the dataset statistics.
Metrics. We measure the quality and performance of
DimmWitted and other competitors. To measure the qual-
ity, we follow prior art and use the loss function for all
models. For end-to-end performance, we measure the wall-
clock time it takes for each system to converge to a loss that
is within 100%, 50%, 10%, and 1% of the optimal loss.9
When measuring the wall-clock time, we do not count the
time used for data loading and result outputting for all sys-
tems. We also use other measurements to understand the
details of the tradeoff space, including (1) local LLC request,
(2) remote LLC request, and (3) local DRAM request. We
use Intel Performance Monitoring Units (PMUs) and follow
the manual10 to conduct these experiments.
Experiment Setting. We compare DimmWitted with four
competitor systems: GraphLab [34], GraphChi [28], ML-
lib [47] over Spark [55], and Hogwild! [38]. GraphLab is
a distributed graph processing system that supports a large
range of statistical models. GraphChi is similar to GraphLab
but with a focus on multi-core machines with secondary stor-
age. MLlib is a package of machine learning algorithms im-
plemented over Spark, an in-memory implementation of the
MapReduce framework. Hogwild! is an in-memory lock-
free framework for statistical analytics. We find that all four
systems pick some points in the tradeoff space that we con-
sidered in DimmWitted. In GraphLab and GraphChi, all
models are implemented using stochastic coordinate descent
4archive.ics.uci.edu/ml/datasets/Reuters-21578+
Text+Categorization+Collection
5about.reuters.com/researchandstandards/corpus/
6archive.ics.uci.edu/ml/datasets/YearPredictionMSD
7archive.ics.uci.edu/ml/datasets/Covertype
8snap.stanford.edu/data/
9We obtain the optimal loss by running all systems for one
hour and choosing the lowest.
10software.intel.com/en-us/articles/
performance-monitoring-unit-guidelines
                 Within 1% of the Optimal Loss                       Within 50% of the Optimal Loss
     Dataset     GraphLab  GraphChi  MLlib    Hogwild!  DW           GraphLab  GraphChi  MLlib    Hogwild!  DW
SVM  Reuters     58.9      56.7      15.5     0.1       0.1          13.6      11.2      0.6      0.01      0.01
     RCV1        > 300.0   > 300.0   > 300    61.4      26.8         > 300.0   > 300.0   58.0     0.71      0.17
     Music       > 300.0   > 300.0   156      33.32     23.7         31.2      27.1      7.7      0.17      0.14
     Forest      16.2      15.8      2.70     0.23      0.01         1.9       1.4       0.15     0.03      0.01
LR   Reuters     36.3      34.2      19.2     0.1       0.1          13.2      12.5      1.2      0.03      0.03
     RCV1        > 300.0   > 300.0   > 300.0  38.7      19.8         > 300.0   > 300.0   68.0     0.82      0.20
     Music       > 300.0   > 300.0   > 300.0  35.7      28.6         30.2      28.9      8.9      0.56      0.34
     Forest      29.2      28.7      3.74     0.29      0.03         2.3       2.5       0.17     0.02      0.01
LS   Reuters     132.9     121.2     92.5     4.1       3.2          16.3      16.7      1.9      0.17      0.09
     RCV1        > 300.0   > 300.0   > 300    27.5      10.5         > 300.0   > 300.0   32.0     1.30      0.40
     Music       > 300.0   > 300.0   221      40.1      25.8         > 300.0   > 300.0   11.2     0.78      0.52
     Forest      25.5      26.5      1.01     0.33      0.02         2.7       2.9       0.15     0.04      0.01
LP   Amazon      2.7       2.4       > 120.0  > 120.0   0.94         2.7       2.1       120.0    1.86      0.94
     Google      13.4      11.9      > 120.0  > 120.0   12.56        2.3       2.0       120.0    3.04      2.02
QP   Amazon      6.8       5.7       > 120.0  > 120.0   1.8          6.8       5.7       > 120.0  > 120.00  1.50
     Google      12.4      10.1      > 120.0  > 120.0   4.3          9.9       8.3       > 120.0  > 120.00  3.70
Figure 11: End-to-End Comparison (time in seconds). The column DW refers to DimmWitted. We take 5
runs on local2 and report the average (standard deviation for all numbers < 5% of the mean). Entries with
> indicate a timeout.
[Figure 12 plots, one column per workload (SVM on RCV1, SVM on Music, LP on Amazon, LP on Google), showing time in seconds vs. error to the optimal loss (1%-100%): row (a) compares access methods (Row-wise vs. Column-wise); row (b) compares model replication strategies (PerNode, PerMachine, PerCore).]
Figure 12: Tradeoffs in DimmWitted. Missing points timeout in 120 seconds.
(column-wise access); in MLlib and Hogwild!, SVM and LR
are implemented using stochastic gradient descent (row-wise
access). We use implementations that are provided by the
original developers whenever possible. For models with-
out code provided by the developers, we only change the
corresponding gradient function.11
For GraphChi, if the
corresponding model is implemented in GraphLab but not
GraphChi, we follow GraphLab’s implementation.
We run experiments on a variety of architectures. These
machines differ in a range of configurations, including the
number of NUMA nodes, the size of last-level cache (LLC),
and memory bandwidth. See Figure 3 for a summary of
these machines. DimmWitted, Hogwild!, GraphLab, and
GraphChi are implemented using C++, and MLlib/Spark
is implemented using Scala. We tune both GraphLab and
MLlib according to their best practice guidelines.12 For both
11For sparse models, we change the dense vector data struc-
ture in MLlib to a sparse vector, which only improves its
performance.
12MLlib:spark.incubator.apache.org/docs/0.6.0/
tuning.html; GraphLab:
graphlab.org/tutorials-2/
fine-tuning-graphlab-performance/.
For GraphChi,
we tune the memory buffer size to ensure all data fit in
memory and that there are no disk I/Os. We describe more
detailed tuning for MLlib in the full version of this paper.
GraphLab, GraphChi, and MLlib, we try different ways of
increasing locality on NUMA machines, including trying to
use numactl and implementing our own RDD for MLlib;
there is more detail in the full version of this paper. Systems
are compiled with g++ 4.7.2 (-O3), Java 1.7, or Scala 2.9.
4.2 End-to-End Comparison
We validate that DimmWitted outperforms competitor
systems in terms of end-to-end performance and quality.
Note that both MLlib and GraphLab have extra overhead
for fault tolerance, distributing work, and task scheduling.
Our comparison between DimmWitted and these competi-
tors is intended only to demonstrate that existing work for
statistical analytics has not obviated the tradeoffs that we
study here.
Protocol. For each system, we grid search their statistical
parameters, including step size ({100.0,10.0,...,0.0001}) and
mini-batch size for MLlib ({1%, 10%, 50%, 100%}); we al-
ways report the best configuration, which is essentially the
same for each system. We measure the time it takes for each
system to find a solution that is within 1%, 10%, and 50%
of the optimal loss. Figure 11 shows the results for 1% and
50%; the results for 10% are similar. We report end-to-end
numbers from local2, which has two nodes and 24 logical
cores, as GraphLab does not run on machines with more
than 64 logical cores. Figure 14 shows the DimmWitted’s
choice of point in the tradeoff space on local2.
As shown in Figure 11, DimmWitted always converges
to the given loss in less time than the other competitors.
On SVM and LR, DimmWitted could be up to 10× faster
than Hogwild!, and more than two orders of magnitude
faster than GraphLab and Spark. The difference between
DimmWitted and Hogwild! is greater for LP and QP, where
DimmWitted outperforms Hogwild! by more than two or-
ders of magnitude. On LP and QP, DimmWitted is also up
to 3× faster than GraphLab and GraphChi, and two orders
of magnitude faster than MLlib.
Tradeoff Choices. We dive more deeply into these numbers
to substantiate our claim that there are some points in the
tradeoff space that are not used by GraphLab, GraphChi,
Hogwild!, and MLlib. Each tradeoff selected by our sys-
tem is shown in Figure 14. For example, GraphLab and
GraphChi use column-wise access for all models, while ML-
lib and Hogwild! use row-wise access for all models and al-
low only PerMachine model replication. These special points
work well for some but not all models. For example, for LP
and QP, GraphLab and GraphChi are only 3× slower than
DimmWitted, which chooses column-wise and PerMachine.
This factor of 3 is to be expected, as GraphLab also allows
distributed access and so has additional overhead. However,
there are other points in the tradeoff space: for SVM and LR, DimmWitted
outperforms GraphLab and GraphChi, because the column-
wise algorithm implemented by GraphLab and GraphChi is
not as efficient as row-wise on the same dataset. DimmWit-
ted outperforms Hogwild! because DimmWitted takes ad-
vantage of model replication, while Hogwild! incurs 11×
more cross-node DRAM requests than DimmWitted; in
contrast, DimmWitted incurs 11× more local DRAM re-
quests than Hogwild! does.
For SVM, LR, and LS, we find that DimmWitted out-
performs MLlib, primarily due to a different point in the
tradeoff space. In particular, MLlib uses batch-gradient-
descent with a PerCore implementation, while DimmWitted
uses stochastic gradient and PerNode. We find that, for the
Forest dataset, DimmWitted takes 60× fewer
epochs to converge to 1% loss than MLlib. For each epoch,
DimmWitted is 4× faster. These two factors contribute to
the 240× speed-up of DimmWitted over MLlib on the For-
est dataset (1% loss). MLlib has overhead for scheduling, so
we break down the time that MLlib uses for scheduling and
computation. We find that, for Forest, out of the total 2.7
seconds of execution, MLlib uses 1.8 seconds for computa-
tion and 0.9 seconds for scheduling. We also implemented
a batch-gradient-descent and PerCore implementation inside
DimmWitted to remove these overheads and the C++-versus-Scala
differences. The 60× difference in the number of epochs until
convergence still holds, and our implementation is only 3×
faster than MLlib. This implies that the main difference be-
tween DimmWitted and MLlib is the point in the tradeoff
space—not low-level implementation differences.
For LP and QP, DimmWitted outperforms MLlib and
Hogwild! because the row-wise access method implemented
by these systems is not as efficient as column-wise access on
the same data set. GraphLab does have column-wise access,
so DimmWitted outperforms GraphLab and GraphChi be-
cause DimmWitted finishes each epoch up to 3× faster,
             SVM      LR       LS       LP        QP        Parallel
             (RCV1)   (RCV1)   (RCV1)   (Google)  (Google)  Sum
GraphLab     0.2      0.2      0.2      0.2       0.1       0.9
GraphChi     0.3      0.3      0.2      0.2       0.2       1.0
MLlib        0.2      0.2      0.2      0.1       0.02      0.3
Hogwild!     1.3      1.4      1.3      0.3       0.2       13
DimmWitted   5.1      5.2      5.2      0.7       1.3       21
Figure 13: Comparison of Throughput (GB/seconds) of Different Systems on local2.
Model        Dataset                 Access Method   Model Replication   Data Replication
SVM, LR, LS  Reuters, RCV1, Music    Row-wise        PerNode             FullReplication
LP, QP       Amazon, Google          Column-wise     PerMachine          FullReplication
Figure 14: Plans that DimmWitted Chooses in the
Tradeoff Space for Each Dataset on Machine local2.
primarily due to low-level issues. This supports our claims
that the tradeoff space is interesting for analytic engines and
that no one system has implemented all of them.
Throughput. We compare the throughput of different sys-
tems for an extremely simple task: parallel sums. Our im-
plementation of parallel sum follows our implementation of
other statistical models (with a trivial update function),
and uses all cores on a single machine. Figure 13 shows
the throughput on all systems on different models on one
dataset. We see from Figure 13 that DimmWitted achieves
the highest throughput of all the systems. For parallel sum,
DimmWitted is 1.6× faster than Hogwild!, and we find that
DimmWitted incurs 8× fewer LLC cache misses than Hog-
wild!. Compared with Hogwild!, in which all threads write
to a single copy of the sum result, DimmWitted maintains
one single copy of the sum result per NUMA node, so the
workers on one NUMA node do not invalidate the cache
on another NUMA node. When running on only a single
thread, DimmWitted has the same implementation as Hog-
wild!. Compared with GraphLab and GraphChi, DimmWit-
ted is 20× faster, likely due to the overhead of GraphLab
and GraphChi dynamically scheduling tasks and/or main-
taining the graph structure. To compare DimmWitted with
MLlib, which is written in Scala, we implemented a Scala
version, which is 3× slower than C++; this suggests that
the overhead is not just due to the language. If we do not
count the time that MLlib uses for scheduling and only count
the time of computation, we find that DimmWitted is 15×
faster than MLlib.
4.3 Tradeoffs of DimmWitted
We validate that all the tradeoffs described in this paper
have an impact on the efficiency of DimmWitted. We re-
port on a more modern architecture, local4 with four NUMA
sockets, in this section. We describe how the results change
with different architectures.
4.3.1 Access Method Selection
We validate that different access methods have different
performance, and that no single access method dominates
the others. We run DimmWitted on all statistical models
and compare two strategies, row-wise and column-wise. In
[Figure 15 plots: ratio of execution time per epoch (row-wise/column-wise) for SVM (RCV1) and LP (Amazon) across the five machines, labeled by #cores/socket x #sockets: [e1] 8x2, [e2] 8x2, [l2] 6x2, [l4] 10x4, [l8] 8x8.]
Figure 15: Ratio of Execution Time per Epoch (row-
wise/column-wise) on Different Architectures. A
number larger than 1 means that row-wise is slower.
l2 means local2, e1 means ec2.1, etc.
each experiment, we force DimmWitted to use the corre-
sponding access method, but report the best point for the
other tradeoffs. Figure 12(a) shows the results as we mea-
sure the time it takes to achieve each loss. The more strin-
gent loss requirements (1%) are on the left-hand side. The
horizontal line segments in the graph indicate that a model
may reach, say, 50% as quickly (in epochs) as it reaches
100%.
We see from Figure 12(a) that the difference between row-
wise and column-to-row access could be more than 100×
for different models. For SVM on RCV1, row-wise access
converges at least 4× faster to 10% loss and at least 10×
faster to 100% loss. We observe similar phenomena for
Music; compared with RCV1, column-to-row access con-
verges to 50% loss and 100% loss at a 10× slower rate.
With such datasets, the column-to-row access simply re-
quires more reads and writes. This supports the folk wis-
dom that gradient methods are preferable to coordinate de-
scent methods. On the other hand, for LP, column-wise
access dominates: row-wise access does not converge to 1%
loss within the timeout period for either Amazon or Google.
Column-wise access converges at least 10-100× faster than
row-wise access to 1% loss. We observe that LR is similar
to SVM and QP is similar to LP. Thus, no access method
dominates all the others.
The costs of writing and reading differ; the difference is cap-
tured by the parameter that we called α in Section 3.2. We de-
scribe the impact of this factor on the relative performance
of row- and column-wise strategies. Figure 15 shows the
ratio of the time that each strategy uses (row-wise/column-
wise) for SVM (RCV1) and LP (Amazon). We see that,
as the number of sockets on a machine increases, the ratio
of execution time increases, which means that row-wise be-
comes slower relative to column-wise, i.e., with increasing α.
As the write cost captures the cost of a hardware-resolved
conflict, we see that this constant is likely to grow. Thus,
if next-generation architectures increase in the number of
sockets, the cost parameter α and consequently the impor-
tance of this tradeoff are likely to grow.
Cost-based Optimizer. We observed that, for all datasets,
our cost-based optimizer selects row-wise access for SVM,
LR, and LS, and column-wise access for LP and QP. These
choices are consistent with what we observed in Figure 12.
4.3.2 Model Replication
We validate that there is no single strategy for model repli-
cation that dominates the others. We force DimmWitted
to run strategies in PerMachine, PerNode, and PerCore and
choose other tradeoffs by choosing the plan that achieves
the best result. Figure 12(b) shows the results.
We see from Figure 12(b) that the gap between PerMa-
chine and PerNode could be up to 100×. We first observe
that PerNode dominates PerCore on all datasets. For SVM
on RCV1, PerNode converges 10× faster than PerCore to
50% loss, and for other models and datasets, we observe a
similar phenomenon. This is due to the low statistical effi-
ciency of PerCore, as we discussed in Section 3.3. Although
PerCore eliminates write contention inside one NUMA node,
this write contention is less critical. For large models and
machines with small caches, we also observe that PerCore
could spill the cache.
These graphs show that neither PerMachine nor PerNode
dominates the other across all datasets and statistical mod-
els. For SVM on RCV1, PerNode converges 12× faster than
PerMachine to 50% loss. However, for LP on Amazon, Per-
Machine is at least 14× faster than PerNode to converge to
1% loss. For SVM, PerNode converges faster because it has
5× higher throughput than PerMachine, and for LP, PerN-
ode is slower because PerMachine takes at least 10× fewer
epochs to converge to a small loss. One interesting obser-
vation is that, for LP on Amazon, PerMachine and PerNode
do have comparable performance to converge to 10% loss.
Compared with the 1% loss case, this implies that PerN-
ode’s statistical efficiency decreases as the algorithm tries to
achieve a smaller loss. This is not surprising, as one must
reconcile the PerNode estimates.
We observe that the relative performance of PerMachine
and PerNode depends on (1) the number of sockets used on
each machine and (2) the sparsity of the update.
To validate (1), we measure the time that PerNode and
PerMachine take on SVM (RCV1) to converge to 50%
loss on various architectures, and we report the ratio
(PerMachine/PerNode) in Figure 16. We see that PerNode’s
relative performance improves with the number of sockets.
We attribute this to the increased cost of write contention
in PerMachine.
To validate (2), we generate a series of synthetic datasets,
each of which subsamples the elements in each row of the
Music dataset; Figure 16(b) shows the results. When the
sparsity is 1%, PerMachine outperforms PerNode, as each
update touches only one element of the model; thus, the
write contention in PerMachine is not a bottleneck. As the
sparsity increases (i.e., the update becomes denser), we ob-
serve that PerNode outperforms PerMachine.
4.3.3 Data Replication
We validate the impact of different data replication strate-
gies. We run DimmWitted by fixing data replication strate-
gies to FullReplication or Sharding and choosing the best plan
for each other tradeoff. We measure the execution time for
each strategy to converge to a given loss for SVM on the
same dataset, RCV1. We report the ratio of these two
strategies as FullReplication/Sharding in Figure 17(a). We
see that, for the low-error region (e.g., 0.1%), FullReplication
is 1.8-2.5× faster than Sharding. This is because FullReplica-
tion decreases the skew of data assignment to each worker,
and hence each individual model replica can form a more ac-
curate estimate. For the high-error region (e.g., 100%), we
[Figure 16 plots: ratio of execution time (PerMachine/PerNode) as a function of (a) architecture, labeled by #cores/socket x #sockets ([e1] 8x2, [e2] 8x2, [l2] 6x2, [l4] 10x4, [l8] 8x8), and (b) the sparsity of synthetic datasets derived from Music; annotations mark the regions where PerMachine or PerNode is better.]
Figure 16: The Impact of Different Architectures
and Sparsity on Model Replication. A ratio larger
than 1 means that PerNode converges faster than
PerMachine to 50% loss.
[Figure 17 plots: (a) ratio of execution time (FullReplication/Sharding) as a function of error to the optimal loss (0.1%-100%), annotated with the regions where FullReplication or Sharding is better; (b) throughput in millions of variables per second for Gibbs sampling and neural networks under the classic choice vs. DimmWitted.]
Figure 17: (a) Tradeoffs of Data Replication. A ra-
tio smaller than 1 means that FullReplication is faster.
(b) Performance of Gibbs Sampling and Neural Net-
works Implemented in DimmWitted.
For the high-error region (e.g., 100%), we observe that FullReplication appears to be 2-5× slower than
Sharding. We find that, for 100% loss, both FullReplication
and Sharding converge in a single epoch, and Sharding may
therefore be preferred, as it examines less data to complete
that single epoch. In all of our experiments, FullReplication
is never substantially worse and can be dramatically better.
Thus, if there is available memory, FullReplication seems to be the preferable data replication strategy.
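As a concrete illustration of the two data replication strategies, the sketch below (a hypothetical helper, not DimmWitted's API) builds the per-worker row assignments: Sharding gives each worker a disjoint partition of the rows, while FullReplication gives every worker all rows at the cost of replicating the data.

#include <cstddef>
#include <vector>

// Sharding: each row goes to exactly one worker (round-robin).
std::vector<std::vector<size_t>> shard(size_t n_rows, size_t n_workers) {
  std::vector<std::vector<size_t>> assignment(n_workers);
  for (size_t r = 0; r < n_rows; ++r) assignment[r % n_workers].push_back(r);
  return assignment;
}

// FullReplication: every worker sees every row each epoch, at the cost of
// n_workers times the memory for the data.
std::vector<std::vector<size_t>> replicate(size_t n_rows, size_t n_workers) {
  std::vector<size_t> all(n_rows);
  for (size_t r = 0; r < n_rows; ++r) all[r] = r;
  return std::vector<std::vector<size_t>>(n_workers, all);
}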
5. EXTENSIONS
We briefly describe how to run Gibbs sampling (which
uses a column-to-row access method) and deep neural networks (which use a row access method). Using the same
tradeoffs, we achieve a significant increase in speed over the
classical implementation choices of these algorithms. A more
detailed description is in the full version of this paper.
5.1 Gibbs Sampling
Gibbs sampling is one of the most popular algorithms
to solve statistical inference and learning over probabilis-
tic graphical models [43]. We briefly describe Gibbs sam-
pling over factor graphs and observe that its main step is a
column-to-row access. A factor graph can be thought of as
a bipartite graph of a set of variables and a set of factors.
To run Gibbs sampling, the main operation is to select a single variable and calculate the conditional probability of this variable, which requires fetching all factors that contain this variable and all assignments of the variables connected to these factors. This operation corresponds to the column-to-row access method. Similar to first-order methods, a Hogwild! algorithm for Gibbs sampling was recently established [25]. As shown in Figure 17(b), applying the tech-
nique in DimmWitted to Gibbs sampling achieves 4× the
throughput of samples as the PerMachine strategy.
5.2 Deep Neural Networks
Neural networks are one of the most classical machine learning models [35]; recently, these models have been intensively revisited by adding more layers [19,29]. A deep neural network contains multiple layers, and each layer contains a set of neurons (variables). Neurons are connected only by links across consecutive layers. The value of
one neuron is a function of all the other neurons in the pre-
vious layer and a set of weights. Variables in the last layer
have human labels as training data; the goal of deep neural
network learning is to find the set of weights that maximizes
the likelihood of the human labels. Back-propagation with
stochastic gradient descent is the de facto method of opti-
mizing a deep neural network.
Following LeCun et al. [30], we implement SGD over a
seven-layer neural network with 0.12 billion neurons and 0.8
million parameters using a standard handwriting-recognition
benchmark dataset called MNIST13. Figure 17(b) shows the
number of variables that are processed by DimmWitted
per second. For this application, DimmWitted uses PerN-
ode and FullReplication, and the classical choice made by Le-
Cun is PerMachine and Sharding. As shown in Figure 17(b),
DimmWitted achieves more than an order of magnitude
higher throughput than this classical baseline (while achieving the same quality as reported in that paper).
6. RELATED WORK
We review work in four main areas: statistical analytics,
data mining algorithms, shared-memory multiprocessor op-
timization, and main-memory databases. We include more
extensive related work in the full version of this paper.
Statistical Analytics. There is a trend to integrate sta-
tistical analytics into data processing systems. Database
vendors have recently put out new products in this space, in-
cluding Oracle, Pivotal’s MADlib [23], IBM’s SystemML [21],
and SAP’s HANA. These systems support statistical analyt-
ics in existing data management systems. A key challenge
for statistical analytics is performance.
A handful of data processing frameworks have been de-
veloped in the last few years to support statistical ana-
lytics, including Mahout for Hadoop, MLI for Spark [47],
GraphLab [34], and MADlib for PostgreSQL or Greenplum [23]. Although these systems increase the perfor-
mance of corresponding statistical analytics tasks signifi-
cantly, we observe that each of them implements one point
in DimmWitted’s tradeoff space. DimmWitted is not a
system; our goal is to study this tradeoff space.
Data Mining Algorithms. There is a large body of
data mining literature regarding how to optimize various al-
gorithms to be more architecturally aware [39,56,57]. Zaki
et al. [39,57] study the performance of a range of different al-
gorithms, including association rule mining and decision trees, on shared-memory machines by improving memory locality and data placement at the granularity of cachelines and by decreasing the cost of coherence maintenance between multiple CPU caches. Ghoting et al. [20] optimize the cache behavior
of frequent pattern mining using novel cache-conscious tech-
niques, including spatial and temporal locality, prefetching,
13yann.lecun.com/exdb/mnist/
and tiling. Jin et al. [24] discuss tradeoffs in replication and
locking schemes for K-means, association rule mining, and
neural nets. This work considers the hardware efficiency of
the algorithm, but not statistical efficiency, which is the fo-
cus of DimmWitted. In addition, Jin et al. do not consider
lock-free execution, a key aspect of this paper.
Shared-memory Multiprocessor Optimization. Per-
formance optimization on shared-memory multiprocessor machines is a classical topic. The seminal work of Anderson and Lam [4] and Carr et al. [14] used compiler techniques to improve locality on shared-memory multiprocessor ma-
chines. DimmWitted’s locality group is inspired by Ander-
son and Lam’s discussion of computation decomposition and
data decomposition. These locality groups are the center-
piece of the Legion project [6]. In recent years, there have
been a variety of domain specific languages (DSLs) to help
the user extract parallelism; two examples of these DSLs in-
clude Galois [36,37] and OptiML [49] for Delite [15]. Our
goals are orthogonal: these DSLs require knowledge about
the trade-offs of the hardware, such as those provided by
our study.
Main-memory Databases. The database community
has recognized that multi-socket, large-memory machines
have changed the data processing landscape, and there has
been a flurry of recent work about how to build in-memory
analytics systems [3,5,16,27,31,40,41,52]. Classical tradeoffs
have been revisited on modern architectures to gain signif-
icant improvement: Balkesen et al. [5], Albutiu et al. [3], and Kim et al. [27] study the tradeoff for joins, and Li et al. [31] for data shuffling. This work takes advantage of modern
architectures, e.g., NUMA and SIMD, to increase memory
bandwidth. We study a new tradeoff space for statistical
analytics in which the performance of the system is affected
by both hardware efficiency and statistical efficiency.
7. CONCLUSION
For statistical analytics on main-memory, NUMA-aware
machines, we studied tradeoffs in access methods, model
replication, and data replication. We found that using novel
points in this tradeoff space can have a substantial bene-
fit: our DimmWitted prototype engine can run at least
one popular task at least 100× faster than other competitor
systems. This comparison demonstrates that this tradeoff
space may be interesting for current and next-generation
statistical analytics systems.
Acknowledgments We would like to thank Arun Kumar, Victor
Bittorf, the Delite team, the Advanced Analytics at Oracle, Green-
plum/Pivotal, and Impala’s Cloudera team for sharing their experi-
ences in building analytics systems. We gratefully acknowledge the
support of the Defense Advanced Research Projects Agency (DARPA)
XDATA Program under No. FA8750-12-2-0335 and the DEFT Pro-
gram under No. FA8750-13-2-0039, the National Science Founda-
tion (NSF) CAREER Award under No. IIS-1353606, the Office of
Naval Research (ONR) under awards No. N000141210041 and No.
N000141310129, the Sloan Research Fellowship, American Family In-
surance, Google, and Toshiba. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the au-
thors and do not necessarily reflect the view of DARPA, NSF, ONR,
or the US government.
8. REFERENCES
[1] A. Agarwal, O. Chapelle, M. Dudík, and J. Langford. A reliable
effective terascale linear learning system. ArXiv e-prints, 2011.
[2] A. Ailamaki, D. J. DeWitt, M. D. Hill, and M. Skounakis.
Weaving relations for cache performance. In VLDB, 2001.
[3] M.-C. Albutiu, A. Kemper, and T. Neumann. Massively
parallel sort-merge joins in main memory multi-core database
systems. PVLDB, pages 1064–1075, 2012.
[4] J. M. Anderson and M. S. Lam. Global optimizations for
parallelism and locality on scalable parallel machines. In PLDI,
pages 112–125, 1993.
[5] C. Balkesen and et al. Multi-core, main-memory joins: Sort vs.
hash revisited. PVLDB, pages 85–96, 2013.
[6] M. Bauer, S. Treichler, E. Slaughter, and A. Aiken. Legion:
expressing locality and independence with logical regions. In
SC, page 66, 2012.
[7] N. Bell and M. Garland. Efficient sparse matrix-vector
multiplication on CUDA. Technical report, NVIDIA
Corporation, 2008.
[8] N. Bell and M. Garland. Implementing sparse matrix-vector
multiplication on throughput-oriented processors. In SC, pages
18:1–18:11, 2009.
[9] L. Bergstrom. Measuring NUMA effects with the STREAM
benchmark. ArXiv e-prints, 2011.
[10] C. Boutsidis and et al. Near-optimal coresets for least-squares
regression. IEEE Transactions on Information Theory, 2013.
[11] J. K. Bradley, A. Kyrola, D. Bickson, and C. Guestrin. Parallel
coordinate descent for l1-regularized loss minimization. In
ICML, pages 321–328, 2011.
[12] G. Buehrer and et al. Toward terabyte pattern mining: An
architecture-conscious solution. In PPoPP, pages 2–12, 2007.
[13] G. Buehrer, S. Parthasarathy, and Y.-K. Chen. Adaptive
parallel graph mining for cmp architectures. In ICDM, pages
97–106, 2006.
[14] S. Carr, K. S. McKinley, and C.-W. Tseng. Compiler
optimizations for improving data locality. In ASPLOS, 1994.
[15] H. Chafi, A. K. Sujeeth, K. J. Brown, H. Lee, A. R. Atreya,
and K. Olukotun. A domain-specific approach to heterogeneous
parallelism. In PPOPP, pages 35–46, 2011.
[16] C. Chasseur and J. M. Patel. Design and evaluation of storage
organizations for read-optimized main memory databases.
PVLDB, pages 1474–1485, 2013.
[17] C. T. Chu and et al. Map-reduce for machine learning on
multicore. In NIPS, pages 281–288, 2006.
[18] E. F. D’Azevedo, M. R. Fahey, and R. T. Mills. Vectorized
sparse matrix multiply for compressed row storage format. In
ICCS, pages 99–106, 2005.
[19] J. Dean and et al. Large scale distributed deep networks. In
NIPS, pages 1232–1240, 2012.
[20] A. Ghoting and et al. Cache-conscious frequent pattern mining
on modern and emerging processors. VLDBJ, 2007.
[21] A. Ghoting and et al. SystemML: Declarative machine learning
on MapReduce. In ICDE, pages 231–242, 2011.
[22] Y. He and et al. Rcfile: A fast and space-efficient data
placement structure in mapreduce-based warehouse systems. In
ICDE, pages 1199–1208, 2011.
[23] J. M. Hellerstein and et al. The MADlib analytics library: Or
MAD skills, the SQL. PVLDB, pages 1700–1711, 2012.
[24] R. Jin, G. Yang, and G. Agrawal. Shared memory
parallelization of data mining algorithms: Techniques,
programming interface, and performance. TKDE, 2005.
[25] M. J. Johnson, J. Saunderson, and A. S. Willsky. Analyzing
Hogwild parallel Gaussian Gibbs sampling. In NIPS, 2013.
[26] M.-Y. Kan and H. O. N. Thi. Fast webpage classification using
url features. In CIKM, pages 325–326, 2005.
[27] C. Kim and et al. Sort vs. hash revisited: Fast join
implementation on modern multi-core CPUs. PVLDB, 2009.
[28] A. Kyrola, G. Blelloch, and C. Guestrin. Graphchi: Large-scale
graph computation on just a pc. In OSDI, pages 31–46, 2012.
[29] Q. V. Le and et al. Building high-level features using large
scale unsupervised learning. In ICML, pages 8595–8598, 2012.
[30] Y. LeCun and et al. Gradient-based learning applied to
document recognition. IEEE, pages 2278–2324, 1998.
[31] Y. Li and et al. NUMA-aware algorithms: the case of data
shuffling. In CIDR, 2013.
[32] J. Liu and et al. An asynchronous parallel stochastic coordinate
descent algorithm. ICML, 2014.
[33] Y. Low and et al. Graphlab: A new framework for parallel
machine learning. In UAI, pages 340–349, 2010.
[34] Y. Low and et al. Distributed GraphLab: A framework for
machine learning in the cloud. PVLDB, pages 716–727, 2012.
[35] T. M. Mitchell. Machine Learning. McGraw-Hill, USA, 1997.
[36] D. Nguyen, A. Lenharth, and K. Pingali. A lightweight
infrastructure for graph analytics. In SOSP, 2013.
[37] D. Nguyen, A. Lenharth, and K. Pingali. Deterministic Galois:
On-demand, portable and parameterless. In ASPLOS, 2014.
[38] F. Niu and et al. Hogwild: A lock-free approach to parallelizing
stochastic gradient descent. In NIPS, pages 693–701, 2011.
[39] S. Parthasarathy, M. J. Zaki, M. Ogihara, and W. Li. Parallel
data mining for association rules on shared memory systems.
Knowl. Inf. Syst., pages 1–29, 2001.
[40] L. Qiao and et al. Main-memory scan sharing for multi-core
CPUs. PVLDB, pages 610–621, 2008.
[41] V. Raman and et al. DB2 with BLU acceleration: So much
more than just a column store. PVLDB, pages 1080–1091, 2013.
[42] P. Richtárik and M. Takáč. Parallel coordinate descent
methods for big data optimization. ArXiv e-prints, 2012.
[43] C. P. Robert and G. Casella. Monte Carlo Statistical Methods
(Springer Texts in Statistics). Springer, USA, 2005.
[44] A. Silberschatz, J. L. Peterson, and P. B. Galvin. Operating
System Concepts (3rd Ed.). Addison-Wesley Longman
Publishing Co., Inc., Boston, MA, USA, 1991.
[45] A. Smola and S. Narayanamurthy. An architecture for parallel
topic models. PVLDB, pages 703–710, 2010.
[46] S. Sonnenburg and et al. The SHOGUN machine learning
toolbox. J. Mach. Learn. Res., pages 1799–1802, 2010.
[47] E. Sparks and et al. MLI: An API for distributed machine
learning. In ICDM, pages 1187–1192, 2013.
[48] S. Sridhar and et al. An approximate, efficient LP solver for LP
rounding. In NIPS, pages 2895–2903, 2013.
[49] A. K. Sujeeth and et al. OptiML: An Implicitly Parallel
Domain-Specific Language for Machine Learning. In ICML,
pages 609–616, 2011.
[50] S. Tatikonda and S. Parthasarathy. Mining tree-structured data
on multicore systems. PVLDB, pages 694–705, 2009.
[51] J. Tsitsiklis, D. Bertsekas, and M. Athans. Distributed
asynchronous deterministic and stochastic gradient
optimization algorithms. IEEE Transactions on Automatic
Control, pages 803–812, 1986.
[52] S. Tu and et al. Speedy transactions in multicore in-memory
databases. In SOSP, pages 18–32, 2013.
[53] S. Williams and et al. Optimization of sparse matrix-vector
multiplication on emerging multicore platforms. In SC, pages
38:1–38:12, 2007.
[54] X. Yang, S. Parthasarathy, and P. Sadayappan. Fast sparse
matrix-vector multiplication on gpus: Implications for graph
mining. PVLDB, pages 231–242, 2011.
[55] M. Zaharia and et al. Resilient distributed datasets: A
fault-tolerant abstraction for in-memory cluster computing. In
NSDI, 2012.
[56] M. Zaki, C.-T. Ho, and R. Agrawal. Parallel classification for
data mining on shared-memory multiprocessors. In ICDE,
pages 198–205, 1999.
[57] M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New
algorithms for fast discovery of association rules. In KDD,
pages 283–286, 1997.
[58] M. Zinkevich and et al. Parallelized stochastic gradient descent.
In NIPS, pages 2595–2603, 2010.
APPENDIX
A. IMPLEMENTATION DETAILS
In DimmWitted, we implement optimizations that are
part of scientific computation and analytics systems. While
these optimizations are not new, they are not universally
implemented in analytics systems. We briefly describe each optimization and its impact.
Data and Worker Collocation. We observe that different
strategies of locating data and workers affect the perfor-
mance of DimmWitted. One standard technique is to col-
locate the worker and the data on the same NUMA node.
In this way, the workers on each node pull data from their own DRAM region and do not consume the DRAM bandwidth of other nodes. In DimmWitted, we tried two different placement strategies for data and workers. The first protocol, called OS, relies on the operating system to allocate data and threads for workers. The operating system will usually locate the data on a single NUMA node and assign worker threads to different NUMA nodes using heuristics that are not exposed to the user. The second protocol, called NUMA, evenly distributes worker threads across NUMA nodes and, for each worker, replicates the data on the same NUMA node. We find that, for SVM on RCV1, the NUMA strategy can be up to 2× faster than OS. There are two reasons for this improvement. First, by locating data on the same NUMA node as its workers, we achieve a 1.24× improvement in data-reading throughput. Second, by not delegating worker allocation to the operating system, we obtain a more balanced placement of workers across NUMA nodes.
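A minimal sketch of the NUMA protocol follows, assuming the standard libnuma interface (numa_run_on_node, numa_alloc_onnode); it is our illustration of the idea rather than DimmWitted's actual placement code, and it runs the per-node workers sequentially for brevity.

// Compile with: g++ -O2 collocate.cc -lnuma -lpthread
#include <numa.h>
#include <cstring>
#include <thread>
#include <vector>

// Copy the data partition onto a given node and run the worker there, so the
// worker reads only from its local DRAM region.
void run_on_node(int node, const std::vector<double>& partition) {
  std::thread worker([node, &partition]() {
    numa_run_on_node(node);                       // pin this thread to the node
    size_t bytes = partition.size() * sizeof(double);
    double* local = static_cast<double*>(numa_alloc_onnode(bytes, node));
    std::memcpy(local, partition.data(), bytes);  // node-local replica of the data
    double sum = 0.0;                             // stand-in for the real worker loop
    for (size_t i = 0; i < partition.size(); ++i) sum += local[i];
    (void)sum;
    numa_free(local, bytes);
  });
  worker.join();
}

int main() {
  if (numa_available() < 0) return 1;             // fall back to the OS protocol
  std::vector<double> part(1 << 20, 1.0);
  for (int n = 0; n <= numa_max_node(); ++n) run_on_node(n, part);
}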
Dense and Sparse. For statistical analytics workloads, it
is not uncommon for the data matrix A to be sparse, es-
pecially for applications such as information extraction and
text mining. In DimmWitted, we implement two protocols,
Dense and Sparse, which store the data matrix A as a dense
or sparse matrix, respectively. A Dense storage format has two advantages: (1) when storing a fully dense vector, it requires half the space of a sparse representation, and (2) Dense is able
to leverage hardware SIMD instructions, which allows mul-
tiple floating point operations to be performed in parallel. A
Sparse storage format can use a BLAS-style scatter-gather
to incorporate SIMD, which can improve cache performance
and memory throughput; this approach has the additional overhead of the gather operation. We find that, on a synthetic dataset in which we vary the sparsity from 0.01 to 1.0, Dense can be up to 2× faster than Sparse (for sparsity=1.0), while Sparse can be up to 4× faster than Dense (for sparsity=0.01).
The dense vs. sparse tradeoff might change on newer CPUs with the VGATHERDPD intrinsic, which is designed specifically to speed up the gather operation. However, our current machines do not support this intrinsic, and how to optimize sparse and dense computation kernels is orthogonal to the main goals of this paper.
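The sketch below illustrates the two storage protocols for a single row of A (our simplification; DimmWitted's internal layout may differ): the Dense dot product streams through contiguous memory, while the Sparse one touches only the non-zeros but pays a gather on every access to x.

#include <cstddef>
#include <vector>

struct SparseRow { std::vector<size_t> idx; std::vector<double> val; };

// Dense dot product: contiguous, SIMD-friendly sequential access.
double dot_dense(const std::vector<double>& row, const std::vector<double>& x) {
  double s = 0.0;
  for (size_t j = 0; j < row.size(); ++j) s += row[j] * x[j];
  return s;
}

// Sparse dot product: touches only the non-zeros, but gathers x[idx[k]].
double dot_sparse(const SparseRow& row, const std::vector<double>& x) {
  double s = 0.0;
  for (size_t k = 0; k < row.idx.size(); ++k) s += row.val[k] * x[row.idx[k]];
  return s;
}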
Row-major and Column-major Storage. There are two
well-studied strategies to store a data matrix A: Row-major
and Column-major storage. Not surprisingly, we observed
that choosing an incorrect data storage strategy can cause
a large slowdown. We conduct a simple experiment where
we multiply a matrix and a vector using the row-access method,
where the matrix is stored in column- and row-major order.
We find that Column-major storage can result in 9× more L1 data load misses than Row-major storage, for two reasons: (1) our architectures fetch four doubles in a cacheline, only one of which is useful for the current operation; the prefetcher in Intel machines does not prefetch across page boundaries, so it is unable to pick up significant portions of the strided access; (2) on the first access, the Data Cache Unit (DCU) prefetcher also fetches the next cacheline, compounding the problem, so the computation runs 8× slower.14 Therefore, DimmWitted always stores the dataset in a way that is consistent with the access method, no matter how the input data is stored.
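The following sketch (ours, for illustration) shows why the storage order matters for a row-access method: with row-major storage the inner loop walks memory sequentially, whereas with column-major storage it strides by n_rows doubles per step, which is exactly the access pattern that defeats the cacheline and DCU prefetchers described above.

#include <cstddef>
#include <vector>

// y must be pre-sized to n_rows by the caller in both variants.
void matvec_row_major(const std::vector<double>& A, size_t n_rows, size_t n_cols,
                      const std::vector<double>& x, std::vector<double>& y) {
  for (size_t i = 0; i < n_rows; ++i) {
    double s = 0.0;
    for (size_t j = 0; j < n_cols; ++j) s += A[i * n_cols + j] * x[j];  // sequential
    y[i] = s;
  }
}

void matvec_col_major(const std::vector<double>& A, size_t n_rows, size_t n_cols,
                      const std::vector<double>& x, std::vector<double>& y) {
  for (size_t i = 0; i < n_rows; ++i) {
    double s = 0.0;
    for (size_t j = 0; j < n_cols; ++j) s += A[j * n_rows + i] * x[j];  // strided
    y[i] = s;
  }
}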
B. EXTENDED RELATED WORK
We extend the discussion of related work. We summarize
in Figure 18 a range of related data mining work. A key
difference is that DimmWitted considers both hardware
efficiency and statistical efficiency for statistical analytics
solved by first-order methods.
Data Mining Algorithms. Probably the most related work
is by Jin et al. [24], who consider how to take advantage of
replication and different locking-based schemes with differ-
ent caching behavior and locking granularity to increase the
performance (i.e., hardware efficiency) for a range of data mining tasks, including K-means, frequent pattern mining, and neural networks. Ghoting et al. [20] optimize the cache behavior of frequent pattern mining using novel cache-conscious techniques, including spatial and temporal locality, prefetching, and tiling. Tatikonda et al. [50] consider improving the performance of mining tree-structured data on multicore systems by improving spatial and temporal locality; their technique is a careful study of different granularities and types of task and data chunking. Chu et al. [17] apply MapReduce to a large range of statistical analytics tasks that fit the statistical query model, implement it on a multicore system, and show almost linear speed-up with the number of cores. Zaki et al. [56] study how to speed up classification tasks using decision trees on SMP machines; their technique takes advantage of data parallelism and task parallelism with locking. Buehrer et al. [13] study how to build a distributed system for frequent pattern mining with terabytes of data; their focus is to minimize I/O and communication costs by optimizing the data placement and the number of passes over the dataset. Buehrer et al. [12] study how to implement efficient graph mining algorithms on CMP and SMP machines, with a focus on load balance, memory usage (i.e., size), spatial locality, and the tradeoff between pre-computing and re-computing. Zaki et al. [39,57] study how to implement parallel association rule mining algorithms on shared-memory systems by optimizing memory reference locality and data placement at the granularity of cachelines; this work also considers how to minimize the cost of coherence maintenance between multiple CPU caches. All of these techniques are related and relevant to our work, but none consider optimizing first-order methods or the effect of these optimizations on their efficiency.
14 www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
High Performance Computation. The techniques that we consider in DimmWitted for efficient implementation (Section A) are not new; they are borrowed from a wide range of literature on high performance computing, databases, and systems. Locality is a classical technique: worker and data collocation has been advocated since at least the 1990s [4,14] and is a common systems design principle [44]. The role of dense and sparse computation is well studied by the HPC community, for example, efficient computation kernels for matrix-vector and matrix-matrix multiplication [7,8,18,53]. In this work, we only require dense-
dense and dense-sparse matrix-vector multiplies. There is
recent work on mapping sparse-sparse multiplies to GPUs
and SIMD [54], which is useful for other data mining models
beyond what we consider here.
Row- vs. column-major storage has been intensively studied by the database community for traditional relational databases [2] and for Hadoop [22]. DimmWitted implements these techniques to make sure that our study of hardware efficiency and statistical efficiency reflects the state of modern hardware, and we hope that future developments on these topics can be applied to DimmWitted.
Domain Specific Languages. Domain specific languages
(DSLs) are intended to make it easy for a user to write par-
allel programs by exposing domain-specific patterns. Exam-
ples of such DSLs include Galois [36,37] and OptiML [49]
for Delite [15]. To be effective, DSLs require knowledge about the tradeoffs of the target domain to apply their compilation optimizations, and we hope the insights from DimmWitted can be applied to these DSLs.
Mathematical Optimization. Many statistical analytics tasks
are mathematical optimization problems. Recently, the math-
ematical optimization community has been looking at how
to parallelize optimization problems [32,38,58], for example, Niu et al. [38] for SGD and Shotgun [11] for SCD. A lock-free asynchronous variant was recently established by Liu et al. [32].
C. ADDITIONAL EXPERIMENTS
C.1 More Detailed Tuning Information for Spark
We report details of how we tune our Spark installation
for fair comparison. Figure 19 shows the list of parame-
ters that we used to tune Spark. For each combination of these parameters, we run one experiment that measures the throughput using a parallel sum, and we use the best-performing combination for all other experiments. For each task, we try all combinations of step size and batch size.
Statistical Efficiency: Step Size and Batch Size. We observe that the step size and the batch size of the gradient together have a significant impact on the time that Spark needs to converge. As shown in Figure 19, for each experiment we try 28 different combinations of these settings (7 step sizes and 4 batch sizes). We see that these parameters can change the time to converge to the same loss on the same dataset by more than 100×! Therefore, as shown in Figure 19, we try a large range of these two parameters and pick the best combination to report.
[Figure 18: A Taxonomy of Related Work. The taxonomy classifies Jin et al. [24], Ghoting et al. [20], Tatikonda et al. [50], Chu et al. [17], Zaki et al. [56], Buehrer et al. [13], Buehrer et al. [12], Zaki et al. [39,57], Tsitsiklis et al. [51], Niu et al. [38], Bradley et al. [11], GraphChi [28], GraphLab [33,34], MLlib [47], and DimmWitted by target architecture (multicore, NUMA (SMP), distributed), target application (data mining, graph mining, gradient-based), and target efficiency (hardware, statistical).]
Figure 19: The Set of Parameters We Tried for Tuning Spark.
Statistical Efficiency: Step size (100, 10, 1, 0.1, 0.01, 0.001, 0.0001); Batch size (100%, 50%, 10%, 1%).
Hardware Efficiency: Data Replication (1, 2, 3); Serialization (True, False); Storage Level (MEMORY_ONLY); Compression (True, False); locality.wait (1, 100, 1000, 3000, 10000); SPARK_MEM (48g, 24g, 1g); numactl (localalloc, interleave, NA).
Sources of Overhead in Spark. Spark has overhead for scheduling tasks and providing fault tolerance, both of which are features that DimmWitted does not support. To
make our comparison as fair as possible, we conduct the fol-
lowing experiments to understand how scheduling and fault
tolerance impact our claims.
We implement our own version of the batch-gradient descent algorithm in DimmWitted by strictly following MLlib's algorithm in C++. On Forest, we first observe that our own batch-gradient implementation uses a similar number of epochs (within 5%) to converge to 1% loss as MLlib, given the same step size and batch size. Second, for each epoch, our batch-gradient implementation is 3-7× faster across different architectures; this implies that MLlib does have overhead compared with DimmWitted's framework. However, our own batch-gradient implementation is still 20-39× slower than DimmWitted across different architectures.
We break down the execution time into the number of
epochs that each system needs to converge and the time
that MLlib used for scheduling and computation. In par-
ticular, we use the Forest dataset as an example. On this
dataset, DimmWitted uses 1 epoch to converge to 1% loss,
while both MLlib and our own C++ implementation use 63
and 64 epochs, respectively. MLlib uses 2.7 seconds for these 64 epochs, of which 0.9 seconds are spent on scheduling and the other 1.8 seconds on enumerating each example and calculating the gradient.15 The difference in the number of epochs to converge implies that the difference between MLlib and DimmWitted is not caused by low-level implementations, but rather by the fact that MLlib implements only a subset of points in DimmWitted's tradeoff space.
15 We observe similar breakdowns on other datasets except the smallest dataset, Reuters. On this dataset, the time used for scheduling is up to 25× the computation time.
[Figure 20: Comparison with Delite using LR (Music) on local2: speedup vs. number of threads (1-11) for PerCore, PerNode, PerMachine, and Delite, with a linear-speedup reference line.]
Hardware Efficiency. We summarize the impact of the parameters on the throughput of MLlib. For each of the 540 combinations of the seven parameters related to hardware efficiency, we run the parallel sum to measure the throughput. We find, not surprisingly, that the parameter SPARK_MEM has a significant impact on the throughput: on Music, when this parameter is set to 48GB, Spark achieves a 7× speed-up over 1GB. This is not surprising because this parameter sets the amount of RAM that Spark can use. We also find that, with SPARK_MEM set to 48GB, all other parameters make less than a 50% difference relative to each other. Therefore, in our experiments we always set SPARK_MEM to 48GB and set the other parameters to the setting that achieves the highest throughput on the corresponding dataset.
C.2 Comparison with Delite
Recently, there has been a trend of using domain specific languages (DSLs) to help users write parallel programs more easily. We conduct a simple experiment with one popular DSL, namely Delite [15], to illustrate that the tradeoffs we study in this paper have the potential to help these DSLs achieve higher performance and quality.
We use the official implementation of logistic regression
in Delite [15] and run both DimmWitted and Delite on the
Music dataset using local2.
[Figure 21: Scalability of DimmWitted using ClueWeb 2009 on local2: time per epoch (seconds) vs. scale (1x = 0.5B rows, 4B NNZs, 49GB).]
We make a best effort to tune the locality of Delite by trying different settings for numactl.
We vary the number of threads that each program can use
and plot the speed-up curve as shown in Figure 20.
First, we see from Figure 20 that the different model replication strategies in DimmWitted have different speed-up behaviors. Not surprisingly, PerCore speeds up more linearly than PerNode and PerMachine. These observations are consistent with the hardware efficiency tradeoffs that we discussed in this paper. More interestingly, we see that Delite does not speed up beyond a single socket (i.e., 6 cores). Therefore, by applying the PerNode strategy of DimmWitted to Delite, we hope to improve the speed-up behavior of Delite, as illustrated in Figure 20.
C.3 Scalability Experiments
We validate the scalability of DimmWitted by testing it on a larger dataset.
Dataset. We follow Kan et al. [26] to create a dataset that contains 500 million examples, 100K features per example, and 4 billion non-zero elements by using a Web-scale dataset called ClueWeb.16 ClueWeb contains 500 million Web pages, and the approach of Kan et al. tries to predict the PageRank score of each Web page from features of its URL using a least squares model.
Result. To validate the scalability of DimmWitted, we
randomly subsample 1%, 10%, and 50% of the examples to create smaller datasets. We run DimmWitted using the rules of thumb in Figure 14 and measure the time that DimmWitted takes for each epoch. Figure 21 shows the result. We see that, on this dataset, the time that DimmWitted needs to finish a single epoch grows almost linearly with the number of examples. We believe that this is because, for all sub-sampled datasets and the whole dataset, the model (100K weights) fits in the last-level cache (LLC).
C.4 Importance Sampling as a Data Replica-
tion Strategy
The Sharding and FullReplication data replication schemes that we discussed in Section 3 assume that all data tuples are equally important. However, in statistical analytics, it is not uncommon that some data tuples are more important than others. One example is the linear leverage score.
16http://lemurproject.org/clueweb09/
[Figure 22: Importance Sampling on Music (local2): time (seconds) to reach a given error to the optimal loss for Importance0.1, Importance0.01, Sharding, and FullReplication.]
Example C.1 (Linear Leverage Score [10]). For $A \in \mathbb{R}^{N \times d}$ and $b \in \mathbb{R}^N$, define $s(i) = a_i^T (A^T A)^{-1} a_i$, where $a_i$ is the $i$-th row of $A$. Let $\tilde{A}$ and $\tilde{b}$ be the result of sampling $m$ rows, where row $i$ is selected with probability proportional to $s(i)$. Then, for all $x \in \mathbb{R}^d$, we have
\[
\Pr\left[\,\left|\,\|Ax - b\|_2^2 - \tfrac{N}{m}\|\tilde{A}x - \tilde{b}\|_2^2\,\right| < \varepsilon \|Ax - b\|_2^2\,\right] > \tfrac{1}{2},
\]
so long as $m > 2\varepsilon^{-2} d \log d$.
For general loss functions (e.g., logistic loss), the linear leverage score calculated in the same way as above does not necessarily satisfy the property of approximating the loss. However, we can still use this score as a heuristic to decide the relative importance of data examples. In DimmWitted, we consider the following protocol, which we call Importance. Given a dataset $A$, we calculate the leverage score $s(i)$ of the $i$-th row as $a_i^T (A^T A)^{-1} a_i$. The user specifies an acceptable error tolerance $\varepsilon$, and for each epoch, DimmWitted samples $2\varepsilon^{-2} d \log d$ examples for each worker with probability proportional to the leverage scores. This procedure is implemented in DimmWitted as one data replication strategy.
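A minimal sketch of the Importance protocol follows, assuming the Eigen library for the dense linear algebra; it is our illustration under those assumptions, not DimmWitted's actual implementation.

#include <Eigen/Dense>
#include <cmath>
#include <random>
#include <vector>

// Draw roughly 2*eps^-2*d*log(d) row indices with probability proportional to
// the leverage scores s(i) = a_i^T (A^T A)^{-1} a_i.
std::vector<int> importance_sample(const Eigen::MatrixXd& A, double eps,
                                   std::mt19937& gen) {
  const int N = A.rows(), d = A.cols();
  Eigen::MatrixXd inv = (A.transpose() * A).inverse();
  std::vector<double> score(N);
  for (int i = 0; i < N; ++i) {
    Eigen::VectorXd ai = A.row(i).transpose();
    score[i] = ai.dot(inv * ai);                    // leverage score of row i
  }
  std::discrete_distribution<int> pick(score.begin(), score.end());
  int m = static_cast<int>(2.0 / (eps * eps) * d * std::log(d));
  std::vector<int> rows(m);
  for (int k = 0; k < m; ++k) rows[k] = pick(gen);  // sample with replacement
  return rows;
}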
Experimental Results. We run the above importance sam-
pling on the same data set as Section 4, and validate that
on some datasets the importance sampling scheme can im-
prove the time that DimmWitted needs to converge to a
given loss. Figure 22 shows the results of comparing differ-
ent data replication strategies on Music running on local2,
where Importance0.1 and Importance0.01 use 0.1 and 0.01 as the error tolerance ε, respectively.
We see that, on Music, Importance0.1 is 3× faster than FullReplication to reach 10% loss. This is caused by the fact that Importance0.1 processes only 10% of the data compared with FullReplication. However, Importance0.01 is slower than FullReplication. This is because, when the error tolerance is lower, the number of samples one needs to draw for each epoch increases. For Music, Importance0.01 processes the same number of tuples as FullReplication.
D. DETAILED DESCRIPTION OF EXTEN-
SIONS
We describe each extension mentioned in Section 5 in more detail.
D.1 Gibbs Sampling
Figure 23(a) illustrates a factor graph, which is a bipartite
graph that contains a set of variables, a set of factors, and a set of links between variables and factors.
[Figure 23: Illustration of a Factor Graph and Deep Neural Networks in DimmWitted. (a) and (b) show a factor graph (variables, factors, and a current variable) and how DimmWitted represents it as a column-to-row access. (c) shows a deep neural network with layers 1 through n; the de facto approach to solving it is to run SGD for each layer in a round-robin fashion.]
To run Gibbs sampling over a factor graph, one processes one variable at a time to calculate the conditional probability for the different assignments of this variable. This involves fetching all connected factors and all current assignments of the variables that are connected to these factors. Gibbs sampling then updates the current variable's assignment by randomly sampling a value according to the conditional probability and proceeds to the next random variable. Similar to first-order methods, recent theory establishes a lock-free protocol that samples multiple variables at the same time [25]. We also know from classical statistical theory [43] that one can maintain multiple copies of the same factor graph and aggregate the samples produced on each factor graph at the end of execution.
Figure 23(b) illustrates how DimmWitted models Gibbs
sampling as column-to-row access. We see that each row corresponds to one factor, each column corresponds to one variable, and the non-zero elements in the matrix correspond to the links in the factor graph. To process one variable, DimmWitted fetches one column of the matrix to get the set of factors, and other columns to get the set of variables that connect to the same factors.
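The sketch below shows one Gibbs step under this column-to-row access (our simplification; the two-state variables and the toy "all-equal" factors are illustrative assumptions, not DimmWitted's data structures): the column of the current variable yields its factors, and the rows of those factors yield the neighboring assignments needed for the conditional.

#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

struct FactorGraph {
  // Column v: the factors that contain variable v.
  std::vector<std::vector<size_t>> factors_of_var;
  // Row f: the variables connected by factor f, plus its weight.
  std::vector<std::vector<size_t>> vars_of_factor;
  std::vector<double> weight;
  std::vector<int> assignment;   // current 0/1 assignment of every variable
};

// Resample variable v from its conditional distribution.
void gibbs_step(FactorGraph& g, size_t v, std::mt19937& gen) {
  double energy[2] = {0.0, 0.0};
  for (size_t f : g.factors_of_var[v]) {          // column access: factors of v
    for (int value = 0; value < 2; ++value) {
      int agree = 1;
      for (size_t u : g.vars_of_factor[f])        // row access: vars of factor f
        if (u != v && g.assignment[u] != value) agree = 0;
      energy[value] += g.weight[f] * agree;       // toy "all-equal" factor
    }
  }
  double p1 = std::exp(energy[1]) / (std::exp(energy[0]) + std::exp(energy[1]));
  std::bernoulli_distribution coin(p1);
  g.assignment[v] = coin(gen) ? 1 : 0;
}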
In DimmWitted, we implement the PerNode strategy for
Gibbs sampling by running one independent chain for each
NUMA node. At the end of sampling, we can use all samples
generated from each NUMA node for estimation. Therefore,
we use throughput, i.e., number of samples generated per
second as the measurement for performance in Section 5.17
In DimmWitted, we implement Gibbs sampling for general factor graphs and compare it with a hand-coded implementation of topic modeling in GraphLab. We run
all systems on local2 with 100K documents and 20 topics.
We find that on local2, DimmWitted’s implementation is
3.7× faster than GraphLab’s implementation without any
application-specific optimization.
D.2 Deep Neural Networks
Figure 23(c) illustrates a Deep Neural Network as we de-
scribed in Section 5. Stochastic gradient descent is the de facto algorithm for training a neural network [30], with one twist that we discuss below. As shown in Figure 23(c),
17There has been a long historical discussion about the trade-
off between a single deep chain and multiple independent
chains in statistics. This tradeoff is outside the scope of this paper.
a deep neural network usually contains multiple layers; the SGD algorithm is run within each layer, and all layers are processed in a round-robin fashion. Therefore, in DimmWitted, we use the same SGD code path inside each layer, one layer at a time, and invoke this code path multiple times to process the different layers.
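A minimal sketch of this round-robin reuse of the SGD code path is shown below (our illustration; the Layer type and the way its gradient is produced are placeholders, not DimmWitted's actual interfaces).

#include <cstddef>
#include <vector>

struct Layer {
  std::vector<double> weights;
  std::vector<double> grad;     // filled by back-propagation (omitted here)
  // The same SGD code path, reused unchanged for every layer.
  void sgd_pass(double step) {
    for (size_t j = 0; j < weights.size(); ++j) weights[j] -= step * grad[j];
  }
};

// Process all layers in a round-robin fashion, one layer at a time per epoch.
void train(std::vector<Layer>& net, double step, int n_epochs) {
  for (int epoch = 0; epoch < n_epochs; ++epoch)
    for (Layer& layer : net)
      layer.sgd_pass(step);
}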