Parallel Computing for Data Science
With Examples in R, C++ and CUDA

The R Series

Parallel Computing for Data Science: With Examples in R, C++ and CUDA is
one of the first parallel computing books to concentrate exclusively on parallel data
structures, algorithms, software tools, and applications in data science. It includes
examples not only from the classic “n observations, p variables” matrix format but
also from time series, network graph models, and numerous other structures common
in data science. The book also discusses software packages that span more than
one type of hardware and can be used from more than one type of programming
language.
Features
• Focuses on applications in the data sciences, including statistics, data mining,
and machine learning
• Discusses structures common in data science, such as network data models
• Emphasizes general principles throughout, such as avoiding factors that
reduce the speed of parallel programs
• Covers the main types of computing platforms: multicore, cluster, and
graphics processing unit (GPU)
• Explains how the Thrust package eases the programming of multicore
machines and GPUs and enables the same code to be used on either platform
• Provides code for the examples on the author’s web page
Series Editors
John M. Chambers, Department of Statistics, Stanford University, Stanford, California, USA
Torsten Hothorn, Division of Biostatistics, University of Zurich, Switzerland

Norman Matloff
University of California, Davis, USA
Preface
Thank you for your interest in this book. I’ve very much enjoyed writing
it, and I hope it proves very useful to you. To set the stage,
there are a few general points of information I wish to present.
Goals:
This book hopefully will live up to its title—Parallel Computing for Data
Science. Unlike almost every other book I’m aware of on parallel comput-
ing, you will not find a single example here dealing with solving partial
differential equations and other applications to physics. This book really is
devoted to applications in data science—whether you define that term to
be statistics, data mining, machine learning, pattern recognition, analytics,
or whatever.1
This means more than simply that the book’s examples involve applications
chosen from the data science field. It also means that the data structures,
algorithms and so on reflect this orientation. This will range from the classic
“n observations, p variables” matrix format to time series to network graph
models to various other structures common in data science.
While the book is chock full of examples, it aims to emphasize general
principles. Accordingly, after presenting an introductory code example in
Chapter 1 (general principles are meaningless without real examples to tie
them to), I devote Chapter 2 not so much to how to write parallel code as
to explaining the general factors that can rob a parallel program of speed.
This is a crucial chapter, referred to constantly in the succeeding chapters.
Indeed, one can regard the entire book as addressing the plight of the poor
guy described at the beginning of Chapter 2.
1 Ironically, I myself am not a big fan of the term data science, but it does encompass
these various views, and highlight the point that this book is about data, not physics.
The use of the word computing in the book’s title reflects the fact that the
book’s main focus is indeed computation. This is in contrast to parallel
data processing, such as in distributed file storage exemplified by that of
Hadoop, though a chapter is devoted to such settings.
The main types of computing platforms covered are multicore, cluster and
GPU. In addition, there is considerable coverage of Thrust, a wonderful tool
that greatly eases the programming of multicore machines and GPUs—and
simultaneously, in the sense that the same code is usable on either platform!
I believe readers will find this material especially valuable.
One thing this book is not, is a user manual. Though it uses specific tools
throughout, such as R’s parallel and Rmpi packages, OpenMP, CUDA
and so on, this is for the sake of concreteness. The book will give the
reader a solid introduction to these tools, but is not a compendium of
all the different function arguments, environment options and so on. The
intent is that the reader, upon completing this book, will be well-poised
to learn more about these tools, and most importantly, to write effective
parallel code in various other languages, be it Python, Julia or whatever.
Necessary background:
If you consider yourself reasonably adept in using R, you should find most
of this book quite accessible. A few sections do use C/C++, and prior
background in those languages is needed if you wish to read those sections
in full detail. However, even without knowing C/C++ well, you should still
find that material fairly readable, and of considerable value. Appendices
summarizing R for C programmers, and introducing C to R people, are
included.
You should be familiar with basic math operations with matrices, mainly
multiplication and addition. Occasionally some more advanced operations
will be used, such as inversion (and its cousins, such as QR methods) and
diagonalization, which are presented in Appendix A.
Machines:
Except when stated otherwise, all timing examples in this book were run
on a 16-core Ubuntu machine, with hyperthreading degree 2.
Much gratitude goes to the internal reviewers, David Giles, Mike Hannon
and Michael Kane. I am especially grateful to my old friend Mike Hannon,
who provided amazingly detailed feedback. Thanks go also to John Kimmel,
Executive Editor for Statistics at Chapman and Hall, who has been highly
supportive since the beginning.
My wife Gamis and my daughter Laura both have a contagious sense of
humor and zest for life that greatly improve everything I do.
Author’s Biography
Dr. Matloff was born in Los Angeles, and grew up in East Los Angeles and
the San Gabriel Valley. He has a PhD in pure mathematics from UCLA,
specializing in probability theory and statistics. He has published numerous
papers in computer science and statistics, with current research interests in
parallel processing, statistical computing, and regression methodology. He
is on the editorial board of the Journal of Statistical Software.
Professor Matloff is a former appointed member of IFIP Working Group
11.3, an international committee concerned with database software security,
established under UNESCO. He was a founding member of the UC Davis
Department of Statistics, and participated in the formation of the UCD
Computer Science Department as well. He is a recipient of the campus-wide
Distinguished Teaching Award and Distinguished Public Service Award at
UC Davis.
Chapter 1
Introduction to Parallel Processing in R
In settings in which you really need to maximize execution speed, you may
wish to resort to writing in a compiled language such as C/C++, which we
will indeed do occasionally in this book. However, the extra speed that may
be attained via the compiled language typically is just not worth the effort.
In other words, we have an analogy to the Pretty Good Privacy security
system:
In many cases just “pretty fast” is quite good enough. The extra
speed we might attain by moving from R to C/C++ does not
justify the possibly much longer time needed to write, debug
and maintain code at that level.
This of course is the reason for the popularity of the various parallel R pack-
ages. They fulfill a desire to code parallel operations yet still stay in R. For
example, the Rmpi package provides an R connection to the Message Pass-
ing Interface (MPI), a very widely used parallel processing system in which
applications are normally written in C/C++ or FORTRAN.1 Rmpi gives
analysts the opportunity to take advantage of MPI while staying within R.
But as an alternative to Rmpi that also uses MPI, R users could write their
application code in C/C++, calling MPI functions, and then interface R to
the resulting C/C++ function. But in doing so, they would be forgoing
the coding convenience and rich packages available in R. So, most opt for
using MPI only via the Rmpi interface, not directly in C/C++.
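To give a flavor of this, here is a minimal sketch of an Rmpi session. It assumes the Rmpi package and an underlying MPI installation are present, and uses only a few of Rmpi's basic calls; it is an illustration, not code from this book.

library(Rmpi)
mpi.spawn.Rslaves(nslaves = 2)     # launch 2 worker R processes under MPI
mpi.remote.exec(sum(1:1000000))    # have every worker evaluate this expression
mpi.close.Rslaves()                # shut the workers down
mpi.quit()                         # exit MPI (and this R session)

Everything here is done from the R prompt; the C-level MPI details stay hidden.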
The aim of this book is to provide a general treatment of parallel processing
in data science. The fact that R provides a rich set of powerful, high-level
data and statistical operations means that examples in R will be shorter
and simpler than they would typically be in other languages. This enables
the reader to truly focus on the parallel computation methods themselves,
rather than be distracted by having to wade through the details of, say,
intricate nested loops. Not only is this useful from a learning point of view,
but also it will make it easy to adapt the code and techniques presented
here to other languages, such as Python or Julia.
1.1.2 “R+X”
1 For brevity, I’ll usually not mention FORTRAN, as it is not used as much in data
science.
1.2 A Note on Machines
Three types of machines will be used for illustration in this book: multicore
systems, clusters and graphics processing units (GPUs). As noted in the
preface, I am not targeting the book to those fortunate few who have access
to supercomputers (though the methods presented here do apply to such
machines). Instead, it is assumed that most readers will have access to
more modest systems, say multicore with 4-16 cores, or clusters with nodes
numbering in the dozens, or a single GPU that may not be the absolute
latest model.
Most of the multicore examples in this book were run on a 32-core system
on which I seldom used all the cores (as I was a guest user). The timing
experiments usually start with a small number of cores, say 2 or 4.
As to clusters, my coverage of “message-passing” software was typically run
on the multicore system, though occasionally on a real cluster to demon-
strate the effects of overhead.
The GPU examples here were typically run on modest hardware.
Again, the same methods as used here do apply to the more formidable
systems, such as the behemoth supercomputers with multiple GPUs and so
on. Tweaking is typically needed for such systems, but this is beyond the
scope of this book.
Python. The GIL is still there, but Python now has ways around it.
Will the Intel Xeon Phi overtake GPUs? And will languages keep pace with the
advances in hardware?
For this reason, software packages that span more than one type of hard-
ware, and can be used from more than one type of programming language,
have great appeal. The Thrust package, the topic of Chapter 7 and some
of the later sections, epitomizes this notion. The same Thrust code can
run on either multicore or GPU platforms, and since it is C++ based, it is
accessible from R or most other languages. In short, Thrust allows us to
“hedge our bets” when we develop parallel code.
Message-passing software systems, such as R’s snow, Rmpi and pbdR,
have much the same advantage, as they can run on either multicore ma-
chines or clusters.
sum = 0.0
for i = 1,...,n-1
   for j = i+1,...,n
      sum = sum + g(obs.i, obs.j)
With nested loops like this, you’ll find in this book that it is generally easier
to parallelize the outer loop rather than the inner one. If we have a dual
core machine, for instance, we could assign one core to handle some values
of i in the above code and the other core to handle the rest. Ultimately
we’ll do that here, but let’s first take a step back and think about this
setting.
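As a concrete serial starting point, here is one way the pseudocode above might be rendered in R. This is a sketch only; the names pairsum() and g() are illustrative, not taken from the book's code.

# Sum g() over all distinct pairs of observations; obs is an n x p matrix,
# one observation per row, and g() takes two such rows as its arguments.
pairsum <- function(obs, g) {
   n <- nrow(obs)
   tot <- 0.0
   for (i in 1:(n-1)) {
      for (j in (i+1):n) {
         tot <- tot + g(obs[i, ], obs[j, ])
      }
   }
   tot
}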
We generate random 1s and 0s, and call the function. Here’s a sample run:
> sim(500,500)
    user  system elapsed
 106.111   0.030 106.659
than having to translate the R repeatedly for each line of the loop, at each
iteration.
In the code for mutoutser() above, the inner loops can be rewritten as a
matrix product, as we will see below, and that will turn out to eliminate
two of our loops.3
To see the matrix formulation, suppose we have this matrix:
0 1 0 0 1
1 0 0 1 1
0 1 0 1 0        (1.1)
1 1 1 0 0
1 1 1 0 1
Consider, say, the case i = 2 and j = 4. The number of mutual outlinks of
websites 2 and 4 is then
1·1 + 0·1 + 0·1 + 1·0 + 1·0 = 1        (1.2)
But that is merely the inner product of rows i and j of the matrix! In other
words, it’s
links[i,] %*% links[j,]
But there’s more. Again consider the case in which i is 2. The same
reasoning as above shows that the entire computation for all j and k, i.e.,
the two innermost loops, can be written as
              1
0 1 0 1 0     0     1
1 1 1 0 0  ×  0  =  1        (1.3)
1 1 1 0 1     1     2
              1
The matrix on the left is the portion of our original matrix below row 2,
and the vector on the right is row 2 itself.
Those numbers, 1, 1 and 2, are the results we would get from running the
code with i equal to 2 and j equal to 3, 4 and 5. (Check this yourself to get
a better grasp of how this works.)
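A quick way to do that check in R (the matrix a below is simply the small example (1.1)):

a <- rbind(c(0,1,0,0,1),
           c(1,0,0,1,1),
           c(0,1,0,1,0),
           c(1,1,1,0,0),
           c(1,1,1,0,1))
a[2, ] %*% a[4, ]     # inner product of rows 2 and 4, giving 1, as in (1.2)
a[3:5, ] %*% a[2, ]   # rows below row 2 times row 2, giving 1, 1, 2, as in (1.3)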
3 In R, a matrix is a special case of a vector, so we are indeed using vectorization here,
as promised.
> sim1(500,500)
   user  system elapsed
  1.443   0.044   1.496
Wonderful! Nevertheless, that is still only for the very small 500-site case.
Let’s run it for 2000:
> sim1(2000,2000)
   user  system elapsed
 92.378   1.002  94.071
The most popular tools for parallel R are snow, multicore, foreach and
Rmpi. Since the first two of these are now part of the R core in a package
named parallel, it is easiest to use one of them for our introductory mate-
rial in this chapter, rather than having the user install another package at
this point.
Our set of choices is further narrowed by the fact that multicore runs
only on Unix-family (e.g., Linux and Mac) platforms, not Windows. Ac-
cordingly, at this early point in the book, we will focus on snow.
As noted, an old contributed package for R, snow, was later made part of
the R base, in the latter’s parallel package (with slight modifications). We
will make frequent use of this part of that package, so we need a short name
for it. “The portion of parallel adapted from snow” would be anything
but short. So, we’ll just call it snow.
Here is the overview of how snow operates: All four of the popular packages
cited above, including snow, typically employ a scatter/gather paradigm:
We have multiple instances of R running at the same time, either on several
machines in a cluster, or on a multicore machine. We’ll refer to one of
the instances as the manager, with the rest being workers. The parallel
computation then proceeds as follows:
• The manager partitions the work into chunks and parcels them out to the workers (the scatter phase).
• Each worker processes its chunk.
• The manager collects the workers’ results (the gather phase) and combines them into the final answer.
1.4.5.1 Code
library(parallel)

# set up cluster of nworkers workers on
# multicore machine
initmc <- function(nworkers) {
   makeCluster(nworkers)
}

# set up a cluster on machines specified,
# one worker per machine
initcls <- function(workers) {
   makeCluster(spec=workers)
}
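A possible usage pattern, once the two functions above have been sourced (a sketch only; stopCluster() is the standard parallel-package call for shutting a cluster down):

library(parallel)
cl4 <- initmc(4)    # 4 workers on this multicore machine
# ... parallel computations using cl4 go here ...
stopCluster(cl4)    # release the worker processes when done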
1.4.5.2 Timings
Before explaining how this code works, let’s see if it yields a speed improve-
ment. I ran on the same machine used earlier, but in this case with two
workers, i.e., on two cores. Here are the results:
> cl2 <- initmc(2)
> snowsim(2000,2000,cl2)
   user  system elapsed
  0.237   0.047  80.348
So, hyperthreading did yield further improvement, raising our speedup factor
to 1.34. Note, though, that there is now even further disparity between
this and the 4.00 speedup we might hope to get with four workers. As noted, these
issues will arise frequently in this book; the sources of overhead will be
discussed, and remedies presented.
There is another reason why our speedups above are not so impressive: Our
code is fundamentally unfair—it makes some workers do more work than
others. This is known as a load balancing problem, one of the central issues
in the parallel processing field. We’ll address this in a refined version in
Chapter 3.
So, how does all this work? Let’s dissect the code.
Even though snow and multicore are now part of R via the parallel
package, the package is not automatically loaded. So we need to take care
of this first, by placing a line
library(parallel)
at the top of our source file (if all these functions are in one file), or simply
execute the above library() call on the command line.
Or, we can insert a line
require(parallel)
will be executed for each worker, with ichunk being different for each
worker.
Our function mutoutpar() wraps the overall process, dividing the i
values into chunks and calling doichunk() on each one. It thus parallelizes
the outer loop of the serial code.
mutoutpar <- function(cls, lnks) {
   nr <- nrow(lnks)
   clusterExport(cls, "lnks")
   ichunks <- 1:(nr-1)
   tots <- clusterApply(cls, ichunks, doichunk)
   Reduce(sum, tots) / nr
}
To get an overview of that function, note that the main actions consist of
the following calls to snow and R functions:
• clusterExport(), to ship the data matrix lnks to the workers;
• clusterApply(), to parcel the chunks of i values out to the workers, each of which then calls doichunk();
• Reduce(), to combine the results returned by the workers.
Here are the details: Even before calling mutoutpar(), we set up our snow
cluster:
makeCluster(nworkers)
I create a 4-node snow cluster (for 4 workers) and save its information in
an R object cls (of class “cluster”), which will be used in my subsequent
calls to snow functions.
There is one component in cls for each worker. So after the above call,
running
length(cls)
prints out 4.
We can also run snow on a physical cluster of machines, i.e., several ma-
chines connected via a network. Calling the above function initcls() ar-
ranges this. In my department, for example, we have student lab machines
named pc1, pc2 and so on, so for instance
cl2 <- initcls(c("pc28","pc29"))
sets up a two-worker cluster on those two machines.
Now consider the call clusterExport(cls, "lnks") in mutoutpar().
This sends our data matrix lnks to all the workers in cls.
An important point to note is that clusterExport() by default requires
the transmitted data to be global in the manager’s work space. It is then
placed in the global work space of each worker (without any alternative
option offered). To meet this requirement, I made lnks global back when I
created this data in snowsim(), using the superassignment operator <<−:
lnks <<- matrix(sample(0:1, (nr*nc), replace=TRUE),
   nrow=nr)
Here are the details of the clusterApply() call. Let’s refer to that second
argument of clusterApply(), in this case ichunks, as the “work assign-
ment” argument, as it parcels out work to workers.
To keep things simple in this introductory example, we have just a single i
value for each “chunk”:
ichunks <- 1:(nr-1)
tots <- clusterApply(cls, ichunks, doichunk)
This would have been fine if tots had been a vector, but it’s a list, hence
our use of R’s Reduce() function. Here Reduce() will apply the sum()
function to each element of the list tots, yielding the grand sum as desired.
You’ll find use of Reduce() common with functions in packages like snow,
which typically return values in lists.
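A tiny illustration of this pattern, with made-up per-worker totals:

tots <- list(12.5, 3.0, 7.5)   # the kind of list clusterApply() returns
Reduce(sum, tots)              # 23, the grand total
sum(unlist(tots))              # an equivalent idiom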
This is a good time to point out that many parallel R packages require the
Chapter 2
“Why Is My Program So Slow?”: Obstacles to Speed
There are a number of issues of this sort that occur generally enough to be
collected into this chapter, as an “early warning” of issues that can arise.
This is just an overview, with details coming in subsequent chapters, but
being forewarned of the problems will make it easier to recognize them as
they are encountered.
Scorecards, scorecards! You can’t tell the players without the scorecards!—
old chant of scorecard vendors at baseball games
The foot bone connected to the ankle bone, The ankle bone connected to the
shin bone...—from the children’s song, “Dem Bones”
The reason our unfortunate analyst in the preceding section was surprised
that his code ran more slowly on the parallel machine was almost certainly
due to a lack of understanding of the underlying hardware and systems
software. While one certainly need not understand the hardware on an
electronics level, a basic knowledge of “what is connected to what” is es-
sential.
In this section, we’ll present overviews of the major hardware issues, and
of the two parallel hardware technologies the reader is most likely to
encounter: shared-memory multicore machines and networked clusters. Both
are now essentially household items: an ordinary PC is a multicore machine,
and even two computers on a home network form a small cluster.
I emphasize the “household item” aspect above, to stress that these are not
esoteric architectures, though of course scale can vary widely from what
you have at home to far more sophisticated and expensive systems, with
quite a bit in between.
The terms shared-memory and networked above give clues as to the ob-
stacles to computational speed that arise, which are key. So, we will first
discuss the high-level workings of these two hardware structures, in Sections
2.3 and 2.4.
We’ll then explain how they apply to the overhead issue with our two basic
platform types, multicore (Section 2.5.1.1) and cluster (Section 2.5.1.2).
We’ll cover just enough details to illustrate the performance issues discussed
later in this chapter, and return for further details in later chapters.
1 What about clouds? A cloud consists of multicore machines and clusters too, but
one generally must install software for the purpose of controlling which program runs on
which machines. But your two-node home system is still a cluster.
2.3.1 Caches
A device commonly used to deal with slow memory access is a cache. This
is a small but fast chunk of memory that is located on or near the processor
chip. For this purpose, memory is divided into blocks, say of 64 bytes each.
Memory address 1200, for instance, would be in block 18, since 1200/64 is
equal to 18 plus a fraction. (The first block is called Block 0.)
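In R notation, the block-number arithmetic in this example is just integer division, with 64-byte blocks assumed as above:

1200 %/% 64   # 18: the block containing address 1200
1200 %%  64   # 48: the offset of that address within the block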
The cache is divided into lines, each the size of a memory block. At any
given time, the cache contains local copies of some blocks of memory, with
the specific choice of blocks being dynamic—at some times the cache will
contain copies of some memory blocks, while a bit later it may contain
copies of some other blocks.3
If we are lucky, in most cases, the memory word that the processor wishes
to access (i.e., the variable in the programmer’s code she wishes to access)
already has a copy in its cache—a cache hit. If this is a read access (of
x in our little example above), then it’s great—we avoid the slow memory
access.
On the other hand, in the case of a write access (to y above), if the requested
word is currently in the cache, that’s nice too, as it saves us the long trip to
memory (if we do not “write through” and update memory right away, as
we are assuming here). But it does produce a discrepancy between the given
word in memory and its copy in the cache. In the cache architecture we are
discussing here, that discrepancy is tolerated, and eventually resolved when
the block in question is “evicted,” as we will see below. (With a multicore
machine, cache operation becomes more complicated, as typically each core
will have its own cache, thus potentially causing severe discrepancies. This
will be discussed in Section 2.5.1.1.)
If in a read or write access the desired memory word is not currently in the
cache, this is termed a cache miss. This is fairly expensive. When it occurs,
the entire block containing the requested word must be brought into the
cache. In other words, we must access many words of memory, not just one.
Moreover, usually a block currently in the cache must be evicted to make
room for the new one being brought in. If the old block had been written
to at all, we must now write that entire block back to memory, to update
the latter.4
So, though we save memory access time when we have a cache hit, we incur
a substantial penalty at a miss. Good cache design can make it so that
the penalty is incurred only rarely. When a read miss occurs, the hardware
makes “educated guesses” as to which blocks are least likely to be needed
again in the near future, and evicts one of these. It usually guesses well, so
3 What follows below is a description of a common cache design. There are many variations.
4 The hardware does not keep track of which particular words in the block were modified, so the entire block must be written.
that cache hit rates are typically well above 90%. Note carefully, though,
that this can be affected by the way we code. This will be discussed in
future chapters.
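As a small, machine-dependent illustration of how access order matters: R stores matrices in column-major order, so summing a large matrix column by column touches consecutive memory locations, while going row by row jumps around in memory. The snippet below is a sketch only; the exact timings will vary from machine to machine, but the column-wise version is typically noticeably faster.

m <- matrix(runif(4000 * 4000), nrow = 4000)
# column-wise: each m[, j] is contiguous in memory
system.time({ s <- 0; for (j in 1:ncol(m)) s <- s + sum(m[, j]) })
# row-wise: each m[i, ] is scattered across memory
system.time({ s <- 0; for (i in 1:nrow(m)) s <- s + sum(m[i, ]) })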
A machine will typically have two or more levels of cache. The one in or
next to the CPU is called the L1, or Level 1 cache. Then there may be an
L2 cache, a “cache for the cache.” If the desired item is not found in the L1
cache, the CPU will then search the L2 cache before resorting to accessing
the item in memory.
Though it won’t arise much in our context, we should at least briefly dis-
cuss virtual memory. Consider our example above, in which our program
contained variables x and y. Say these are assigned to addresses 200 and
8888, respectively. Fine, but what if another program is also running on the
machine? The compiler/interpreter may have assigned one of its variables,
say g, to address 200. How do we resolve this?
The standard solution is to make the address 200 (and all others) only
“virtual.” It may be, for instance, that x from the first program is actually
stored in physical address 7260. The program will still say x is at word
200, but the hardware will translate 200 to 7260 as the program executes.
If g in the second program is actually in word 6548, the hardware will
replace 200 by 6548 every time the program requests access to word 200.
The hardware has a table to do these lookups, one table for each program
currently running on the machine, with the table being maintained by the
operating system.
Virtual memory systems break memory into pages, say of 4096 bytes each,
analogous to cache blocks. Usually, only some of your program’s pages are
resident in memory at any given time, with the remainder of the pages out
on disk. If your program needs some memory word not currently resident—
a page fault, analogous to a cache miss—the hardware senses this, and
transfers control to the operating system. The OS must bring in the re-
quested page from disk, an extremely expensive operation in terms of time,
due to the fact that a disk drive is mechanical rather than electronic like
RAM.5 Thus page faults can really slow down program speed, and again as
with the cache case, you may be able to reduce page faults through careful
design of your code.
5 Some more expensive drives, known as Solid State Drives (SSDs), are in fact elec-
tronic.
Both cache misses and page faults are enemies of good performance, so it
would be nice to monitor them.
This actually can be done in the case of page faults. As noted, a page
fault triggers a jump to the OS, which can thus record it. In Unix-family
systems, the time command gives not only run time but also a count of
page faults.
By contrast, cache misses are handled purely in hardware, thus not record-
able by the OS. But one might try to gauge the cache behavior of a program
by using the number of page faults as a proxy. There are also simulators,
such as valgrind, which can be used to measure cache performance.
A single Ethernet (or other similar system), say within a building, is called
a network. The Internet is simply the interconnection of many networks–
millions of them.
Say you direct the browser on your computer to go to the Cable Network
News (CNN) home page, and you are located in San Francisco. Since CNN
is headquartered in Atlanta, packets of information will go from San Fran-
cisco to Atlanta. (Actually, they may not go that far, since Internet service
providers (ISPs) often cache Web pages, but let’s suppose that doesn’t
occur.) Actually, a packet’s journey will be rather complicated:
• Your browser program will write your Web request to a socket. The
latter is not a physical object, but rather a software interface from
your program to the network.
• The socket software will form a packet from your request, which will
then go through several layers of the network protocol stack in your
OS. Along the way, the packet will grow, as more information is being
added, but also it will split into multiple, smaller packets.
• Your packets will wend their way across the country, being sent from
one network to the next.6
• When your packets reach a CNN computer, they will now work their
way up the levels of the OS, finally reaching the Web server program.
Getting there is half the fun—old saying, regarding the pleasures of traveling
The speed of a communications channel—whether between processor cores
and memory in shared-memory platforms, or between network nodes in a
cluster of machines—is measured in terms of latency, the end-to-end travel
time for a single bit, and bandwidth, the number of bits per second that we
can pump onto the channel.
To make the notions a little more concrete, consider the San Francisco Bay
Bridge, a long, multilane structure for which westbound drivers pay a toll.
The notion of latency would describe the time it takes for a car to drive
from one end of the bridge to the other. (For simplicity, assume they all go
the same speed.) By contrast, the bandwidth would be the number of cars
exiting from the toll booths per unit time. We can reduce the latency by
raising the speed limit on the bridge, while we could increase the bandwidth
by adding more lanes and more toll booths.
6 Run the traceroute command on your machine to see the exact path, though this may vary over time.
Thus the time needed to send a message of n bits on a channel with latency l and bandwidth b is
l + n/b        (2.1)
Of course, this assumes that there are no other messages contending for the
communication channel.
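For instance, a quick back-of-the-envelope calculation in R, using made-up but plausible figures (1 microsecond latency, 10 gigabits per second bandwidth, and a message of 8 million bits, i.e., roughly a 1-megabyte object); the helper function name here is just illustrative:

xfertime <- function(l, b, n) l + n / b   # transfer time (2.1), in seconds
xfertime(1e-6, 1e10, 8e6)                 # about 0.0008 seconds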
Clearly there are numerous delays in networks, including the less-obvious
ones incurred in traversing the layers of the OS. Such traversal involves
copying the packet from layer to layer, and in cases of interest in this book,
such copying can involve huge matrices and thus take a lot of time.
Though parallel computation is typically done within a network rather than
across networks as above, many of those delays are still there. So, network
speeds are much, much lower than processor speeds, both in terms of latency
and bandwidth.
The latency in even a fast network such as Infiniband is on the order of
microseconds, i.e., millionths of a second, which is eons compared to the
nanosecond level of execution time for a machine instruction in a processor.
(Beware of a network that is said to be fast but turns out only to have high
bandwidth, not also low latency.)
Latency and bandwidth issues arise in shared-memory systems too. Con-
sider GPUs, for instance. In most applications, there is a lot of data trans-
fer between the CPU and the GPU, with attendant potential for slowdown.
Latency, for example, is the time for a single bit to go from the CPU to the
GPU, or vice versa.
One way to ameliorate the slowdown from long latency delays is latency
hiding. The basic idea is to try to do other useful work while a communi-
cation having long latency is pending. This approach is used, for instance,
in the use of nonblocking I/O in message-passing systems (Section 8.7.1) to
deal with network latency, and in GPUs (Chapter 6) to deal with memory
latency.
Multicore machines have become standard on the desktop (even in the cell
phone!), and many data scientists have access to computer clusters. What
are the performance issues on these platforms? The next two sections
provide an overview.
2.5.1.1 Multicore
• There are memory banks, the Ms, in which your program and data
reside during execution.7
7 The term banks had for a time become less popular, but with the recent popularity of GPUs it has come back into favor.
The shared memory becomes the vehicle for communication between the various
processes.
Your program consists of a number of machine language instructions. (If
you write in an interpreted language such as R, the interpreter itself consists
of such instructions.) As the processors execute your program, they will
fetch the instructions from memory.
As noted earlier, your data—the variables in your program—is stored in
memory. The machine instructions fetch the data from memory as needed,
so that it can be processed, e.g., summed, in the processors.
Until recently, ordinary PCs sold at your local electronics store followed
the model in Figure 2.1 but with only one P. Multiprocessor systems en-
abled parallel computation, but cost hundreds of thousands of dollars. But
then it became standard for systems to have a multicore form. This means
that there are multiple Ps, but with the important distinction that they
are all on a single chip (each P is one core), making for inexpensive sys-
tems.8 Whether on a single chip or not, having multiple Ps sets up parallel
computation, and is known as the shared-memory paradigm, for obvious
reasons.
By the way, why are there multiple Ms in Figure 2.1? To improve memory
performance, the system is set up so that memory is partitioned into several
banks (typically there are the same number of Ms as Ps). This enables
us to not only do computation on a parallel basis—several Ps working on
different pieces of a problem in parallel—but also to do memory access
in parallel—several memory accesses being active in parallel, in different
banks. This amortizes the memory access penalty. Of course, if more than
one P happens to need to access the same M at about the same time, we
lose this parallelism.
As you can see, a potential bottleneck is the bus. When more than one P
needs to access memory at the same time, even if to different banks, all but
one of them must wait to place its memory access request on the bus.
This bus contention can cause significant slowdown. Much more
elaborate systems, featuring multiple communications channels to memory
rather than just a bus, have also been developed and serve to ameliorate
the bottleneck issue. Most readers of this book, however, are more likely
to use a multicore system on a single memory bus.
You can see now why efficient memory access is such a crucial factor in
achieving high performance. There is one more tool to handle this that is
8 Terminology is not standardized, unfortunately. It is common to refer to that chip
as “the” processor, even though there actually are multiple processors inside.
vital to discuss here: Use of caches. Note the plural; in Figure 2.1, there is
usually a C in between each P and the bus.
As with uniprocessor systems, caching can bring a big win in performance.
In fact, the potential is even greater with a multiprocessor system, since
caching will now bring the additional benefit of reducing bus contention.
Unfortunately, it also produces a new problem, cache coherency, as follows.9
Consider what happens upon a write hit, i.e., a write to a location for which
a local cache copy exists. For instance, consider code such as
x = 28;
with x having address 200. This code might be executed at a time when
there is a copy of word 200 in that processor’s cache. The problem is that
other caches may also have a copy of this word, so they are now invalid
for that block. (Recall that validity is defined only at the block level; if all
words in a block but one are valid, the whole block is considered invalid.)
The hardware must now inform those other caches that their copies of this
block are invalid.
The hardware does so via the bus, thus incurring an expensive bus opera-
tion. Moreover, the next time this word (or for that matter, any word in
this block) is requested at one of the other caches, there will be a cache
miss, again an expensive event.
Once again, proper coding on the programmer’s part can sometimes ame-
liorate the cache coherency problem.
A final point on multicore structure: Even on a uniprocessor machine, one
generally has multiple programs running concurrently. You might have your
browser busy downloading a file, say, while at the same time you are using a
photo processing application. With just a single processor, these programs
will actually take turns running; each one will run for a short time, say 50
milliseconds, then hand off the processor to the next program, in a cyclic
manner. (You as the user probably won’t be aware of this directly, but you
may notice the system as a whole slowing down.) Note by the way that if a
program is doing a lot of input/output (e.g., file access), it is effectively idle
during I/O times; as soon as it starts an I/O operation, it will relinquish
the processor.
By contrast, on a multicore machine, you can have multiple programs run-
ning physically simultaneously (though of course they will still take turns
if there are more of them than there are cores).
9 As noted earlier, there are variations of the structure described here, but this one is
typical.
2.5.1.2 Clusters
These are much simpler to describe, though with equally thorny perfor-
mance obstacles.
The term cluster simply refers to a set of independent processing elements
(PEs) or nodes that are connected by a local area network, such as the
common Ethernet or the high-performance Infiniband. Each PE consists
of a CPU and some RAM. The PE could be a full desktop computer,
including keyboard, disk drive and monitor, but if it is used primarily for
parallel computation, then just one monitor, keyboard and so on suffice for
the entire system. A cluster may also have a special operating system, to
coordinate assigning of user programs to PEs.
We will have one computational process per PE (unless each PE is a multi-
core system, as is common). Communication between the processes occurs
via the network. The latter aspect, of course, is where the major problems
occur.
Say you have a threaded program, for example with four threads and a
machine with four cores. Then the four threads will run physically simul-
taneously (if there are no other programs competing with them). That of
course is the entire point, to achieve parallelism, but there is more to it
than that.
Modern operating systems for general-purpose computers use timesharing:
Unseen by the users, programs are taking turns (timeslices) using the cores
of the machine. Say for instance that Manny and Moe are using a university
computer named Jack, with Manny sitting at the console and Moe logged in
remotely. For concreteness, say Manny is running an R program, and Moe
is running something in Python, with both currently having long-running
computations in progress.
Assume first that this is a single-core machine. Then only one program
can run at a time. Manny’s program will run for a while, but after a set
amount of time, the hardware timer will issue an interrupt, causing a jump
to another program. That program has been configured to be the operating
system. The OS will look on its process table to find another program in
ready state, meaning runnable (as opposed to say, suspended while awaiting
keyboard input). Assuming there are no other processes, Moe’s program
will now get a turn. This transition from Manny to Moe is called a context
switch. Moe’s program will run for a while, then another interrupt comes,
and Manny will get another turn, and so on.
Now suppose it is a dual-core machine. Here Manny and Moe’s programs
will run more or less continuously, in parallel, though with periodic down-
times due to the interrupts and attendant brief OS runs.
But suppose Moe’s code is threaded, running two threads. Now we will have
three threads—Moe’s two and Manny’s one (even a non-threaded program
consists of one thread)—competing to use the machine’s two cores. Moe’s two
threads will sometimes run in parallel with each other but sometimes not.
Instead of a 2X speedup, Moe is getting about 1.5X.10
There are also possible cache issues. When a thread starts a new turn, it
may be on a different core than that used in the last turn. If there is a
separate cache for each core, the cache at the new core probably contains
little if anything useful to this thread. Thus there will be a lot of cache
misses for a while in this timeslice. There may be a remedy in the form of
setting processor affinity; see Section 5.9.4.
By the way, what happens when one of those programs finishes its com-
putation and returns to the user prompt, e.g., > in the case of Manny’s
R program? R will then be waiting for Manny’s keyboard input. But the
OS won’t wait, and the OS does in fact get involved. R is trying to read
from the keyboard, and to do this it calls a C library function, which in
turn makes a call to a function in the OS. The OS, realizing that it may be
quite a while before Manny types, will mark his entry in the process table
as being in sleep state. When he finally does hit a key, the keyboard sends
an interrupt,11 causing the OS to run, and the latter will mark his program
as now being back in ready state, and it will eventually get another turn.
10 Even the 2X figure assumes that Moe’s code was load balanced in the first place, i.e., that his two threads do equal amounts of work.
To make this concrete, let’s measure times for the mutual outlinks problem
(Section 1.4), with larger and larger numbers of processes.
Here I ran on a shared-memory machine consisting of four processor chips,
each of which has eight cores. This gives us a 32-core system, and I ran
the mutual outlinks problem with values of nc, the number of cores, equal
to 2, 4, 6, 8, 10, 12, 16, 24, 28 and 32. The problem size was 1000 rows by
1000 columns. The times are plotted in Figure 2.2.
Here we see a classical U-shaped pattern: As we throw more and more
processes on the problem, it helps in the early stages, but performance
actually degrades after a certain point. The latter phenomenon is probably
due to the communications overhead we discussed earlier, in this case bus
contention and the like.12
12 Though the processes are independent and do not share memory, they do share the
bus.
By the way, for each of our nc workers, we had one invocation of R running
on the machine. There was also an additional invocation, for the manager.
However, this is not a performance issue in this case, as the manager spends
most of its time idle, waiting for the workers.
With all this talk of physical obstacles to overcome, such as memory access
time, it’s important also to raise the question as to whether the application
itself is very parallelizable in the first place. One measure of that is “big
O” notation.
In our mutual outlinks example with an n × n adjacency matrix, we need
to do on average n/2 sum operations per row, with n rows, thus n · n/2
operations in all. In parallel processing circles, the key question asked about
hardware, software, algorithms and so on is, “Does it scale?”, meaning,
Does the run time grow manageably as the problem size grows?
We see above that the run time of the mutual outlinks problem grows
proportionally to the square of the problem size, in this case the number of
websites. (Dividing by 2 doesn’t affect this growth rate.) We write this as
O(n^2), known colloquially as “big O” notation. When applied to analysis
of run time, we say that it measures the time complexity.
Ironically, applications that are manageable often are poor candidates for
parallel processing, due to overhead playing a greater role in such problems.
An application with O(n) time complexity, for instance, may present a
challenge. We will return to this notion at various points in this book.
Some parallel R packages, e.g., snow, that send data through a network
serialize the data, meaning to convert it to ASCII form. The data must
then be unserialized on the receiving end. This creates a delay, which may
or may not be serious but must be taken into consideration.
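One rough way to gauge this overhead for a given object is to time R's own serialize() and unserialize() functions directly. This is a sketch, using a made-up 1000 × 1000 matrix; it measures R-level serialization cost, not the full cost of shipping the object across a network.

m <- matrix(runif(10^6), nrow = 1000)
system.time(r <- serialize(m, NULL))   # time to serialize on the sending end
length(r)                              # number of bytes that must be shipped
system.time(m2 <- unserialize(r))      # time to unserialize on the receiving end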
The term embarrassingly parallel is heard often in talk about parallel pro-
gramming. It is a central topic, hence deserving of having a separate section
devoted to it.
Putting aside the fact that this computation can be done with R’s sum()
function, the point is that for each i, the computation needs the previous
value of total.
With this restriction of independent iterations, it would seem that we have
an embarrassingly parallel class of applications. In terms of programma-
bility, it is true. Using snow, for example in the mutual Web links code in
Section 1.4.5, we simply called clusterApply() on the range of i that we
had had in our serial loop:
ichunks <- 1:(nr-1)
This distributed the various iterations for execution by the workers. So,
isn’t it equally simple for any for loop?
The answer is No, because different iterations may have widely different
times. If we are not careful, we can end up with a serious load balance
issue. In fact, this was even the case in the mutual Web links code above—
for larger values of i, the function doichunk() has less work to do: In the
(serial) code in Listing 1.4.1, page 6, the matrix multiplication involves a
matrix with n-i rows at iteration i.
This can cause big load balancing problems if we are not careful as to
how we assign iterations to workers, i.e., how we do the loop scheduling.
Moreover, we typically don’t know the loop iteration times in advance, so
the problem of efficient loop scheduling is even more difficult. Methods to
address these issues will be the thrust of this chapter.
Suppose we have k processes and many loop iterations. Suppose too that
we do not know beforehand how much time each loop iteration will take.
Common types of loop scheduling are the following:
• Static scheduling: the iterations are assigned to the processes ahead of time, before execution begins.
• Dynamic scheduling: iterations are handed out one (or a few) at a time during execution, each going to whichever process becomes free next.
• Chunking: iterations are doled out in groups, or chunks, rather than one at a time.
• Reverse scheduling: iterations are assigned in reverse of their natural order, which helps when iteration times increase with the iteration index, so that the longest tasks are started first.
Note that while static and dynamic scheduling are mutually exclusive, one
can do chunking and reverse scheduling with either.
To make these notions concrete, suppose we have two processes, P1 and P2, and just three loop iterations, A, B and C. Consider two candidate schedules:
• Schedule I: Dole out the loop iterations in Round Robin, i.e., cyclic
order—assign A to P1, B to P2 and C to P1, statically.
• Schedule II: Dole out the loop iterations dynamically, one at a time,
as execution progresses. Let us suppose we do this in reverse order,
i.e., C, B and A, because we suspect that their loop iteration times
decrease in this order. (The relevance of this will be seen below.)
Now suppose loop iterations A, B and C have execution times of 10, 20 and
40, respectively. Let’s see how early we would finish the full loop iteration
set, and how much wasted idleness we would have, under both schedules.
In Schedule I, when P1 finishes loop iteration A at time 10, it starts C,
finishing the latter at time 50. P2 finishes at time 20, and then sits idle
during time 20-50.
Under Schedule II, there may be some randomness in terms of which of
P1 and P2 gets loop iteration C. Say it is P1. P1 will execute only loop
iteration C, never having a chance to do more. P2 will do B, then pick
up A and perform that loop iteration. The overall loop iteration set will
be completed at time 40, with only 10 units of idle time. In other words,
Schedule II outperforms Schedule I, both in terms of how soon we complete
the project and how much idle time we must tolerate.
By the way, note that a static version of Schedule II, still using the (C,B,A)
order, would in this case have the same poor performance as Schedule I.
There are two aspects, though, which we must consider:
• Schedule II, and any dynamic method, may exact a substantial over-
head penalty. In snow, for instance, there would need to be commu-
nication between a worker and the manager, in order for the worker to
determine which task is next assigned to it. Static scheduling doesn’t
have this drawback.
Only one line of the code from Section 1.4.5 will be changed, but for con-
venience let’s see it all in one piece:
doichunk <- function(ichunk) {
   tot <- 0
   nr <- nrow(lnks)  # lnks global at worker
   for (i in ichunk) {
      tmp <- lnks[(i+1):nr, ] %*% lnks[i, ]
      tot <- tot + sum(tmp)
   }
   tot
}
As before, our function mutoutpar() divides the i values into chunks, but
now they are real chunks, not one i value per chunk as before. It does so
via the snow function clusterSplit():
mutoutpar <- function(cls, lnks) {
   nr <- nrow(lnks)  # lnks global at manager
   clusterExport(cls, "lnks")
   ichunks <- clusterSplit(cls, 1:(nr-1))
   tots <- clusterApply(cls, ichunks, doichunk)
   Reduce(sum, tots) / nr
}
So, what does clusterSplit() do? Say lnks has 500 rows and we have 4
workers. The goal here is to partition the row numbers 1,2,...,500 into 4
equal (or roughly equal) subsets, which will serve as the chunks of indices
for each worker to process. Clearly, the result should be 1-125, 126-250,
251-375 and 376-500, which will correspond to the values of i in the outer
for loop in our serial code, Listing 1.4.1. Worker 1 will process the outer
loop iterations for i = 1,2,...,125, and so on.
Let’s check this. To save space below, let’s try it on a smaller example,
1,2,...,50, on the cluster cls:
> clusterSplit(cls, 1:50)
[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13
[[2]]
 [1] 14 15 16 17 18 19 20 21 22 23 24 25
[[3]]
 [1] 26 27 28 29 30 31 32 33 34 35 36 37
[[4]]
 [1] 38 39 40 41 42 43 44 45 46 47 48 49 50
So, again thinking of the case of 500 rows and 4 workers, the code
ichunks <- clusterSplit(cls, 1:(nr-1))
tots <- clusterApply(cls, ichunks, doichunk)
will send the chunk 1:125 to the first worker, 126:250 to the second, 251:375
to the third, and 376:499 to the fourth. The return list, assigned to tots,
will now consist of four elements, rather than 499 as before.
Again, the only change from the previous version of this code was to add
real chunks. This ought to help, because it allows us to better leverage
the fact that R can do matrix multiplication fairly quickly. Let’s see if this
turns out to be the case. Here is a brief timing experiment using 8 cores
on our usual 32-core machine, on a 1000 × 1000 problem:
chunking?    time
no          9.062
yes         6.264
Indeed, we got a speed improvement of about 30%.
The reader should not hesitate to make liberal use of for loops when the main
advantage of non-loop code would be code compactness. In particular, use
of apply() typically does not bring a speed improvement, and though we
use it frequently in this book, the reader may prefer to stick with good
old-fashioned loops instead.
(a) For each set of predictors, we could perform the regression computation
for that set in parallel. For instance, all the processes would
work in concert in computing the model using predictors 2 and 5.
(b) We could parallelize across predictor sets: each process would be
assigned some of the sets, and would compute the regressions for its
sets by itself, serially.
Option (a) has problems. For a given set of m predictors, we must first com-
pute various sums of squares and products. Each sum has n summands, and
there are O(m^2) sums, making for a computational complexity of O(nm^2).
(Recall that this notation was introduced in Section 2.9.) Then a matrix
inversion (or equivalent operation, such as QR factorization) must be done,
with complexity O(m^3).1
Unfortunately, matrix inversion is not an embarrassingly parallel operation,
and though many good methods have been developed, it is much easier here
to go the route of option (b). The latter is embarrassingly parallel, and in
fact involves a loop.
Below is a snow implementation of doing this in parallel. It finds the
adjusted R2 value for all models in which the predictor set has size at most
k. The user can opt for either static or dynamic scheduling, or reverse the
order of iterations, and can specify a (constant) chunk size.
# regresses response variable Y column against
# all possible subsets of the Xi predictor variables,
# with subset size up through k; returns the
# adjusted R-squared value for each subset

# scheduling parameters:
#
#    static (clusterApply())
#    dynamic (clusterApplyLB())
#    reverse the order of the tasks
#    chunk size (in dynamic case)

# arguments:
#    cls: Snow cluster
#    x: matrix of predictors, one per column
#    y: vector of the response variable
#    k: max size of predictor set
#    reverse: TRUE means reverse the order
#       of the iterations
#    dyn: TRUE means dynamic scheduling
#    chunksize: scheduling chunk size
1 In the QR case, the complexity may be O(m^2), depending on exactly what is being computed.
# return value:
#    R matrix, showing adjusted R-squared values,
#    indexed by predictor set

# generate all nonempty subsets of 1..p of size <= k;
# returns an R list, one element per predictor set,
# in the form of a vector of indices
genallcombs <- function(p, k) {
   allcombs <- list()
   for (i in 1:k) {
      tmp <- combn(1:p, i)
      allcombs <- c(allcombs, matrixtolist(tmp, rc=2))
   }
   allcombs
}
# find the adjusted R-squared values for the given
# predictor set onepset; return value will be the
# adj. R2 value, followed by the predictor set
# indices, with 0s as filler--for convenience, all
# vectors returned by calls to do1pset() have
# length k+1; e.g. for k = 4, (0.28,1,3,0,0) would
# mean the predictor set consisting of columns 1 and
# 3 of x, with an R2 value of 0.28
do1pset <- function(onepset, x, y) {
   slm <- summary(lm(y ~ x[, onepset]))
   n0s <- ncol(x) - length(onepset)
   c(slm$adj.r.squared, onepset, rep(0, n0s))
}
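For instance, here is how do1pset() might be tried out by hand on made-up data (a sketch only; columns 1 and 3 are given real effects, so that predictor set should score reasonably well):

x <- matrix(rnorm(100 * 4), ncol = 4)
y <- x[, 1] + 0.5 * x[, 3] + rnorm(100)
do1pset(c(1, 3), x, y)   # adj. R-squared, then 1 3, padded with two 0s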
# generate random data and see which
# predictor set seems best
snowtest <- function(cls, n, p, k, chunksize=1,
      dyn=F, rvrs=F) {
   gendata(n, p)
   snowapr(cls, x, y, k, rvrs, dyn, chunksize)
}
> snowtest(c8,100,4,2)
[,1] [,2] [,3] [,4] [,5]
[1,] 0.21941625 1 0 0 0
[2,] 0.05960716 2 0 0 0
[3,] 0.11090411 3 0 0 0
[4,] 0.15092073 4 0 0 0
[5,] 0.26576805 1 2 0 0
[6,] 0.35730378 1 3 0 0
[7,] 0.32840075 1 4 0 0
[8,] 0.17534962 2 3 0 0
[9,] 0.20841665 2 4 0 0
[10,] 0.27900555 3 4 0 0
As noted in Section 3.3, parallel code does tend to involve a lot of detail,
so it is important to keep in mind the overall strategy of the algorithm. In
the case at hand here, the strategy is as follows:
• Generate the list of all predictor subsets of size up through k.
• Partition that list into chunks, and parcel the chunks out to the workers.
• Each worker calls lm() on each of its assigned subsets, recording the adjusted R-squared value for each one.
• The manager gathers the results into a single matrix, one row per predictor set.
Note that our approach here is consistent with the discussion in Section
1.1, i.e., to have our code leverage the power of R: Each worker calls the R
linear model function lm().
Our main function snowapr() will first call genallcombs() which, as its
name implies, will generate all the combinations of predictor variables, one
combination per list element:
> genallcombs(4,2)
[[1]]
[1] 1
[[2]]
[1] 2
[[3]]
[1] 3
[[4]]
[1] 4
[[5]]
[1] 1 2
[[6]]
[1] 1 3
[[7]]
[1] 1 4
[[8]]
[1] 2 3
[[9]]
[1] 2 4
[[10]]
[1] 3 4
For example, the last list element says that one of the combinations is (3,4),
corresponding to the model with predictors 3 and 4, i.e., columns 3 and 4
of x.
Thus, the list allcombs is our task list, one task per element of the list.
As mentioned, the basic idea is simple: We distribute these tasks, 10 of
them in this case, to the workers. Each worker then runs regressions on
each of its assigned combinations, and returns the results to the manager,
which coalesces them.
3.4.4.2 Chunking
In the above example, with a chunk size of 2 and 10 combinations, the vector
tasks will be (1,3,5,7,9). Our code will interpret
these numbers as the starting indices of the various chunks, with for exam-
ple 3 meaning the chunk starting at the third combination, i.e., the third
element of allcombs. Since our chunk size is 2 in this example, the chunk
will consist of the third and fourth combinations in allcombs: This chunk
will consist of two single-predictor models, one using predictor 3 and the
other using predictor 4.
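For concreteness, here is how such a vector of chunk starting indices can be
formed; this mirrors the seq() call used in the code:

# chunk starting indices for 10 combinations and a chunk size of 2
ncombs <- 10
chunksize <- 2
tasks <- seq(1, ncombs, chunksize)
tasks   # 1 3 5 7 9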
Let us name our two workers P1 and P2 , and suppose we use static schedul-
ing, the default for snow. The package implements scheduling in a Round
Robin manner. Recalling that our vector tasks is (1,3,5,7,9), we see that
1 will be assigned to P1 , 3 will be assigned to P2 , 5 will be assigned to P1 ,
and so on. Again, note that assigning 3 to P2 , for instance, means that
combinations 3 and 4 will be handled by that worker, since our chunk size
is 2.
In our call to snowapr(), we would set chunksize to 2 and set dyn to
FALSE, as we are using static scheduling. We are not reversing the order
of tasks, so we set rvrs to FALSE.
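Concretely, with cls denoting a two-worker snow cluster and the p = 4, k = 2
test data from above, the call for this scenario would be of roughly the
following form (a sketch; the argument names follow the author's own timing
calls below, and the unlisted reverse argument is assumed to default to FALSE):

# static scheduling, chunk size 2, natural task order
snowapr(cls, x, y, 2, dyn=FALSE, chunk=2)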
In the dynamic case, at first the assignment will match the static case, with
P1 getting combinations 1 and 2, and P2 being assigned 3 and 4. After that,
though, things are unpredictable. The manager could assign combinations
5 and 6 to either P1 or P2 , depending on which worker finishes its initial
combinations first. It’s a “first come, first served” kind of setup. The
snow package includes a variant of clusterApply() that does dynamic
scheduling, named clusterApplyLB() (“LB” for “load balance”).
As seen in the toy example in Section 3.1, it may be advantageous to
schedule iterations in reverse order. This is requested by setting reverse
to TRUE. Since iteration times are clearly increasing in this application,
we should consider using this option.
Now consider the heart of the code, the call to clusterApply() (and the
paired call to clusterApplyLB(), which works the same way).
As mentioned, tasks will be (1,3,5,7,9), each element of which will be fed
into the function dochunk() by a worker. P1 , as noted, will do this for the
elements 1, 5 and 9, resulting in three calls to dochunk() being made by
P1 . In those calls, psetsstart will be set to 1, 5 and 9, respectively.
Note that we’ve written our function dochunk() to have five arguments.
The first one will come from a portion of tasks, as explained above. The
value of that argument will be different for each worker. But the other four
arguments will be taken from the items that follow dochunk in the call
out <- clusterApply(cls, tasks, dochunk, x, y, allcombs,
   chunksize)
The values of these arguments will be the same for all workers. The snow
function clusterApply() is structured this way, i.e., with all arguments
following the worker function (dochunk() in this case) being passed in
common to all the workers.
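As a tiny self-contained illustration of this argument convention (not part of
the regression example):

library(parallel)
cls <- makeCluster(2)
# the first argument to the worker function varies from task to task;
# the remaining arguments are supplied in common to every call
clusterApply(cls, 1:4, function(taskval, addend) taskval + addend, 100)
# returns list(101, 102, 103, 104)
stopCluster(cls)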
For convenience, here is a copy of the code of relevance right now:
dochunk <- function(psetsstart, x, y, allcombs,
                    chunksize) {
   ncombs <- length(allcombs)   # allcombs is an R list
   lasttask <- min(psetsstart+chunksize-1, ncombs)
   t(sapply(allcombs[psetsstart:lasttask],
      do1pset, x, y))
}
do1pset <- function(onepset, x, y) {
   slm <- summary(lm(y ~ x[,onepset]))
   n0s <- ncol(x) - length(onepset)
   c(slm$adj.r.squared, onepset, rep(0, n0s))
}
And here again is (part of) what we found earlier for allcombs:
[[1]]
[1] 1
[[2]]
[1] 2
[[3]]
[1] 3
...
t(sapply(allcombs[psetsstart:lasttask], do1pset, x, y))
3.4.4.5 Wrapping Up
> system.time(apr(x,y,3))
user system elapsed
35.070 0.132 35.295
> system.time(snowapr(c2,x,y,3))
user system elapsed
31.006 5.028 77.447
This is awful! Instead of cutting the run time in half, using two processes
actually doubled the time. This is a great example of the problems that
overhead can bring.
Let’s see if dynamic scheduling helps:
> system.time(snowapr(c2,x,y,3,dyn=T))
user system elapsed
33.370 4.844 64.543
A little better, but still slower than the serial version. Maybe chunking will
help?
> system.time(snowapr(c2,x,y,3,dyn=T,chunk=10))
user system elapsed
2.904 0.572 22.753
> system.time(snowapr(c2,x,y,3,dyn=T,chunk=25))
user system elapsed
1.340 0.240 19.677
> system.time(snowapr(c2,x,y,3,dyn=T,chunk=50))
user system elapsed
0.652 0.128 19.692
Ah! That’s more like it. It’s not quite clear from this limited experiment
what chunk size is best, but all of the above sizes worked well.
How about an eight-process snow cluster?
> system.time(snowapr(c8,x,y,3,dyn=T,chunk=10))
user system elapsed
3.861 0.568 7.542
> system.time(snowapr(c8,x,y,3,dyn=T,chunk=15))
user system elapsed
2.592 0.284 6.828
> system.time(snowapr(c8,x,y,3,dyn=T,chunk=20))
user system elapsed
1.808 0.316 6.740
> system.time(snowapr(c8,x,y,3,dyn=T,chunk=25))
user system elapsed
1.452 0.232 7.082
This is approximately a five-fold speedup over the serial version, very nice.
Incidentally, with p = 20 predictors and k = 3, there are quite a few predictor
sets to process:

> length(genallcombs(20,3))
[1] 1350
We did get good speedups above from parallelization, but at the same time
we should have some nagging doubts. After all, we are doing an awful lot
of duplicate work.
If you have background in the mathematics of linear models (don't worry
about this if you don't, as the following will still be readable), you know
that the vector of estimated regression coefficients is calculated as
\widehat{\beta} = (X'X)^{-1} X'Y \qquad (3.1)
For example, say we are currently working with the predictor set (2,3,4).
Let $\tilde{X}$ denote the analog of X for this set. Then it can be shown that
$\tilde{X}'\tilde{X}$ is equal to the 3×3 submatrix of X'X corresponding to rows 3-5 and
columns 3-5 of the latter.
So it makes sense to calculate X 0 X once and for all, and then extract
submatrices as needed.
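A quick numerical check of this fact (a small sketch; here column 1 of X is
the constant-term column, so predictors 2, 3 and 4 occupy columns 3 through 5):

set.seed(1)
X <- cbind(1, matrix(rnorm(100*4), ncol=4))   # 4 predictors plus constant
XpX <- crossprod(X)                           # X'X, 5 x 5
Xtilde <- X[, 3:5]                            # predictors 2, 3, 4
all.equal(crossprod(Xtilde), XpX[3:5, 3:5])   # TRUE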
3.6.1 Code
# regresses response variable Y column against
# all possible subsets of the Xi predictor variables,
# with subset size up through k; returns the
# adjusted R-squared value for each subset

# scheduling methods:
#
#    static (clusterApply())
#    dynamic (clusterApplyLB())
#    reverse the order of the tasks
#    varying chunk size (in dynamic case)

# arguments:
#    cls: cluster
#    x: matrix of predictors, one per column
#    y: vector of the response variable
#    k: max size of predictor set
#    reverse: TRUE means reverse the order of
#       the iterations
#    dyn: TRUE means dynamic scheduling
#    chunksize: scheduling chunk size

# return value:
#    R matrix, showing adjusted R-squared values,
#    indexed by predictor set
snowapr1 <- function(cls, x, y, k, reverse, dyn, chunksize) {
   x <- cbind(1, x)
   xpx <- crossprod(x, x)
   xpy <- crossprod(x, y)
   p <- ncol(x) - 1
   # generate all predictor subsets
   allcombs <- genallcombs(p, k)
   ncombs <- length(allcombs)
   clusterExport(cls, "do1pset1")
   clusterExport(cls, "linregadjr2")
   # set up task indices
   tasks <- if (!reverse)
      seq(1, ncombs, chunksize) else
      seq(ncombs, 1, -chunksize)
   if (!dyn) {
      out <- clusterApply(cls, tasks, dochunk2,
         x, y, xpx, xpy, allcombs, chunksize)
   } else {
      out <- clusterApplyLB(cls, tasks, dochunk2,
         x, y, xpx, xpy, allcombs, chunksize)
   }
   Reduce(rbind, out)
}
# generate all nonempty subsets of 1..p of size <= k;
# returns a list, one element per predictor set
genallcombs <- function(p, k) {
   allcombs <- list()
   for (i in 1:k) {
      tmp <- combn(1:p, i)
      allcombs <- c(allcombs, matrixtolist(tmp, rc=2))
   }
   allcombs
}
# process all the predictor sets in the chunk
# whose first index is psetstart
dochunk2 <- function(psetstart, x, y, xpx, xpy,
                     allcombs, chunksize) {
   ncombs <- length(allcombs)
   lasttask <- min(psetstart+chunksize-1, ncombs)
   t(sapply(allcombs[psetstart:lasttask], do1pset1,
      x, y, xpx, xpy))
}
# find the adjusted R-squared value for the given
# predictor set onepset
do1pset1 <- function(onepset, x, y, xpx, xpy) {
   ps <- c(1, onepset+1)   # account for constant term
   x1 <- x[,ps]
   xpx1 <- xpx[ps,ps]
   xpy1 <- xpy[ps]
   ar2 <- linregadjr2(x1, y, xpx1, xpy1)
   n0s <- ncol(x) - length(ps)
   # form the report for this predictor set; need
   # trailing 0s so as to form matrices of uniform
   # numbers of rows, to use rbind() in snowapr1()
   c(ar2, onepset, rep(0, n0s))
}
# finds regression estimates "from scratch"
linregadjr2 <- function(x, y, xpx, xpy) {
   bhat <- solve(xpx, xpy)
   resids <- y - x %*% bhat
   r2 <- 1 - sum(resids^2) / sum((y - mean(y))^2)
   n <- nrow(x); p <- ncol(x) - 1
   1 - (1-r2) * (n-1) / (n-p-1)   # adj. R2
}
# which predictor set seems best
snowtest1 <- function(cls, n, p, k, chunksize=1,
                      dyn=F, rvrs=F) {
   gendata(n, p)
   snowapr1(cls, x, y, k, rvrs, dyn, chunksize)
}
The main changes from the earlier version are these:

• A column of 1s, for the constant term, is prepended to x, via

   x <- cbind(1, x)

in snowapr1().²

• Our predictor set indices, e.g. (2,3,4) above, must then be shifted
accordingly in do1pset(), now named do1pset1() in this new code.
3.6.3 Timings
Let's run snowapr1() in the same settings we did earlier for snowapr().
Again, this is for n = 10000, p = 20 and k = 3, all with dyn = T and
reverse = F.

2 The reader should note that the code here has not been optimized for good numerical
accuracy.
3.7 Introducing Another Tool: multicore

As explained in Section 1.4.2, the parallel package was formed from two
contributed R packages, snow and multicore. Now that we’ve seen how
the former works, let’s take a look at the latter. (Note that just as we have
been using snow as a shorthand for “the portion of the parallel package
that was adapted from snow,” we’ll do the same for multicore.)
As the name implies, multicore must be run on a multicore machine. Also,
it’s restricted to Unix-family operating systems, notably Linux and the
Macintosh’s OS X. But with such a platform, you may find that multicore
outperforms snow.3
Unix-family OSs include a system call, i.e., a function in the OS that ap-
plication programmers can call as a service, named fork(). This is fork as
in “fork in the road,” rather than in “knife and fork.” The image the term
is meant to evoke is that of a process splitting into two.
What multicore does is call the OS fork(). The result is that if you call
one of the multicore functions in the parallel package, you will now have
two or more instances of R running on your machine! Say you have a quad
core machine, and you set mc.cores to 4 in your call to the multicore
function mclapply(). You will now have five instances of R running—
your original plus four copies. (You can check this by running your OS’ ps
command.)
This in principle should fully utilize your machine in the current computation—
four child R processes running on four cores. (The parent R process is
dormant, waiting for the four children to finish.)
An absolutely key point is that initially the four child R processes will
be exact copies of the parent. They will have the same values of your
variables, as of the time of the forks. Just as importantly, initially the four
children are actually sharing the data, i.e., are accessing the same physical
locations in memory. (Note the word initially above; any changes made to
the variables by a worker process will NOT be reflected at the manager or
at the other workers.)
To see why that is so important, think again of the all possible regressions
example earlier in this chapter, specifically the improved version discussed
in Section 3.6. The idea there was to limit duplicate computation, by
determining xpx and xpy just once, and sending them to the workers.
But the latter is a possible problem. It may take quite some time to send
large objects to the workers. In fact, shipping the two matrices to the
workers adds even more overhead, since as noted in Section 2.10, the snow
package serializes communication.
But with multicore, no such action is necessary. Because fork() creates
exact, shared, copies of the original R process, they all already have the
variables xpx and xpy! At least for Linux, a copy-on-write policy is used,
3 One should add, “In the form of snow used so far.” More on this below.
which is to have the child processes physically share the data until such time
as it is written to. But in this application, the variables do not change, so
using multicore should be a win. Note that the same gain might be made
for the variable allcombs too.
The snow package also has an option based on fork(), called makeFork-
Cluster(). Thus, potentially this same performance advantage can be
attained in snow, using that function instead of makeCluster(). If you
are using snow on a multicore platform, you should consider this option.
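For instance, on a Linux or Mac multicore machine one might set up a
fork-based cluster like this (a minimal sketch; the worker count of 4 is
arbitrary):

library(parallel)
cls <- makeForkCluster(4)   # forked workers share the parent's data at creation
# ... use cls with clusterApply(), clusterApplyLB(), etc., as before ...
stopCluster(cls)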
# scheduling methods:
#
#    static (clusterApply())
#    dynamic (clusterApplyLB())
#    reverse the order of the tasks
#    chunk size (in dynamic case)

# arguments:
#    x: matrix of predictors, one per column
#    y: vector of the response variable
#    k: max size of predictor set
#    reverse: TRUE means reverse the order
#       of the iterations
#    dyn: TRUE means dynamic scheduling
#    chunk: chunk size

# return value:
#    R matrix, showing adjusted R-squared values,
#    indexed by predictor set
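# a sketch of the multicore analog of snowapr1(), assembled from the
# argument list above and the mclapply() call discussed below; the
# default argument values shown here are assumptions
mcapr <- function(x, y, k, reverse=F, dyn=F, chunk=1, ncores=2) {
   require(parallel)
   x <- cbind(1, x)                  # prepend constant-term column
   xpx <- crossprod(x, x)
   xpy <- crossprod(x, y)
   p <- ncol(x) - 1
   allcombs <- genallcombs(p, k)     # all predictor subsets
   ncombs <- length(allcombs)
   tasks <- if (!reverse) seq(1, ncombs, chunk) else
               seq(ncombs, 1, -chunk)
   # no clusterExport() needed: the forked workers already share
   # do1pset2(), linregadjr2(), xpx, xpy, etc.
   out <- mclapply(tasks, dochunk2, x, y, xpx, xpy, allcombs, chunk,
                   mc.cores=ncores, mc.preschedule=!dyn)
   Reduce(rbind, out)
}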
# process all the predictor sets in the chunk
# whose first allcombs index is psetsstart
dochunk2 <- function(psetsstart, x, y,
                     xpx, xpy, allcombs, chunk) {
   ncombs <- length(allcombs)
   lasttask <- min(psetsstart+chunk-1, ncombs)
   t(sapply(allcombs[psetsstart:lasttask], do1pset2,
      x, y, xpx, xpy))
}
# find the adjusted R-squared value for the given
# predictor set, onepset
do1pset2 <- function(onepset, x, y, xpx, xpy) {
   ps <- c(1, onepset+1)   # account for 1s column
   xps <- x[,ps]
   xpxps <- xpx[ps,ps]
   xpyps <- xpy[ps]
   ar2 <- linregadjr2(xps, y, xpxps, xpyps)
   n0s <- ncol(x) - length(ps)
   # form the report for this predictor set; need
   # trailing 0s so as to form matrices of uniform
   # numbers of rows, to use rbind() in mcapr()
   c(ar2, onepset, rep(0, n0s))
}
# do linear regression with given xpx, xpy,
# return adj. R2
linregadjr2 <- function(xps, y, xpx, xpy) {
   # get beta coefficient estimates
   bhat <- solve(xpx, xpy)
   # find R2 and then adjusted R2
   resids <- y - xps %*% bhat
   r2 <- 1 - sum(resids^2) / sum((y - mean(y))^2)
   n <- nrow(xps); p <- ncol(xps) - 1
   1 - (1-r2) * (n-1) / (n-p-1)
}
# generate all nonempty subsets of 1..p of size <= k;
# returns a list, one element per predictor set
genallcombs <- function(p, k) {
   allcombs <- list()
   for (i in 1:k) {
      tmp <- combn(1:p, i)
      allcombs <- c(allcombs, matrixtolist(tmp, rc=2))
   }
   allcombs
}
# test data
gendata <- function(n, p) {
   x <<- matrix(rnorm(n*p), ncol=p)
   y <<- x %*% c(rep(0.5, p)) + rnorm(n)
}
As noted, the changes from the snow version are pretty small. References
to clusters are gone, and we no longer export functions like do1pset1()
to the workers, again because the workers already have them! The calls to
clusterApply() have been replaced by mclapply().4
Let’s look at the calls to mclapply():
out <- mclapply(tasks, dochunk2, x, y, xpx, xpy, allcombs, chunk,
                mc.cores=ncores, mc.preschedule=!dyn)
The call format (at least as used here) is almost identical to that of clus-
terApply(), with the main difference being that we specify the number of
cores rather than specifying a cluster.
As with snow, multicore offers both static and dynamic scheduling, by
setting the mc.preschedule parameter to either TRUE or FALSE, respec-
tively. (The default is TRUE.) Thus here we simply set mc.preschedule
to the opposite of dyn.
In the static case, multicore assigns loop iterations to the cores in a
Round Robin manner as with clusterApply().
For dynamic scheduling, mclapply() initially creates a number of R child
processes equal to the specified number of cores; each one will handle one
iteration. Then, whenever a child process returns its result to the original
R process, the latter creates a new child, to handle another iteration.
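As a tiny illustration of the two modes (requires a Unix-family OS; the worker
function here is just a placeholder):

library(parallel)
# static: iterations are split among the cores up front
mclapply(1:8, function(i) i^2, mc.cores=2, mc.preschedule=TRUE)
# dynamic: one child per iteration, new children launched as others finish
mclapply(1:8, function(i) i^2, mc.cores=2, mc.preschedule=FALSE)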
Timings:
So, does it work well? Let’s try it on a slightly larger problem than before—
using eight cores again, same n and p, but with k = 5 instead of k = 3.
Here are the better times found in runs of the improved snow version we
developed earlier:
> system.time(snowapr1(c8,x,y,5,dyn=T,chunk=300))
user system elapsed
7.561 0.368 8.398
> system.time(snowapr1(c8,x,y,5,dyn=T,chunk=450))
user system elapsed
5.420 0.228 7.175
> system.time(snowapr1(c8,x,y,5,dyn=T,chunk=600))
user system elapsed
3.696 0.124 6.677
4 Though mclapply() still has xpx etc. as arguments, what will be copied will just
be minimal, since the forked child processes already share those objects in memory.
> system.time(snowapr1(c8,x,y,5,dyn=T,chunk=800))
user system elapsed
2.984 0.124 6.544
> system.time(snowapr1(c8,x,y,5,dyn=T,chunk=1000))
user system elapsed
2.505 0.092 6.441
> system.time(snowapr1(c8,x,y,5,dyn=T,chunk=1200))
user system elapsed
2.248 0.072 7.218
And here are the times for the multicore version, mcapr():

> system.time(mcapr(x,y,5,dyn=T,chunk=50,ncores=8))
user system elapsed
35.186 14.777 7.259
> system.time(mcapr(x,y,5,dyn=T,chunk=75,ncores=8))
user system elapsed
36.546 15.349 7.236
> system.time(mcapr(x,y,5,dyn=T,chunk=100,ncores=8))
user system elapsed
37.218 9.949 6.606
> system.time(mcapr(x,y,5,dyn=T,chunk=125,ncores=8))
user system elapsed
38.871 9.572 6.675
> system.time(mcapr(x,y,5,dyn=T,chunk=150,ncores=8))
user system elapsed
34.458 8.012 5.843
> system.time(mcapr(x,y,5,dyn=T,chunk=175,ncores=8))
user system elapsed
34.754 5.936 5.716
> system.time(mcapr(x,y,5,dyn=T,chunk=200,ncores=8))
user system elapsed
39.834 7.389 6.440
There are two points worth noting here. First, of course, we see that
multicore did better, by about 10%. But also note that the snow version
required much larger chunk sizes in order to do well. This should make
sense, recalling the fact that the whole point of chunking is to amortize
the overhead. Since the snow version has more overhead, it needs a larger
chunk size to get good performance.
We’ve seen here that program performance can be quite sensitive to the
chunk size. If the chunk size is too small, we’ll have more chunks to process,
and thus will incur more overhead. But if the chunks are too large, we may
have load balance problems near the end of the run.
3.9 Example: Parallel Distance Computation

Say we have two data sets, with m and n observations, respectively. There
are a number of applications in which we need to compute the mn pairs
of distances between observations in one set and observations in the other.
(The two data sets will be assumed separate from each other here, but the
code could be adjusted if the sets are the same.)
Many clustering algorithms make use of distances, for example. These tend
to be complex, so in order to have a more direct idea of why distances are
important in many statistical applications, consider nonparametric regres-
sion.
Suppose we are predicting one variable from two others. For simplicity
of illustration, let’s use an example with concrete variables. Suppose we
are predicting human weight from height and age. In essence, this involves
expressing mean weight as a function of height and age, and then estimating
the relationship from sample data in which all three variables are known,
often called the training set. We also have another data set, consisting of
people for whom only height and age are known, called the prediction set;
this is used for comparing the performance of several models we ran on the
training set, without the possible overfitting problem.
In nonparametric regression, the relationship between response and predic-
tor variables is not assumed to have a linear or other parametric form. To
guess the weight of someone in the prediction set, known to be 70 inches
tall and 32 years old, we might look at people in our training set who are
within, say, 2 inches of that height and 3 years of that age. We would then
take the average weight of those people, and use it as our predicted weight
for the 70-inch tall, age 32 person in our prediction set. As a refinement,
we could give the people in our training sets who are very close to 70 inches
tall and 32 years old more weight in this average.
Either way, we need to know the distances from observations in our training
set to points in our prediction set. Suppose we have n people in our sample,
and wish to predict p new people. That means we need to calculate np
distances, exactly the setting described above. This could involve lots of
computation, so let's see how we can parallelize it, as shown in the next
section.
# arguments:
#    cls: cluster
#    x: data matrix
#    y: data matrix
#    dyn: TRUE means dynamic scheduling
#    chunk: chunk size
# return value:
#    full distance matrix, as pdist object

library(parallel)
library(pdist)

# process all rows in ichunk
dochunk <- function(ichunk, x, y) {
   pdist(x[ichunk,], y)@dist
}
> x
[,1] [,2]
[1,] 2 5
[2,] 4 3
> y
[,1] [,2]
[1,] 1 4
[2,] 3 1
The matrix of distances from the rows of x to the rows of y is then

\begin{pmatrix}
1.414214 & 4.123106 \\
3.162278 & 2.236068
\end{pmatrix}
\qquad (3.3)

The distance from row 1 of x to row 1 of y is $\sqrt{(1-2)^2 + (4-5)^2}$ =
1.414214, while that from row 1 of x to row 2 of y is $\sqrt{(3-2)^2 + (1-5)^2}$ =
4.123106. These numbers form row 1 of the distance matrix, and row 2 is
formed similarly.
The function pdist() computes the distance matrix, returning it as the
dist slot in an object of the class pdist:
> pdist(x,y)
An object of class "pdist"
Slot "dist":
[1] 1.414214 4.123106 3.162278 2.236068
attr(,"Csingle")
[1] TRUE
Slot "n":
[1] 2
Slot "p":
[1] 2
Slot ".S3Class":
[1] "pdist"
The list dists will contain the results of calling pdist() on the various
chunks. Each one will be an object of class pdist. We need to essentially
take them apart, combine the distance slots, then form a new object of
class pdist.
Since the dist slot in a pdist object contains row-by-row distances any-
way, we can simply use the standard R concatenate function c() to do the
combining. We then use new() to create a grand pdist object for our final
result.
If we simply wanted the distance matrix itself, we’d apply as.matrix() as
the last step in dochunk(), and not call new() in snowpdist().
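Here is a minimal sketch of snowpdist(), consistent with the description just
given; the chunking via splitIndices() and the exact slot values passed to
new() are assumptions, not the book's code:

snowpdist <- function(cls, x, y, chunk) {
   nch <- ceiling(nrow(x) / chunk)
   ichunks <- splitIndices(nrow(x), nch)
   # each worker computes the distances from its rows of x to all of y
   dists <- clusterApply(cls, ichunks, dochunk, x, y)
   # the dist slots hold row-by-row distances, so plain c() combines them
   alldists <- Reduce(c, dists)
   # wrap the combined distances in a single "grand" pdist object
   new("pdist", dist = alldists, n = nrow(x), p = nrow(y))
}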
3.9.2 Timings
> genxy
function (n, k)
{
x <<- matrix(runif(n * k), ncol = k)
y <<- matrix(runif(n * k), ncol = k)
}
> genxy(15000,20)
> system.time(pdist(x,y))
The 2-node cluster failed to yield a speedup. The 4-node system was faster,
but yielded a speedup of only about 1.36, rather than the theoretical value
of 4.0.
Overhead seemed to have a major impact here, so a larger problem was
investigated, with 50 variables instead of 20, and computing with up to 8
cores:
> genxy(15000,50)
> system.time(pdist(x,y))
user system elapsed
88.925 5.597 94.901
> system.time(snowpdist(c2,x,y,chunk=500))
user system elapsed
16.973 3.832 77.563
> system.time(snowpdist(c4,x,y,chunk=500))
user system elapsed
17.069 3.800 49.824
> system.time(snowpdist(c8,x,y,chunk=500))
user system elapsed
15.537 3.360 32.098
Here even use of only two nodes produced an improvement, and cluster
sizes of 4 and 8 showed further speedups.
3.10 The foreach Package

Yet another popular R tool for parallelizing loops is the foreach package,
available from the CRAN repository of contributed code. Actually foreach
is more explicitly aimed at the loops case, as seen from its name, evoking
for loops.
The package has the user set up a for loop, as in serial code, but then use
the foreach() function instead of for(). One must also make one more
small change, adding an operator, %dopar%, but that’s all the user must
do to parallelize his/her serial code.
Thus foreach has a very appealing simplicity. However, in some cases, this
simplicity can mask major opportunities for achieving speedup, as will be
seen in the example in the next section.
At the heart of the serial version of the example is a triply nested loop over
a matrix of links:

for (i in 1:(nr-1)) {
   for (j in (i+1):nr) {
      for (k in 1:nc)
         tot <- tot + links[i,k] * links[j,k]
   }
}
tot / nr
}
The original for loop with index i has now been replaced by foreach and
%dopar%:
foreach(i = 1:(nr-1)) %dopar% {
The user needs to also specify the platform to run on, the backend in fore-
ach parlance. This can be snow, multicore or various other parallel
software systems. This is the flexibility alluded to above—one can use the
same code on different platforms.
To see how this works, here is a function that performs a speed test of the
above code:
simfe <- function(nr, nc, ncores) {
   require(doMC)
   registerDoMC(cores=ncores)
   lnks <<- matrix(sample(0:1, (nr*nc), replace=TRUE),
      nrow=nr)
   print(system.time(mutoutfe(lnks)))
}
Here we’ve chosen to use the multicore backend. The package doMC
is designed for this purpose. We call registerDoMC() to set up a call
to multicore with the desired number of cores, and then when foreach
within mutoutfe() runs, it uses that multicore platform.
Let’s see how well it works:
> simfe (500 ,500 ,2)
u s e r system e l a p s e d
17.392 0.036 17.663
> simfe (500 ,500 ,4)
u s e r system e l a p s e d
52.900 0.176 13.578
> simfe (500 ,500 ,8)
u s e r system e l a p s e d
62.488 0.352 7.408
In general, then, one parallelizes such a loop by replacing its opening line

for (i in irange)

by

foreach(i = irange) %dopar%
3.11 Stride
In discussions of parallel loop computation, one often sees the word stride,
which refers to address spacing between successive memory accesses. Sup-
pose we have a matrix having m rows and n columns. Consider the effects
of, say, summing the elements in some column. If our storage uses column-
major order, the accesses will be one word apart in memory, i.e. they will
have stride 1. On the other hand, if we are using row-major storage, the
stride will be n.
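As a small R illustration of the access-pattern issue (R stores matrices in
column-major order; the size of the effect will depend on the machine's cache
and memory organization):

m <- 5000; n <- 2000
a <- matrix(rnorm(m*n), nrow=m)
# scanning down columns touches consecutive memory words (stride 1)
system.time(for (j in 1:n) sum(a[, j]))
# scanning across rows jumps m words per access (stride m), typically slower
system.time(for (i in 1:m) sum(a[i, ]))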
This is an issue in the context of memory bank structure (Section 2.5.1.1).
Typically, low-order interleaving is used, meaning that consecutive words
are stored in consecutive banks. If we have 4 banks, i.e. an interleaving
factor of 4, then the first word will be stored in Bank 0, the next three in
Banks 1, 2 and 3, then the next in Bank 0, and so on in a cyclical fashion.
The issue is avoiding bank conflicts. If say we have a stride of 1, then we can
potentially keep all banks busy at once, the best we can hope for. Suppose
on the other hand we have a stride of 4. This would be disastrous, as
all accesses would go to the same bank, destroying our chance for parallel
operation.
Even though we write our code at a high level, such as in R or C, it is
important to keep in mind what stride will be implied by the algorithms we
design. We may not know the bank interleaving factor of the machine our
code will be run on, but at least such issues should be kept in mind. And
in the case of GPU programming, truly maximizing speed may depend on
this.
If you are not interested in the mathematics, this subsection can easily be
skipped, but it may provide insight for those who stay.
Say we have n iterations, with times $t_1, ..., t_n$, handled by p processes in
static scheduling. Let $\pi$ denote a random permutation of (1,...,n), and set
$T_i = t_{\pi(i)}$, the time of the iteration assigned to slot i. Process i then
handles slots s through e, where

s = (i-1)c + 1 \qquad (3.5)

and

e = ic \qquad (3.6)

with

c = n/p \qquad (3.7)

Also define

\mu = \frac{1}{n} \sum_{i=1}^{n} t_i \qquad (3.8)

and

\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (t_i - \mu)^2 \qquad (3.9)
Note that these are not the mean and variance of some hypothesized parent
distribution. No probabilistic model is being assumed for the ti ; indeed,
they are not even being assumed random. So, µ and σ 2 are simply the
mean and variance of the set of numbers t1 , ..., tn .
Then $T_s, ..., T_e$ form a simple random sample (i.e. without replacement)
from $t_1, ..., t_n$. From finite-population sampling theory, the total
computation time $U_i$ for the i-th process has mean

c\mu \qquad (3.10)

and variance

(1-f) c \sigma^2 \qquad (3.11)

where f = c/n.

The coefficient of variation of $U_i$, i.e. its standard deviation divided by
its mean, is then

\frac{\sqrt{(1-f) c \sigma^2}}{c\mu} \to 0 \ \text{ as } c \to \infty \qquad (3.12)
The intuition behind the Random method is that in large problems, the
variation in total processing time from one process to another should be
small relative to its mean. This implies good load balance.
Simulation results by the author have shown that the Random method
generally performs fairly well. However, there are no “silver bullets” in the
parallel processing world. Note the following:
It is quite typical that either (a) the iteration times are known to be mono-
tonic or (b) the overhead for running a task queue is small, relative to task
times. In such cases, the Random method may not produce an improve-
ment. However, it’s something to keep in your loop scheduling toolkit.
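Here is a minimal sketch of the Random method in the setting of the
all-possible-regressions example from earlier in this chapter (chunk size 1
for simplicity; the object names are the ones used in that example):

set.seed(9999)                            # any seed; just for reproducibility
permtasks <- sample(1:length(allcombs))   # random permutation of the tasks
out <- clusterApply(cls, permtasks, dochunk, x, y, allcombs, 1)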
3.13 Debugging snow and multicore Code
Generally debugging any code is hard, but it is extra difficult with parallel
code. Just like a juggler, we have to be good at watching many things
happening at once!
Worse, one cannot use debugging tools directly, such as R’s built-in de-
bug() and browser() functions. This is because our worker code is not
running within a terminal/window environment. For the same reason, even
calls to print() won’t work.
So, let’s see what we can do.
One can still use browser() in a kind of tricked-up way, which will be
presented below. As it is a little clumsy, note that if you are using a
Unix-family system (Mac/Linux/Cygwin/etc.), the dbs() function in my
partools package (Section 3.5) automates the whole process for you! In
that case, you can happily ignore the following.
Here is an outline of the procedure, say for a cluster of 2 workers:
• That call will create the cluster, and then print out a message inform-
ing us at what IP address and port the manager is available.
• The workers will hit the browser() call, and we can then debug as
usual in the two windows.
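A simpler, if cruder, alternative is to have each worker write trace output to
a file, with a line of roughly this form (a sketch: the file name dbg and the
append setting come from the text below; the message contents are
illustrative):

cat("worker", myid, "reached iteration", i, "\n",
    file="dbg", append=TRUE)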
Note the append parameter. We can then inspect the file dbg from an-
other window.
One major drawback of this is that all the debugging output from the
various workers will be mixed together! This makes it hard to read, even if
one also prints an ID number for each worker. This too is remedied by the
partools package, via the function dbsmsg(), which has different workers
write to different files; this function is platform-independent.
Chapter 4

The Shared-Memory Paradigm: A Gentle Introduction via R
The familiar model for the shared memory paradigm (in the hardware sense)
is the multicore machine. The several parallel processes communicate with
each other by accessing memory (RAM) cells that they have in common
within a machine. This contrasts with message-passing hardware, in which
there are a number of separate, independent machines, with processes com-
municating via a network that connects the machines.
Shared-memory programming is considered by many in the parallel pro-
cessing community as being the clearest of the various paradigms available.
Since programming development time is often just as important as program
run time, the clear, concise form of the shared-memory paradigm can be a
major advantage.
Another type of shared-memory hardware is accelerator chips, notably
graphics processing units (GPUs). Here one can use one’s computer’s
graphics card not for graphics but for fast parallel computation of general,
non-graphics operations, say matrix multiplication.
Shared memory programming will be presented in three chapters. This
chapter will present an overview of the subject, and illustrate it with the
R package Rdsm. Though to get the most advantage from shared mem-
ory, one should program in C/C++, Rdsm enables one to achieve shared-memory
parallelism while staying in R.

4.1 So, What Is Actually Shared?
The term shared memory means that the processors all share a common
memory address space. Let’s see what that really means.
We won’t deal with machine language in this book, but a quick example
will be helpful. A processor will typically include several registers, which
are like memory cells but located inside the processor. In Intel processors,
one of the registers is named RAX.2 Note that on a multicore machine, each
core will have its own registers, so that for example each core will have its
own independent register named RAX.
Recall from Section 2.5.1.1 that the standard method of programming mul-
ticore machines is to set up threads. These are several instances of the
same program running simultaneously, with the key feature that they share
memory. To see what this means, suppose all the cores are running threads
from our program, and that the latter includes the Intel machine language
instruction
1 One can also use FORTRAN, but its usage is much less common in data science.
2 Some architectures are not register-oriented, but for simplicity we will assume a
register orientation here.
Such an instruction copies the contents of memory location 200 to the core's
RAX register.
Before continuing, it is worth asking the question, “Where did that 200
come from?” In our high-level language source code, say C, we may have
had a variable named z. The C compiler would have decided at which
memory address to store z, say 200, and may have translated one of our
lines of C code accessing z to the above machine instruction.
The same principles hold for interpreted languages with virtual machines
like R. If we have a variable w in our R code, the R interpreter will choose
a memory location for it, and we will in the end be executing machine
instructions like the one above.
Now, what happens when that machine instruction executes? Remember,
there is only one memory location 200, shared by all cores, but each core
has its own separate register set. If core 1 and core 4 happen to execute this
same instruction at about the same time, the contents of memory location
200 will be copied to both core 1’s RAX and core 4’s RAX in the above
example.
One technical issue that should be mentioned is that most machines today
use virtual addressing, as explained in Chapter 2. Location 200 is actually
mapped by the hardware to a different address, say 5208, during execution.
But since in our example the cores are running threads from the same
program, the virtual address 200 will map to that same location 5208, for
all of the cores. Thus whether one is talking about a virtual or physical
address, the key point is that all the cores are accessing the same actual
memory cell.
f <- function(x) {
   ...
   y <- 2
   z <- x + y
   ...
}
...
g <- function() {
   t <- 3
   v <- 6
   w <- f(v)
   u <- w + 1
}
What happens when the thread executes the call f(v) from within g()? The
internal R code (or the compiled machine code, in the case of C/C++) will
subtract 8 from SP, and then write 6 to the top of the (newly expanded)
stack, i.e., to the word currently pointed to by SP. Then to start execution
of f(), the code will do two things: (a) It will again decrement SP by 8,
and write to the stack the address of the code following the call, in this
case the code u <- w + 1 (to know where to come back later). (b) It will
jump to the area of memory where f() is stored, thus starting execution of
that function. The internal R code will then make space on the stack for
y, again by subtracting 8 from SP, and when y <- 2 is executed, 2 will be
written to y’s location in the stack. Any code making use of x will be able
to fetch it from within the stack, as it was placed there earlier.
The above is useful information in general, and will be discussed again later
in the book, but for our purposes right now, the point is this: Since each
core has a separate stack pointer, the stacks for the various threads will
be in different sections of memory. Thus my threads-for-R package Rdsm,
to be presented shortly, has been designed so that the local variable y
will have a separate, independent instantiation at each thread. An Rdsm
shared variable, by contrast, will have just one instantiation, readable and
writable by all threads, as we’ll see.
By the way, note too what happens when the function f() finishes execution
and returns. The internal code will clean up the stack, by moving SP back
to where it had been before writing the 6 and 2 to the stack. SP will now
point to a word that contains the address recorded in step (a) above, and
the code will now make a jump to that address, i.e., u <- w + 1 will be
executed, exactly what we want. So, g() resumes execution, and its own
local variables, such as t, are still available, as they are still on the stack,
from the time g() itself had been called.
globals are initially shared among the workers, but changes made by the workers to the
globals won’t be shared.
5 See Chandra, Rohit (2001), Parallel Programming in OpenMP, Kaufmann, pp.10ff
(especially Table 1.1), and Hess, Matthias et al (2003), Experiences Using OpenMP
Based on Compiler Directive Software DSM on a PC Cluster, in OpenMP Shared Memory
Parallel Programming: International Workshop on OpenMP Applications and Tools,
Michael Voss (ed.), Springer, p.216.
# receive from worker 2
y <- mpi.recv.Robj(tag=0, source=2)
What a difference! Now that x and y are shared by the processes, we can
access them directly, making our code vastly simpler.
Note carefully that we are talking about human efficiency here, not machine
efficiency. Use of shared memory can greatly simplify our code, with far less
clutter, so that we can write and debug our program much faster than we
could in a message-passing environment. That doesn’t necessarily mean our
program itself has faster execution speed. We may have cache performance
issues, for instance; we’ll return to this point later.
It will turn out, though, that Rdsm can indeed enjoy a speed advantage
over other parallel R packages for some applications. We’ll return to this
issue in Section 4.5.
To display a shared variable m, one must write

print(m[,])

rather than

print(m)

The latter just prints out the location of the shared memory object.
# matrix multiplication; the product u %*% v is
# computed on the snow cluster cls, and written
# in-place in w; w is a big.matrix object
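# a minimal sketch of mmulthread(), assembled from the code fragments
# discussed in the analysis below; the final return value of 0 (to
# avoid an expensive return of the big matrix) is an assumption
mmulthread <- function(u, v, w) {
   # rows of u to be handled by this thread
   myidxs <- splitIndices(nrow(u), myinfo$nwrkrs)[[myinfo$id]]
   # multiply this thread's chunk of u by v, writing the product
   # in place into the corresponding rows of the shared matrix w
   w[myidxs, ] <- u[myidxs, ] %*% v[, ]
   0
}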
# test on snow cluster cls
test <- function(cls) {
   # init Rdsm
   mgrinit(cls)
   # set up shared variables a, b, c
   mgrmakevar(cls, "a", 6, 2)
   mgrmakevar(cls, "b", 2, 6)
   mgrmakevar(cls, "c", 6, 6)
   # fill in some test data
   a[,] <- 1:12
   b[,] <- rep(1, 12)
   # give the threads the function to be run
   clusterExport(cls, "mmulthread")
   # run it
   clusterEvalQ(cls, mmulthread(a, b, c))
   print(c[,])   # not print(c)!
}
The printed output is the 6 × 6 product:

     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    8    8    8    8    8    8
[2,]   10   10   10   10   10   10
[3,]   12   12   12   12   12   12
[4,]   14   14   14   14   14   14
[5,]   16   16   16   16   16   16
[6,]   18   18   18   18   18   18
Here we first set up a two-node snow cluster c2. Remember, with snow,
clusters are not necessarily physical clusters, and can be multicore ma-
chines. For Rdsm the latter is the case.
The code test() is run as the snow manager. It creates shared variables,
then launches the Rdsm threads via snow’s clusterEvalQ().
4.4.2 Analysis
The call

clusterEvalQ(cls, mmulthread(a, b, c))

in test() will cause

mmulthread(a, b, c)
to run on all threads at once (though it probably won’t be the case that
all threads are running the same line of code simultaneously). Note that
we first needed to ship the function mmulthread() itself to the threads,
again because clusterEvalQ() runs our specified command in the context
of the environments at the threads.
It is crucial to keep in mind the sharing, e.g. of c[,]. The manager acquires
the key for a chunk of memory containing this variable and shares it with
the workers, via mgrmakevar(). The workers write to that memory, and
due to sharing—remember, sharing means they are all accessing the same
physical memory locations—the manager can then read it and print out
c[,].
Now, how does mmulthread() work? The basic idea is to break the rows
of the argument matrix u into chunks, and have each thread work on one
chunk.7 Say there are 1000 rows, and we have a quadcore machine (on
which we’ve set up a four-node snow cluster). Thread 1 would handle
rows 1-250, thread 2 would work on rows 251-500 and so on.
The chunks are assigned in the code
myidxs <-
   splitIndices(nrow(u), myinfo$nwrkrs)[[myinfo$id]]
calling the snow function splitIndices(). For example, the value of myidxs
at thread 2 will be 251:500. The built-in Rdsm variable myinfo is an R
list containing nwrkrs, the total number of threads, and id, the ID num-
ber of the thread executing the above displayed line. On thread 2 in our
example here, those numbers will be 4 and 2, respectively.
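A quick illustration of splitIndices() itself:

library(parallel)
idxs <- splitIndices(1000, 4)
length(idxs)           # 4 chunks, one per thread
sapply(idxs, length)   # about 250 row indices in each chunk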
The reader should note the “me, my” point of view that is key to threads
programming. Remember, each of the threads is (more or less) simulta-
neously executing mmulthread(). So, the code in that function must be
written from the point of view of a particular thread. That’s why we put
the “my” in the variable name myidxs. We’re writing the code from the
anthropomorphic view of imagining ourselves as a particular thread exe-
cuting the code. That thread is “me,” and so the row indices are “my”
indices, hence the name myidxs.
Each thread multiplies v by the thread’s own chunk of u, placing the result
in the corresponding chunk of w:
w[myidxs,] <- u[myidxs,] %*% v[,]
When they do so, the variable c in the call will be w within mmulthread(),
and thus references to w will again be via that same address, 0x105804ce0.
8 Readers who are well-versed in languages such as C may be interested in how the
address is actually used. Basically, in R the array-access operations are themselves func-
tions, such as the built-in function ”[”. As such, they can be overridden, as with opera-
tor overloading in C++, and bigmemory uses this approach to redirect expressions like
w[2,5] to shared memory accesses. An earlier version of Rdsm, developed independently
around the time bigmemory was being written, took the same approach.
As you can see, then, all of the threads are indeed sharing this matrix, as
is the manager, since they are all accessing this spot in memory. So for
example if any one of these entities writes to that shared object, the others
will see the new values.
A side note: “Traditionally,” R is a functional language, (mostly) free of
side effects. To explain this concept, consider a function call f(x). Any
change that f() makes to x does not change the value of x in the caller.
If it could change, this would be a side effect of the call, a commonplace
occurrence in languages such as C/C++ but not in R. If we do want x to
change in the caller, we must write f() to return the changed value of x, and
then in the caller, reassign it, e.g.

x <- f(x)
As seen above, the bigmemory package, and thus Rdsm, do produce side
effects.9
R has never been 100% free of side effects, e.g. due to use of the <<− oper-
ator, and the number of exceptions has been increasing. The bigmemory
and data.table packages are examples, as are R's new reference classes. The
motivation of allowing side effects is to avoid expensive copying of a large
object when one changes only one small component of it. This is especially
important for our parallel processing context; as mentioned earlier, needless
copying of large objects can rob a parallel program of its speed.
The Rdsm package includes instructions for saving a key to a file and then
loading it from another invocation of R on the same machine. The latter
will then be able to access the shared variable as well. For example, one
might write a Web crawler application, collecting Web data and storing it
in shared memory, and meanwhile monitor it interactively via a separate
R process.
We won’t do extensive timing experiments here, but let’s just check that
the code is indeed providing a speedup:
> n <- 5000
> m <- matrix(runif(n^2), ncol=n)
> system.time(m %*% m)
   user  system elapsed
9 Indeed, this is one of bigmemory's major user attractions, according to
bigmemory's authors.
4.4.6 Leveraging R
It was pointed out earlier that a good reason for avoiding C/C++ if possible
is to be able to leverage R’s powerful built-in operations. In this example,
we made use of R’s built-in matrix-multiply capability, in addition to its
ability to extract subsets of matrices.
This is a common strategy. To solve a big problem, we break it into smaller
ones of the same type, apply R’s tools to the small problems, and then
somehow combine to obtain the final result. This of course is a general
parallel processing design pattern, not just for R, but with a difference in
that here we need to find appropriate R tools. R is an interpreted language,
thus with a tendency to be slow, but its basic operations typically make use
of functions that are written in C, which are fast. Matrix multiplication is
such an operation, so our approach here does work well.
Note, though, that shared-memory access may involve hidden data copy-
ing. Each cache coherency transaction (Section 2.5.1.1) involves copying
of data, and if such transactions occur frequently, it can add up to large
amounts. Indeed, some of that copying may be unnecessary, say when a
cache block is brought in but never used much afterward. Thus shared-
memory programming is not necessarily a “win,” but it will become clear
below that it can be much faster for some applications, relative to other R
parallel packages such as snow, multicore, foreach and even Rmpi.
To see why, here is a version of mmulthread() using the snow package:
snowmmul <− function ( c l s , u , v ) {
require ( p a r a l l e l )
i d x s <− s p l i t I n d i c e s (nrow( u ) , length ( c l s ) )
mmulchunk <−
function ( idxchunk ) u [ idxchunk , ] %∗% v
r e s <− c l u s t e r A p p l y ( c l s , i d x s , mmulchunk )
Reduce ( rbind , r e s )
}
It turns out that snow is considerably slower than the Rdsm implemen-
tation, as seen in Table 4.1. The results are for various sizes of n × n
matrices, and various numbers of cores. The machine had 16 cores, with a
hyperthreading degree of 2 (Section 1.4.5.2).
One of the culprits is the line
Reduce(rbind, res)
in the snow version. This involves a lot of copying of data, and possibly
worse, multiple allocation of large matrices, greatly sapping speed. This
is in stark contrast to the Rdsm case, in which the threads directly write
their chunked-multiplication results to the desired output matrix. Note
that the Reduce() operation itself is done serially, and though we might
try to parallelize that too, that itself would require lots of copying, and
thus may be difficult to make work well.
This of course was not a problem particular to snow. The same Reduce()
operation or equivalent would be needed with multicore, foreach (using
the .combine option), Rmpi and so on.10 Rdsm, by writing the results
directly to the desired output, avoids that problem.
It is clear that there are many applications with similar situations, in which
tools like snow etc. do a lot of serial data manipulation following the paral-
lel phase. In addition, iterative algorithms, such as k-means clustering (Sec-
tion 4.9) involve repeated alternating between a serial and parallel phase.
Rdsm should typically give faster speed than do the others in these appli-
cations.
On the other hand, some improvement can be obtained by using unlist()
instead of Reduce(), writing the last line of snowmmul() as¹¹

matrix(unlist(res), ncol=ncol(v))
Using this approach, the snow times for 16- and 24-node clusters on a
3000 × 3000 matrix seen above were reduced to 11.600 and 13.792, respec-
tively (and were confirmed in subsequent runs not shown here).
The shared-memory vs. message-passing debate is a long-running one in the
parallel processing community. It has been traditional to argue that the
10 With multicore, we would have a little less copying, as explained in Section 3.7.1.
11 As suggested by M. Hannon.
shared-memory paradigm doesn’t scale well (Section 2.9), but the advent
of modern multicore systems, especially GPUs, has done much to counter
that argument.
indivisible pair, without having any other thread being able to act between
the two phases, thus eliminating the danger.
4.6.2 Locks
What we need to avoid race conditions is a mechanism that will limit ac-
cess to the critical section to only one thread at a time, known as mutual
exclusion. A common mechanism is a lock variable or mutex. Most thread
systems include functions lock() and unlock(), applied to a lock variable.
Just before a critical section, one inserts a call to lock(), and we follow the
section with a call to unlock(). Execution will work as follows.
Suppose the lock variable is already locked, due to some other thread cur-
rently being inside the critical section. Then the thread making the call
to lock() will block, meaning that it will just freeze up for the time being,
not returning yet. When the thread currently in the critical section finally
exits, it will call unlock(), and the blocked thread will now unblock: This
thread will enter the critical section, and relock the lock. (Of course, if
several threads had been waiting at the lock, only one will succeed, and the
others will continue waiting.)
To make this concrete, consider this toy example, in Rdsm. We’ve initial-
ized Rdsm as a two-thread system, c2, and set up a 1 × 1 shared variable
tot. Each thread simply adds 1 to the total, n times, so with two threads the
final value should be 2n.
# this function is not reliable; if 2 threads both try
# to increment the total at about the same time, they
# could interfere with each other
s <- function(n) {
   for (i in 1:n) {
      tot[1,1] <- tot[1,1] + 1
   }
}
library(parallel)
c2 <- makeCluster(2)
clusterExport(c2, "s")
mgrinit(c2)
mgrmakevar(c2, "tot", 1, 1)
tot[1,1] <- 0
clusterEvalQ(c2, s(1000))
tot[1,1]   # should be 2000, but likely far from it
I did two runs of this. On the first one, the final value of tot[1,1] was 1021,
while the second time it was 1017. Neither time did it come out 2000 as it
“should.” Moreover, the result was random.
The problem here is that the action

tot[1,1] <- tot[1,1] + 1

is not atomic. It consists of reading the current value of tot[1,1], adding 1
to it, and writing the sum back to the shared variable. If, say, tot[1,1] is
currently 227 and both threads read that value at about the same time, each
will compute 228 and each will write 228 back; one of the two increments is
lost. Here, tot[1,1] should be 229, but is only 228. No wonder in the
experiments above, the total turned out to fall far short of the correct
number, 2000.
But with locks, everything works fine. Continuing the above example, we
run the code
# here is the reliable version, surrounding the
# increment by lock and unlock, so only 1 thread
# can execute it at once
s1 <- function(n) {
   for (i in 1:n) {
      rdsmlock("totlock")
      tot[1,1] <- tot[1,1] + 1
      rdsmunlock("totlock")
   }
}
mgrmakelock(c2, "totlock")
tot[1,1] <- 0
clusterExport(c2, "s1")
clusterEvalQ(c2, s1(1000))
tot[1,1]   # will print out 2000, the correct number
4.6.3 Barriers
Once again, let's leverage the power of R. The zoo time series package
includes a function rollmean(w,k), which returns the means of all blocks of k
consecutive elements of w, i.e., what are usually called moving averages—just
what we need.
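A quick illustration of rollmean(), using the first few values of the test
vector that appears in the code below:

library(zoo)
rollmean(c(5, 7, 6, 20, 4), 2)
# [1]  6.0  6.5 13.0 12.0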
Here is the code:
# Rdsm code to find max burst in a time series;
# arguments:
#    x: data vector
#    k: block size
#    mas: scratch space, shared, 1 x (length(x)-1)
#    rslts: 2-tuple showing the maximum burst value,
#       and where it starts; shared, 1 x 2
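# a minimal sketch of maxburst(), assembled from the fragments
# discussed in the text below; the combining step (performed by
# thread 1) follows the description, but its exact code is an
# assumption
maxburst <- function(x, k, mas, rslts) {
   require(zoo)
   n <- length(x)
   # this thread's chunk of window start positions
   myidxs <- getidxs(n - k + 1)
   myfirst <- myidxs[1]
   mylast <- myidxs[length(myidxs)]
   # moving averages for this thread's windows, written into this
   # thread's section of the shared scratch variable mas
   mas[1, myfirst:mylast] <- rollmean(x[myfirst:(mylast + k - 1)], k)
   barr()   # wait until all threads have filled in their sections
   if (myinfo$id == 1) {   # one thread combines the results
      best <- which.max(mas[1, ])
      rslts[1, ] <- c(mas[1, best], best)
   }
   0   # don't do expensive return of result
}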
test <- function(cls) {
   require(Rdsm)
   mgrinit(cls)
   mgrmakevar(cls, "mas", 1, 9)
   mgrmakevar(cls, "rslts", 1, 2)
   x <<- c(5, 7, 6, 20, 4, 14, 11, 12, 15, 17)
   clusterExport(cls, "maxburst")
   clusterExport(cls, "x")
   clusterEvalQ(cls, maxburst(x, 2, mas, rslts))
   print(rslts[,])   # not print(rslts)!
}
The division of labor here involves assigning different chunks of the data to
different Rdsm threads. To determine the chunks, we could call snow’s
splitIndices() as before, but actually Rdsm provides a simpler wrap-
per for that, getidxs(), which we’ve called here, to determine where this
thread’s chunk begins and ends:
n <- length(x)
myidxs <- getidxs(n-k+1)
myfirst <- myidxs[1]
mylast <- myidxs[length(myidxs)]
We then call rollmean() on this thread's chunk, and write the results into
this thread's section of mas. When all the threads are done executing that
step, we will be ready to combine the results. But how will we know when
they're done? That's where the barrier comes in. We call barr() to make sure
everyone is done, and then designate one thread to combine the results found
by the threads.
Here is another example of the use of barriers, this one more involved, both
because the computation is a little more complex, and because we need two
variables this time.
Say we have a graph with an adjacency matrix
\begin{pmatrix}
0 & 1 & 0 & 0 \\
1 & 0 & 0 & 1 \\
0 & 1 & 0 & 1 \\
1 & 1 & 1 & 0
\end{pmatrix}
\qquad (4.1)
For example, the 1s in row 1, column 2 and row 4, column 1, signify that
there is an edge from vertex 1 to vertex 2, and one from vertex 4 to vertex
1. We'd like to transform this to a two-column matrix that displays the
edges directly:
\begin{pmatrix}
1 & 2 \\
2 & 1 \\
2 & 4 \\
3 & 2 \\
3 & 4 \\
4 & 1 \\
4 & 2 \\
4 & 3
\end{pmatrix}
\qquad (4.2)
For instance, the (4,3) in the last row means there is an edge from vertex
4 to 3, corresponding to the 1 in row 4, column 3 of the adjacency matrix.
# arguments:
#    adj: adjacency matrix
#    lnks: edges matrix; shared, nrow(adj)^2 rows
#       and 2 columns
#    counts: numbers of edges found by each thread;
#       shared; 1 row, length(cls) columns
#       (i.e. 1 element per thread)

# in this version, the matrix lnks must be created
# prior to calling findlinks(); since the number of
# rows is unknown a priori, one must allow for the
# worst case, nrow(adj)^2 rows; after the run, the
# number of actual rows will be in
# counts[1,length(cls)], so that the excess
# remaining rows can be removed
# d e t e r m i n e where t h e 1 s a r e i n t h i s t h r e a d ’ s
# p o r t i o n o f a d j ; f o r each row number i i n myidxs ,
# an e l e m e n t o f myout w i l l r e c o r d t h e column
# l o c a t i o n s o f t h e 1 s i n t h a t row , i . e . r e c o r d t h e
# edges out of v e r t e x i
myout <− apply ( a d j [ myidxs , ] , 1 ,
function ( onerow ) which ( onerow==1))
# t h i s t h r e a d w i l l now form i t s p o r t i o n o f l n k s ,
# s t o r i n g i n tmp
tmp <− matrix (nrow=0, ncol =2)
my1strow <− myidxs [ 1 ]
f o r ( i d x i n myidxs )
tmp <− rbind ( tmp , c o n v e r t 1 r o w ( idx ,
myout [ [ idx−my1strow + 1 ] ] ) )
# so, let's find cumulative edge sums, and
# place them in counts
nmyedges <-
   Reduce(sum, lapply(myout, length))   # my count
me <- myinfo$id
counts[1,me] <- nmyedges
barr()   # wait for all threads to write to counts
# determine where in lnks the portion of thread
# 1 ends; thread 2's portion of lnks begins
# immediately after thread 1's, etc., so we need
# cumulative sums, which we'll place in counts;
# we'll have thread 1 perform this task, though
# any thread could do it
if (me == 1) counts[1,] <- cumsum(counts[1,])
barr()   # others wait for thread 1 to finish
# this thread now places tmp in its proper
# position within lnks
mystart <- if (me == 1) 1 else counts[1,me-1] + 1
myend <- mystart + nmyedges - 1
lnks[mystart:myend,] <- tmp
0   # don't do expensive return of result
}
test <- function(cls) {
   require(Rdsm)
   mgrinit(cls)
   mgrmakevar(cls,"x",6,6)
   mgrmakevar(cls,"lnks",36,2)
   mgrmakevar(cls,"counts",1,length(cls))
   x[,] <- matrix(sample(0:1,36,replace=T), ncol=6)
   clusterExport(cls,"findlinks")
   clusterExport(cls,"convert1row")
   clusterEvalQ(cls,findlinks(x,lnks,counts))
   print(lnks[1:counts[1,length(cls)],])
}
The division of labor here involves assigning different chunks of rows of the
adjacency matrix to different Rdsm threads. We first partition the rows,
as before, then determine the locations of the 1s in this thread’s chunk of
rows:
myidxs <- getidxs(nr)
myout <- apply(adj[myidxs,], 1, function(rw) which(rw==1))
The R list myout will now give a row-by-row listing of the column numbers
of all the 1s in the rows of this thread’s chunk. Remember, our ultimate
output matrix, lnks, will have one row for each such 1, so the information
in myout will be quite useful.
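The helper convert1row() does the per-row conversion; a minimal sketch of what it might look like (my reconstruction, consistent with how it is called, not the book's exact code):

# one output row per 1 in row rownum of the adjacency matrix
convert1row <- function(rownum, colswith1s) {
   if (length(colswith1s) == 0) return(NULL)   # rbind() ignores NULL
   cbind(rownum, colswith1s)
}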
This function returns a chunk that will eventually go into lnks, specifically
the chunk corresponding to row rownum of adj. The code to form all such
chunks for our given thread is
tmp <- matrix(nrow=0, ncol=2)
my1strow <- myidxs[1]
for (idx in myidxs) tmp <-
   rbind(tmp, convert1row(idx, myout[[idx-my1strow+1]]))
Note that the code here must account for the fact that the information
for row number idx of adj is stored in element idx - my1strow + 1 of
myout.
Now that this thread has computed its portion of lnks, it must place it
there. But in order to do so, this thread must know where in lnks to start
writing. And for that, this thread needs to know how many 1s were found
by threads prior to it. If for instance thread 1 finds eight 1s and thread 2
finds three, then thread 3 must start writing at row 8 + 3 + 1 = 12 in lnks.
Thus we need to find the overall 1s counts (across all rows of a thread) for
each thread,
nmyedges <- Reduce(sum, lapply(myout, length))
and then need to find cumulative sums, and share them. To do this, we’ll
have (for instance) thread 1 find those sums, and place them in our shared
variable counts:
me <- myinfo$id
counts[1,me] <- nmyedges
barr()
if (me == 1) {
   counts[1,] <- cumsum(counts[1,])
}
barr()
Note the barrier calls just before and just after thread 1's cumulative-sum
step. The first call is needed because thread 1 can't start finding the
cumulative sums before all the individual counts are ready. The second
barrier is needed because all the threads will make use of the cumulative
sums, and we must be sure those sums are ready first. These are typical
examples of barrier use.
Now that our thread knows where in lnks to write its results, it can go
ahead:
mystart <- if (me == 1) 1 else counts[1,me-1] + 1
myend <- mystart + nmyedges - 1
lnks[mystart:myend,] <- tmp
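A small numerical check of this indexing (numbers made up for illustration):

# suppose threads 1, 2 and 3 found 8, 3 and 5 edges, respectively
counts <- cumsum(c(8,3,5))   # 8 11 16
# thread 1 writes rows 1:8 of lnks, thread 2 rows 9:11, thread 3 rows 12:16;
# e.g. for thread 3, mystart = counts[2] + 1 = 12, matching the 8 + 3 + 1 above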
A problem above is having to allocate the lnks matrix to handle the worst
case, thus wasting space and execution time. The problem is that we don't
know in advance the size of our "output," in this case the argument lnks.
In our little example above, the adjacency matrix was of size 4x4, while
the edges matrix was 8x2. We know the number of columns in the edges
matrix will be 2, but the number of rows is unknown a priori.
Note that the user can determine the number of “real” rows in lnks by
inspecting counts[1,length(cls)] after the call returns, as seen in the test
code. One could copy the “real” rows to another matrix, then deallocate
the big one.
One alternate approach would be to postpone allocation until we know
how big the lnks matrix needs to be, which we will know after the cumu-
lative sums in counts are calculated. We could have thread 1 then create
the shared matrix lnks, by calling bigmemory directly rather than us-
ing mgrmakevar(). To distribute the shared-memory key for this matrix,
thread 1 would save the bigmemory descriptor to a file, then have the
other threads get access to lnks by loading from the file.
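Here is a rough sketch of that alternative (names, types and the descriptor file are illustrative assumptions, not the book's code):

library(bigmemory)
if (myinfo$id == 1) {
   nedges <- counts[1, ncol(counts)]             # total edge count, now known
   lnks <- big.matrix(nedges, 2, type="integer")
   saveRDS(describe(lnks), "lnksdesc.rds")       # share the shared-memory descriptor
}
barr()
if (myinfo$id != 1) lnks <- attach.big.matrix(readRDS("lnksdesc.rds"))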
Actually, this problem is common in parallel processing applications. We
will return to it in Section 5.4.2.
(For convenience, we are still using Rdsm to set up the shared variables,
though we run in non-Rdsm code.)
Now try the parallel version:
1. For each data point, i.e., each row of our data matrix, determine
which centroid this point is closest to.
2. Assign the point to that centroid's group.
3. After all data points are processed in this manner, update the cen-
troids to reflect the current group memberships.
4. Go to the next iteration.
(If we have m variables, then the centroid of a group is the m-element
vector of means of those variables, taken over the current members of the
group.)
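To make the algorithm concrete before parallelizing it, here is a minimal serial sketch of a single iteration (illustration only; the function and its details are mine, not the book's code):

onekmeansiter <- function(x, cntrds) {
   k <- nrow(cntrds)
   # squared distance from each row of x to each centroid
   dsts <- sapply(1:k, function(j)
      rowSums((x - matrix(cntrds[j,], nrow(x), ncol(x), byrow=TRUE))^2))
   grp <- apply(dsts, 1, which.min)        # steps 1 and 2
   t(sapply(1:k, function(j)               # step 3
      if (any(grp == j)) colMeans(x[grp == j, , drop=FALSE])
      else x[sample(nrow(x), 1), ]))       # empty cluster: random row of x
}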
# initial centroids taken to be k randomly chosen
# rows of x; if a cluster becomes empty, its new
# centroid will be a random row of x

library(Rdsm)

# arguments:
#    x: data matrix; shared
#    k: number of clusters
#    ni: number of iterations
#    cntrds: centroids matrix; row i is centroid i;
#       shared, k by ncol(x)
#    cinit: optional initial values for centroids;
#       k by ncol(x)
#    sums: scratch matrix; sums[j,] contains count,
#       sum for cluster j; shared, k by 1+ncol(x)
#    lck: lock variable; shared
dsts <-
   matrix(pdist(myx, cntrds[,])@dist, ncol=nrow(myx))
nrst <- apply(dsts, 2, which.min)
# nrst[i] contains the index of the nearest
# centroid to row i in myx
tmp <- tapply(1:nrow(myx), nrst, mysum, myx)
# in the above, we gather the observations
# in myx whose closest centroid is centroid j,
# and find their sum, placing it in tmp[j];
# the latter will also have the count of such
# observations in its leading component;
# next, we need to add that to sums[j,],
# as an atomic operation
realrdsmlock(lck)
# j values in tmp will be strings, so convert
for (j in as.integer(names(tmp))) {
   sums[j,] <- sums[j,] + tmp[[j]]
}
realrdsmunlock(lck)
barr()   # wait for sums[,] to be ready
if (myinfo$id == 1) {
   # update centroids, using a random
   # data point if a cluster becomes empty
   for (j in 1:k) {
      # update centroid for cluster j
      if (sums[j,1] > 0) {
         cntrds[j,] <- sums[j,-1] / sums[j,1]
      } else cntrds[j,] <- x[sample(1:nx,1),]
   }
}
}
0   # don't do expensive return of result
}
test <- function(cls) {
   library(parallel)
   mgrinit(cls)
   mgrmakevar(cls,"x",6,2)
   mgrmakevar(cls,"cntrds",2,2)
   mgrmakevar(cls,"sms",2,3)
   mgrmakelock(cls,"lck")
   x[,] <- matrix(sample(1:20,12), ncol=2)
   clusterExport(cls,"kmeans")
   clusterEvalQ(cls,kmeans(x,2,1,cntrds,sms,"lck",
      cinit=rbind(c(5,5),c(15,15))))
}
test1 <- function(cls) {
   mgrinit(cls)
   mgrmakevar(cls,"x",10000,3)
   mgrmakevar(cls,"cntrds",3,3)
   mgrmakevar(cls,"sms",3,4)
   mgrmakelock(cls,"lck")
   x[,] <- matrix(rnorm(30000), ncol=3)
   ri <- sample(1:10000,3000)
   x[ri,1] <- x[ri,1] + 5
   ri <- sample(1:10000,3000)
   x[ri,2] <- x[ri,2] + 5
   clusterExport(cls,"kmeans")
   clusterEvalQ(cls,kmeans(x,3,50,cntrds,sms,"lck"))
}
Let’s first discuss the arguments of kmeans(). Our data matrix is x, which
is described in the comments as a shared variable (on the assumption that
it will often be such) but actually need not be.
By contrast, cntrds needs to be shared, as the threads repeatedly use it as
the iterations progress. We have thread 1 writing to this variable,
if (myinfo$id == 1) {
   for (j in 1:k) {
      if (sums[j,1] > 0) {
         cntrds[j,] <- sums[j,-1] / sums[j,1]
      } else cntrds[j,] <- x[sample(1:nx,1),]
   }
}
If cntrds were not shared, the whole thing would fall apart. When thread
1 would write to it, it would become a local variable for that thread, and
the new value would not become visible to the other threads. Note that as
in our previous examples, we store our function’s final result, in this case
cntrds, in a shared variable, rather than as a return value.
Once again our approach will be to break the data matrix into chunks of
rows. Each thread will handle one chunk, finding distances from rows in its
chunk to the current centroids. How is the above code preparing for this?
Note again the “me, my” point of view here, pointed out in Section 4.4 and
present in almost any threads function. The code here is written from the
point of view of a particular thread. So, the code first needs to determine
this thread’s rows chunk.
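Those chunking lines are not reproduced above; a plausible sketch, following the pattern of the earlier examples, is:

nx <- nrow(x)
myidxs <- getidxs(nx)
myx <- x[myidxs,]   # this thread's chunk, copied into an ordinary R matrix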
Why have this separate variable, myx? Why not just use x[myidxs,]?
First, having the separate variable results in less cluttered code. But sec-
ondly, repeated access to x could cause a lot of costly cache misses and
cache coherency actions.
Next we see another use of barriers:
if (is.null(cinit)) {
   if (myinfo$id == 1)
      cntrds[,] <- x[sample(1:nx,k,replace=F),]
   barr()
} else cntrds[,] <- cinit
We’ve set things up so that if the user does not specify the initial values of
the centroids, they will be set to k random rows of x. We’ve written the
code so that thread 1 performs this task, but we need the other threads
to wait until the task is done. If we didn’t do that, one thread might race
ahead and start accessing cntrds before it is ready. Our call to barr()
ensures that this won’t happen.
We have a similar use of a barrier at the beginning of the main loop:
if (myinfo$id == 1) {
   sums[,] <- 0
}
barr()   # no thread adds into sums before it has been zeroed
We need to compute the distances to the various centroids from all the rows
in this thread’s portion of our data:
dsts <-
   matrix(pdist(myx, cntrds[,])@dist, ncol=nrow(myx))
R’s pdist package comes to the rescue! This package, which we saw in
Section 3.9, finds all distances from the rows of one matrix to the rows
of another, exactly what we need. So, here again, we are leveraging R!
(Indeed, an alternate way to parallelize the computation from what we are
doing here would be to parallelize pdist(), say using Rdsm instead of
snow as before.)
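A tiny illustration of pdist (separate from the kmeans() code; values chosen for easy checking):

library(pdist)
X <- rbind(c(0,0), c(3,4))
Y <- rbind(c(0,1), c(6,8))
as.matrix(pdist(X, Y))   # element (i,j) is the distance from X[i,] to Y[j,]
# e.g. element (2,2) is 5, the distance from (3,4) to (6,8)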
Next, we leverage R’s which.min() function, which finds indices of minima
(not the minima themselves). We use this to determine the new group
memberships for the data points in myx:
nrst <- apply(dsts, 2, which.min)
Next, we need to collect the information in nrst into a more usable form,
in which we have, for each centroid, a vector stating the indices of all rows
in myx that now will belong to that centroid’s group. For each centroid,
we’ll also need to sum all such rows, in preparation for later averaging them
to find the new centroids.
Again, we can leverage R to do this quite compactly (albeit needing a bit
of thought):
mysum <- function(idxs, myx) {
   c(length(idxs), colSums(myx[idxs,,drop=F]))
}
...
tmp <- tapply(1:nrow(myx), nrst, mysum, myx)
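A small illustration of what this tapply() call produces (made-up values, not from an actual run):

myx  <- rbind(c(1,2), c(3,4), c(5,6))
nrst <- c(2, 1, 2)   # rows 1 and 3 nearest centroid 2, row 2 nearest centroid 1
tmp  <- tapply(1:nrow(myx), nrst, mysum, myx)
tmp[["1"]]   # 1 3 4: one observation, with column sums 3 and 4
tmp[["2"]]   # 2 6 8: two observations, with column sums 6 and 8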
But remember, all the threads are doing this! For instance, thread 1 is
finding the sum of its rows that are now closest to centroid 6, but thread
4 is doing the same. For centroid 6, we will need the sum of all such rows,
across all such threads.
In other words, multiple threads may be writing to the same row of sums
at about the same time. Race condition ahead! So, we need a lock:
lock(lck)
for (j in names(tmp)) {
   j <- as.integer(j)
   sums[j,] <- sums[j,] + tmp[[j]]
}
unlock(lck)
The for loop here is a critical section. Without the restriction, chaos could
result. Say for example two threads want to add 3 and 8 to a certain total,
respectively, and that the current total is 29. What could happen is that
they both see the 29, and compute 32 and 37, respectively, and then write
those numbers back to the shared total. The result might be that the new
total is either 32 or 37, when it actually should be 40. The locks prevent
such a calamity.
A refinement would be to set up k locks, one for each row of sums. As noted
earlier, locks sap performance, by temporarily serializing the execution of
the threads. Having k locks instead of one might ameliorate the problem
here.
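A sketch of that refinement (lock names and placement are my assumptions, following the rdsmlock() usage shown earlier in the chapter):

# setup, done by the manager, one lock per row of sums:
#    for (j in 1:k) mgrmakelock(cls, paste0("lck", j))
# inside kmeans(), lock only the row actually being updated:
for (j in as.integer(names(tmp))) {
   rdsmlock(paste0("lck", j))
   sums[j,] <- sums[j,] + tmp[[j]]
   rdsmunlock(paste0("lck", j))
}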
After all the threads are done with this work, we can have thread 1 compute
the new averages, i.e., the new centroids. But the key word in the last
sentence is “after.” We can’t let thread 1 do that computation until we are
sure that all the threads are done. This calls for using a barrier:
barr()
if (myinfo$id == 1) {
   for (j in 1:k) {
      if (sums[j,1] > 0) {
         cntrds[j,] <- sums[j,-1] / sums[j,1]
      } else cntrds[j,] <- x[sample(1:nx,1),]
   }
}
As noted earlier, the shared variable sums serves as storage for intermediate
results, not only sums of the data points in a group, but also their counts.
We can now use that information to compute the new centroids:
if (myinfo$id == 1) {
   for (j in 1:k) {
      # update centroid for cluster j
      if (sums[j,1] > 0) {
         cntrds[j,] <- sums[j,-1] / sums[j,1]
      } else cntrds[j,] <- x[sample(1:nx,1),]
   }
}
Let n denote the number of rows in our data matrix. With k clusters, we
have to compute nk distances per iteration, and then take n minima. So
the time complexity is O(nk).
This is not very promising for parallelization. In many cases O(n) (fixing
k here) does not provide enough computation to overcome overhead issues.
However, with our code here, there really isn’t much overhead. We copy
the data matrix just once,
myx <- x[myidxs,]
and thus avoid problems of contention for shared memory and so on.
It appears that we can indeed get a speedup from our parallel version in
some cases:
> x <- matrix(runif(100000*25), ncol=25)
> system.time(kmeans(x,10))   # kmeans(), base R
   user  system elapsed
  8.972   0.056   9.051
> cls <- makeCluster(4)
> mgrinit(cls)
> mgrmakevar(cls,"cntrds",10,25)
> mgrmakevar(cls,"sms",10,26)
> clusterExport(cls,"kmeans")
> mgrmakevar(cls,"x",100000,25)
> x[,] <- x
> system.time(clusterEvalQ(cls,
     kmeans(x,10,10,cntrds,sms,lck)))
   user  system elapsed
  0.000   0.000   4.086
A bit more than 2X speedup for four cores, fairly good in view of the above
considerations.
Chapter 5
The Shared-Memory
Paradigm: C Level
(At the time of this writing, OpenMP was not supported by clang, the default compiler on Macs.)
5.1 OpenMP
Here is the code, written without an R interface for the time being.
We will discuss it in detail below, but you should glance through it first.
As you do, note the pragma lines, such as
#pragma omp single
These are actually OpenMP directives, which instruct the compiler to insert
certain thread operations at that point.
For convenience, the code will assume that the time series values are non-
negative.
// OpenMP example program, Burst.c; burst() finds
// period of highest burst of activity in a time series

// arguments for burst()
// inputs:
//    x: the time series, assumed nonnegative
//    nx: length of x
//    k: shortest period of interest
// outputs:
//    startmax, endmax: pointers to indices of
//       the maximal-burst period
//    maxval: pointer to maximal burst value
   me = omp_get_thread_num();
   mymaxval = -1;
   #pragma omp for
   for (perstart = 0; perstart <= nx-k;
        perstart++) {
      for (perlen = k; perlen <= nx - perstart;
           perlen++) {
         perend = perstart+perlen-1;
         if (perlen == k)
            xbar = mean(x,perstart,perend);
         else {
            // update the old mean
            pl1 = perlen - 1;
            xbar =
               (pl1 * xbar + x[perend]) / perlen;
         }
         if (xbar > mymaxval) {
            mymaxval = xbar;
            mystartmax = perstart;
            myendmax = perend;
         }
      }
   }
   #pragma omp critical
   {
      if (mymaxval > *maxval) {
         *maxval = mymaxval;
         *startmax = mystartmax;
         *endmax = myendmax;
      }
   }
}
}
// here's our test code
nx = atoi(argv[2]);   // length of x
x = malloc(nx*sizeof(double));
for (i = 0; i < nx; i++)
   x[i] = rand() / (double) RAND_MAX;
double startime, endtime;
startime = omp_get_wtime();
// parallel
burst(x,nx,k,&startmax,&endmax,&maxval);
// back to single thread
endtime = omp_get_wtime();
printf("elapsed time: %f\n", endtime-startime);
printf("%d %d %f\n", startmax, endmax, maxval);
if (nx < 25) {
   for (i = 0; i < nx; i++) printf("%f ", x[i]);
   printf("\n");
}
}
One does need to specify to the compiler that one is using OpenMP. On
Linux, for instance, I compiled the code via the command

% gcc -g -o burst Burst.c -fopenmp -lgomp

and the source file must include the OpenMP header file:

#include <omp.h>
5.2.3 Analysis
OpenMP's scope rules give the programmer the ability to designate some
nonglobals as shared as well. (OpenMP also has other options for this,
which will not be covered here.)
Let's look at the next pragma:

#pragma omp single
The single pragma directs that one thread (whichever reaches this line
first) will execute the next block, while the other threads wait. In this case,
we are just setting nth, the number of threads, and since the variable is
shared, only one thread need set it.
As mentioned, the other threads will wait for the one executing that single
block. In other words, there is an implied barrier right after the block. In
fact, OpenMP inserts invisible barriers after all parallel, for and sections
pragma blocks. In some settings, the programmer knows that such a barrier
is unnecessary, and can use the nowait clause to instruct OpenMP to not
insert a barrier after the block:
me = omp_get_thread_num();
Note again that me was declared inside the parallel pragma block, so that
each thread will have a different, independent version of this variable—
which of course is exactly what we need.
Unlike most of our earlier examples, the code here does not break our data
into chunks. Instead, the workload is partitioned among the threads in a
different way. Here is how. Look at the nested loop,

for (perstart = 0; perstart <= nx-k; perstart++) {
   for (perlen = k; perlen <= nx - perstart; perlen++) {
      ...
   }
}
The outer loop iterates over all possible starting points for a burst period,
while the inner loop iterates over all possible lengths for the period. One
natural way to divide up the work among the threads is to parallelize the
outer loop. The for pragma does exactly that:

#pragma omp for
This pragma says that the following for loop will have its iterations divided
among the threads. Each thread will work on a separate set of iterations,
thus accomplishing the work of the loop in parallel. (Clearly, a requirement
is that the iterations must be independent of each other.) One thread will
work on some values of perstart, a second thread will work on some other
values, and so on.
Note that we won’t know ahead of time which threads will handle which
loop iterations. We’ll have more on this below, but the point is that there
will be some partitioning done by the OpenMP code, thus parallelizing the
computation. Of course, a for pragma is meaningless if it is not inside a
parallel block, as there would be no threads to assign the iterations to.
The way we've set things up here, the inner loop,

for (perlen = k; perlen <= nx - perstart; perlen++)

does not have its work partitioned among threads. For any given value of
perstart, all values of perlen will be handled by the same thread.
So, each thread will keep track of its own record values, i.e., the location
and value of the maximal burst it has found so far. In the end, each thread
updates the overall record values, in the omp critical block at the end of
burst() shown above.
One can set the number of threads either before or during execution. For
the former, one sets the OMP_NUM_THREADS environment variable, e.g.

export OMP_NUM_THREADS=8

in the bash shell. To set the count during execution, one calls
omp_set_num_threads(); a related call,

omp_set_dynamic(0)

tells the OpenMP runtime not to adjust the number of threads on its own.
5.2.6 Timings

# threads      time
 2        18.543343
 4        11.042197
 8         6.170748
16         3.183520
You may have noticed that we have a potential load balance problem in
the above maximal-burst example. Iterations that have a larger value of
perstart do less work. In fact, the pattern here is very similar to that of
our mutual outlinks example, in which we first mentioned the load balance
issue (Section 1.4.5.2). Thus the manner in which iterations are assigned
to threads may make a big difference in program speed.
So far, we haven’t discussed the details of how the various iterations in a
loop are assigned to the various threads. Back in Section 3.1, we discussed
general strategies for doing this, and OpenMP offers the programmer sev-
eral options along those lines.
The type of scheduling is specified via the schedule clause in a for pragma,
e.g.

#pragma omp for schedule(static)

and

#pragma omp for schedule(dynamic)
Again, for most looping applications this won’t be necessary. But for com-
plicated algorithms with dynamic work queues, work stealing may produce
a performance boost.
Let’s see how the example in Section 4.8 can be implemented in OpenMP.
(It is recommended that the reader review the R version of this algorithm
before continuing. The pattern used below is similar, but a bit harder to
follow in C, which is a lower-level language than R.)
// AdjMatXform.c

// takes a graph adjacency matrix for a directed
// graph, and converts it to a 2-column matrix of
// pairs (i,j), meaning an edge from vertex i to
// vertex j; the output matrix must be in
// lexicographical order

// transgraph() does the work
// arguments:
{  int chunksize = n / nth;
   myrange[0] = me * chunksize;
   if (me < nth-1)
      myrange[1] = (me+1) * chunksize - 1;
   else myrange[1] = n - 1;
}
Before we begin, note that parallel C/C++ code involving matrices typi-
cally is written in one dimension, as follows:
Consider a 3x8 array x. Since row-major order (recall Section 2.3) is used in
C/C++, the array is stored internally in 24 consecutive words of memory,
in row-by-row order. Keep in mind that C/C++ indices start at 0, not 1 as
in R. The element in the second row and fifth column of the array is thus
element (1,4), and it is stored 8 * 1 + 4 = 12 words past the start of the
array. In general, element (i,j) is stored in word

8 * i + j

of the array.
In writing generally-applicable code, we typically don’t know at compile
time how many columns (8 in the little example above) our matrix has. So
it is typical to recognize the linear nature of the internal storage, and use
it in our C code explicitly, e.g.
if (adjm[n*i+j] == 1) {
   adjm[n*i+(tot1s++)] = j;
The memory allocation issue has popped up again, as it did in the Rdsm
implementation. Recall that in the latter, we allocated memory for an
output of size equal to that of the worst possible case. In this case, we
have chosen to allocate memory during the midst of execution, rather than
allocating beforehand, with an array num1s that will serve the following
purpose.
Note that if some row in the input matrix contains, say, five 1s, then this
row will contribute five rows in the output. We calculate such information
for each input row, placing this information in the array num1s:
Once that array is known, we find its cumulative values. These will inform
each thread as to where that thread will write to the output matrix, and
also will give us the knowledge of how large the output matrix will be. The
latter information is used in the call to the C library memory allocation
function malloc():
}
*nout = cumul1s[n];
outm = malloc(2*(*nout) * sizeof(int));
}
Note that implied and explicit barriers are used in this program. For in-
stance, consider the second single pragma:
...
}
num1s[i] = tot1s;
}
#pragma omp barrier
#pragma omp single
{
cumul1s[0] = 0; // cumul1s[i] will be tot 1s before row i of adjm
// now calculate where the output of each row in adjm
// should start in outm
for (m = 1; m <= n; m++) {
cumul1s[m] = cumul1s[m-1] + num1s[m-1];
}
*nout = cumul1s[n];
outm = malloc(2*(*nout) * sizeof(int));
}
for (i = myrows[0]; i <= myrows[1]; i++) {
outrow = cumul1s[i];
...
The num1s array is used within the single pragma, but computed just
before it. We thus needed to insert a barrier before the pragma, to make
sure num1s is ready.
// transgraph() does this work
// arguments:
//    adjm: the adjacency matrix (NOT assumed
//       symmetric), 1 for edge, 0 otherwise;
//       note: matrix is overwritten
//    np: pointer to number of rows and
//       columns of adjm
//    nout: output, number of rows in
//       returned matrix
//    outm: the converted matrix
We could have a main() function here, but instead will be calling the code
from R, as will be seen shortly.
In writing a C file y.c containing a function f() that we'll call from R, one
can compile it using R from a shell command line:

% R CMD SHLIB y.c

after which one can call f() from R in some manner, such as .C() or .Call().
We've written the code above to be compatible with the simpler interface,
.C(), which takes the form

> .C("f", our arguments here)
A more complex but more powerful call form, .Call() is also available, to
be discussed below.
#include <R.h>
Generally the good thing about compiling via R CMD SHLIB is that we
don’t have to worry where the header file is, or worry about the library
files. But things are a bit more complicated if one’s code uses OpenMP, in
which case we must so inform the compiler. We can do this by setting the
proper environment variable. For C code and the bash shell, for instance,
we would issue the shell command
% export SHLIB_OPENMP_CFLAGS=-fopenmp
Here is a sample run, again in the R interactive shell, with the C file being
AdjMatXformForR.c:
n <- 5
dyn.load("AdjMatXformForR.so")
a <- matrix(sample(0:1,n^2,replace=T), ncol=n)
out <- .C("transgraph", as.integer(a), as.integer(n),
   integer(1), integer(2*n^2))
• The return value must be of type void, and in fact return values are
passed via the arguments, in this case nout (the number of rows in
the output matrix) and outm (the output matrix itself).
Concerning that last point, there is no longer any reason to have our C code
allocate memory for the output matrix, as it did in Section 5.4. Here we
set up that matrix to have worst-case size before the call, as we did in the
Rdsm version.
So, here is a test run:
> n <- 5
> dyn.load("AdjMatXformForR.so")
> a <- matrix(sample(0:1,n^2,replace=T),ncol=n)
> out <-.C("transgraph",as.integer(a),as.integer(n),
+ integer(1),integer(2*n^2))
> out
[[1]]
[1] 0 0 0 1 0 1 3 0 4 1 3 4 0 0 3 4 1 0 0 4 1 1 0 1 1
[[2]]
[1] 5
[[3]]
[1] 14
[[4]]
 [1] 1 1 1 1 2 2 2 3 4 4 5 5 5 5 0 0 0 0 0 0 0 0 0 0 0 1 2 4 5 1 4 5 1 2 5 1 2 4
[39] 5 0 0 0 0 0 0 0 0 0 0 0
As you can see, the return value of .C() is an R list, with one element for
each of the arguments to transgraph(), including the output arguments.
Note that by default, all input arguments are duplicated, so that any
changes to them are visible only in the output list, not the original ar-
guments. Here out[[1]] is different from the input matrix a:
> a
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    1    0    1    1
[2,]    1    0    0    1    1
[3,]    1    0    0    0    0
[4,]    0    1    0    0    1
[5,]    1    1    0    1    1
Duplication of the data might impose some slowdown, and can be disabled,
but this usage is discouraged by the R development team.
Our output matrix, out[[4]], is hard to read in its linear form. Let’s display
it as a matrix, keeping in mind that our other output variable, out[[3]],
tells us how many (real) rows there are in our output matrix:
> (nout <- out[[3]])
[1] 14
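One way to do that reshaping (a sketch; the first few rows are shown, matching the edges out of row 1 of a):

> m <- matrix(out[[4]], ncol=2)[1:nout,]   # keep only the real rows
> head(m, 3)
     [,1] [,2]
[1,]    1    1
[2,]    1    2
[3,]    1    4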
5.5.3 Analysis
So, what has changed in this version? Most of the change is due to the
differences between R and C.
Most importantly, the fact that R uses column-major storage for matrices
while C uses row-major order (Section 2.3) means that much of our new
code must "reverse" the old code. For example, the line

outm[2*(outrow+j)+1] = adjm[n*i+j];

from the standalone version becomes, in the R-callable version,

int n2 = n * n;
...
outm[outrow+j+n2] = adjm[n*j+i] + 1;

(the added 1 adjusts for R's 1-based indexing).
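If the column-major layout is unfamiliar, a quick check in R makes it concrete (illustration only):

> m <- matrix(1:6, nrow=2)   # R fills column by column
> m
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
> c(m)                       # the internal, linearized order
[1] 1 2 3 4 5 6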
The other major way to call C/C++ code from R is via the .Call() function.
It is considered more advanced than .C(), but is much more complex. That
complexity can be largely hidden by the Rcpp package, which we use in the
version below.
// the function transgraph() does the work
// arguments:
//    adjm: the adjacency matrix (NOT assumed
//       symmetric), 1 for edge, 0 otherwise;
//       note: matrix is overwritten
//       by the function
// return value: the converted matrix
// finds the chunk of rows this thread will process
void findmyrange(int n, int nth, int me, int *myrange)
{  int chunksize = n / nth;
   myrange[0] = me * chunksize;
   if (me < nth-1)
      myrange[1] = (me+1) * chunksize - 1;
   else myrange[1] = n - 1;
}
   #pragma omp parallel
   {  int i, j, m;
      int me = omp_get_thread_num(),
          nth = omp_get_num_threads();
      int myrows[2];
      int tot1s;
      int outrow, num1si;
      #pragma omp single
      {
         num1s = (int *) malloc(n*sizeof(int));
         cumul1s = (int *) malloc((n+1)*sizeof(int));
      }
      findmyrange(n, nth, me, myrows);
      for (i = myrows[0]; i <= myrows[1]; i++) {
         // number of 1s found in this row
         tot1s = 0;
         for (j = 0; j < n; j++)
            if (xadjm(i,j) == 1) {
               xadjm(i,(tot1s++)) = j;
            }
         num1s[i] = tot1s;
      }
      #pragma omp barrier
      #pragma omp single
      {
         // cumul1s[i] will be tot 1s before row
         // i of xadjm
         cumul1s[0] = 0;
         // now calculate where the output of each
         // row in adjm should start in outm
         for (m = 1; m <= n; m++) {
            cumul1s[m] = cumul1s[m-1] + num1s[m-1];
         }
      }
      for (i = myrows[0]; i <= myrows[1]; i++) {
         // current row within outm
         outrow = cumul1s[i];
         num1si = num1s[i];
         for (j = 0; j < num1si; j++) {
            outm(outrow+j,0) = i + 1;
            outm(outrow+j,1) = xadjm(i,j) + 1;
         }
      }
   }
   return outmshort;
}
We will still run R CMD SHLIB to compile, but we have more libraries
to specify in this case. In the bash shell, we can run
export R_LIBS_USER=/home/nm/R
export PKG_LIBS="-lgomp"
export PKG_CXXFLAGS="-fopenmp -I/home/nm/R/Rcpp/include"
That first command lets R know where our R packages are, in this case the
Rcpp package. The second states that we need to link in the gomp library,
which is for OpenMP, and the third both tells the compiler to watch for
OpenMP pragmas and tells it where to find the Rcpp header files.
Note that that last export assumes our source code is in C++, as indicated
below by a .cpp suffix to the file name. Since C is a subset of C++, our
code can be pure C but we are presenting it as C++.
We then run R CMD SHLIB on the .cpp file as before, then call the function
from R via .Call(); part of the resulting edges matrix from a sample run is
shown here:
[3 ,] 1 3
[4 ,] 2 1
[5 ,] 2 2
[6 ,] 2 4
[7 ,] 3 1
[8 ,] 3 2
[9 ,] 4 1
[10 ,] 4 4
Sure enough, we do use .Call() instead of .C(). And note that we have
only one argument here, m, rather than five as before, and that the result is
actually in the return value, rather than being in one of the arguments. In
other words, even though .Call() is more complex than .C(), use of Rcpp
makes everything much simpler than under .C(). In addition, Rcpp allows
us to write our C/C++ code as if column-major order were used, consistent
with R. No wonder Rcpp has become so popular!
The heart of using .Call(), including via Rcpp, is the concept of the SEXP
("S-expression," alluding to R's roots in the S language). In R internals,
a SEXP is a pointer to a C struct containing the given R object and in-
formation about the object. For instance, the internal storage for an R
matrix will consist of a struct that holds the elements of the matrix and
its numbers of rows and columns. It is this encapsulation of data and
metadata into a struct that enabled us to have only a single argument in
the new version of transgraph():
RcppExport SEXP transgraph(SEXP adjm)
The term RcppExport will be explained shortly. But first, note that both
the input argument, adjm, and the return value are of type SEXP. In other
words, the input is an R object and the output is an R object. In our run
example above,
> .Call("transgraph", m)
the input was the R matrix m, and the output was another R matrix.
The machinery in .Call() here is set up for C, and C++ users (including
us in the above example) need a line like
extern "C" transgraph;
in the C++ code. The RcppExport term is a convenience for the pro-
grammer, and is actually
#define RcppExport extern "C"
Now, let’s see what other changes have been made. Consider these lines:
Rcpp::NumericMatrix xadjm(adjm);
n = xadjm.nrow();
int n2 = n*n;
Rcpp::NumericMatrix outm(n2,2);
Rcpp has its own vector and matrix types, serving as a bridge between
those types in R and corresponding arrays in C/C++. The first line above
creates an Rcpp matrix xadjm from our original R matrix adjm. (Ac-
tually, no new memory space is allocated; here xadjm is simply a pointer
to the data portion of the struct where adjm is stored.) The encapsula-
tion mentioned earlier is reflected in the fact that Rcpp matrices have the
built-in method nrow(), which we use here. Then we create a new n2 × 2
Rcpp matrix, outm, which will serve as our output matrix. As before, we
are allowing for the worst case, in which the input matrix consists of all 1s.
Rcpp really shines for matrix code. Recall the discussion at the beginning
of Section 5.4.2. In our earlier versions of this adjacency matrix code, both
in the standalone C and R-callable versions, we were forced to use one-
dimensional subscripting in spite of working with two-dimensional arrays,
e.g.
if (adjm[n*i+j] == 1) {
This was due to the fact that ordinary two-dimensional arrays in C/C++
must have their numbers of columns declared at compile time, whereas
in this application such information is unknown until run time. This is
not a problem with object-oriented structures, such as those in the C++
Standard Template Library (STL) and Rcpp.
So now with Rcpp we can use genuine two-dimensional indexing, albeit
with parentheses instead of brackets:

if (xadjm(i,j) == 1) {
As before, we allocated space for outm to allow for the worst case, in which
n2 rows were needed. Typically, there are far fewer than n2 1s in the matrix
adjm, so the last rows in outm are filled with 0s. Here we copy the nonzero
rows into a new Rcpp matrix outmshort, and then return that.
All in all, Rcpp made our code simpler and easier to write: We have
fewer arguments, arguments are in explicit R object form, we don’t need to
deal with row-major vs. column-major order, and our results come back in
exactly the desired R object, rather than as one component of a returned
R list.
5.6 Speedup in C
So, let’s check whether running in C can indeed do much better than R in
a parallel context, as discussed back in Section 1.1.
> n <- 10000
Gathering our old timings, the various methods are compared in Table 5.3.
Inspecting Table 5.3, we see that going from serial R to parallel R cut down
run time by about 72%, while the corresponding figure for OpenMP was
88%. To be sure, the OpenMP version was actually more than twice as
fast as the parallel R one. But relative to the serial R code, the move to C
yielded only a modest improvement over parallel R.
We thus see here a concrete illustration of the Principle of Pretty Good
Parallelism introduced in Section 1.1: Running in C can indeed pay off, if
we are willing to devote the development time, but that payoff may not be
worth the effort.
It has been mentioned several times in this book that cache coherency
transactions (and virtual memory paging) can really compromise perfor-
mance. Coupling that with the point, made in Section 2.3.4, that different
designs of the same code can have quite different memory access patterns
and thus quite different cache performance, we see that we must be mindful
of such issues when we write shared-memory code. And remember, the
problem is especially acute in multicore settings, due to cache coherency
issues (Section 2.5.1.1).
To make this idea concrete, we’ll look at two OpenMP programs to do
in-place matrix transpose. Here’s the first:
// CacheRow.c

int *m;

...

double startime, endtime;
startime = omp_get_wtime();
transp(m,n);
endtime = omp_get_wtime();
printf("elapsed time: %f\n", endtime-startime);
if (n <= 10) {
   for (i = 0; i < n; i++) {
      for (j = 0; j < n; j++)
         printf("%d ", m[n*i+j]);
      printf("\n");
   }
}
}
in the same block. Yet now an access to y will trigger an unnecessary and
expensive cache coherency operation, since y is in a “bad” block.
One could avoid such a calamity by placing padding between our declarations
of x and y, say an unused array large enough to fill out a cache block.
If our cache block size is 512 bytes, i.e., 64 8-byte integers, then y should
be 512 bytes past x in memory, hence not in the same block.
Here we again revisit our mutual outlinks problem, from Section 1.4. In
this case, we’ll compute inbound links, partly for variety but also to make
a point about caches in Section 5.9.2.
// MutInlinks.cpp

// input is a graph adjacency matrix, element (i,j)
// being 1 or 0, depending on whether there is an
// edge from vertex i to vertex j

   // set number of threads
   int nthreads = INTEGER(nth)[0];
   omp_set_num_threads(nthreads);
   // simplest approach
   int tot = 0, i;
   #pragma omp parallel for reduction(+:tot)
   for (i = 0; i < nc; i++)
      tot += do_one_i(xadj, i);
   return Rcpp::wrap(tot);
}
Note the need to write 2 in the call as as.integer(2). This is not an Rcpp
issue; instead, the problem is that R treats the constant 2 as having type
double.
5.9.1.3 Analysis
In this pragma, note first that our for clause has in this case been accom-
panied by the parallel and reduction clauses. The former is there simply
to save typing. The code
#pragma omp parallel
...
#pragma omp for
Without the reduction clause, multiple threads might try to update tot
at the same time, thus risking a race condition (Section 4.6.1). We need
the statement tot += do_one_i(xadj,i); to be executed atomically.
We could make use of OpenMP’s critical pragma (Section 5.2.1) to avoid
this, but what’s nice is that OpenMP does all that for us, behind the scenes,
when we specify reduction. OpenMP would set up independent copies of
tot for the various threads, and then add them atomically to the “real” tot
when exiting the loop, but again, we need not worry about this.
Any time a program is found to be slow, the first suspect is cache behavior.
Often that suspicion is valid.
Recall from Section 2.6 that when a thread starts a timeslice on a core, the
cache at that core may not contain anything useful to that thread. The
5.10. DEBUGGING 153
cache contents will then have to be built up, causing a lot of cache misses
for a while, thus slowing things down.
It is thus desirable to be able to assign certain threads to certain cores,
known as specifying processor affinity. If you use the gcc compiler, for
instance, you can set the environment variable GOMP_CPU_AFFINITY for
this. Or, from within an OpenMP program you can call sched_setaffinity().
Check the documentation for your system for details.
5.10 Debugging
Most debugging tools have the capability to follow specific threads. We’ll
use GDB here.
First, as you run a program under GDB, the creation of new threads will
be announced, e.g.
(gdb) r 100 2
Starting program: /debug/primes 100 2
[New Thread 16384 (LWP 28653)]
[New Thread 32769 (LWP 28676)]
[New Thread 16386 (LWP 28677)]
[New Thread 32771 (LWP 28678)]
You can do backtrace (bt) etc. as usual. Two useful threads-related
commands are info threads, which lists the current threads, and thread n,
which switches the debugger's attention to thread n.
% R -d gdb
GNU gdb (GDB) 7.5.91.20130417-cvs-ubuntu
...
(gdb) b burst
Function "burst" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
• Due to work stealing (Section 5.3.2), and possibly better cache be-
havior, TBB code in some cases yields better performance.
Concerning this last bullet item, TBB requires one to make use of C++
functors, which are function objects taking the form of a struct or class.
We will use these in the Thrust context in Chapter 7, but functors are just
the beginning of the added complexity of TBB.
For instance, take reduction. A function tbb::parallel_reduce() is avail-
able in TBB, but it requires not only defining a "normal" functor, but also
defining a second function within the struct or class, named join().
Other than using TBB indirectly via Thrust, this book will not cover TBB.
However, if you are a good C++ programmer, you may find TBB structure
interesting and powerful. The principles of OpenMP covered in this chapter
should provide a good starting point for you.
Bear in mind that locks and barriers are “necessary evils.” We do need
them (or something equivalent) to ensure correct execution of our program,
but they slow things down. For instance, we say that lock variables, or the
critical sections they guard, serialize a program in the section they are used,
i.e., they change its parallel character to serial; only one thread is allowed
into the critical section at a time, so that execution is temporarily serial.
And contention for locks can cause lots of cache coherency transactions,
definitely putting a damper on performance. Thus one should always try
to find clever ways to avoid locks and barriers if possible.
One way to do this is to take advantage of the hardware. Modern processors
typically include a variety of hardware assists to make synchronization more
efficient.
For example, Intel machines allow a machine instruction to be prefixed by
a special byte called a lock prefix. It orders the hardware to lock up the
system bus while the given instruction is executing—so that the execution
156 CHAPTER 5. SHARED-MEMORY: C
is atomic. (The fact that this prefix, a hardware operation, is named lock
should not be confused with lock variables in software.)
Under the critical section approach, code to atomically add 1 to y would
look something like this:

#pragma omp critical
{
   y++;
}
OpenMP includes an atomic pragma, which we’d use in the above example
via this code:
#pragma omp atomic
y++;
This instructs the compiler to try to find a hardware construct like the
lock prefix above to implement mutual exclusion, rather than taking the
less efficient critical section route.
Also, the C++ Standard Template Library contains related constructs,
such as the function fetch_add(), which again instructs the compiler to
attempt to find an atomic hardware solution to the update-total example
above. This idea has been advanced even further in C++11.
Chapter 6
The Shared-Memory
Paradigm: GPUs
6.1 Overview
The video game market is so lucrative that the industry has developed ever-
faster graphics cards, in order to handle ever-faster and ever-more visually
detailed video games. These actually are parallel processing hardware de-
vices, so around 2003 some people began to wonder if one might use them
for parallel processing of nongraphics applications. Such programming was
called GPGPU, general programming on graphics processing units, later
shortened to simply GPU programming.
Originally this was cumbersome. One needed to figure out clever ways of
mapping one’s application to some kind of graphics problem, i.e., ways of
disguising one’s problem so that it appeared to be doing graphics computa-
tions. Though some high-level interfaces were developed to automate this
transformation, effective coding required some understanding of graphics
principles.
But current-generation GPUs separate out the graphics operations, and
now consist of multiprocessor elements that run under the familiar shared-
memory threads model. Granted, effective coding still requires an intimate
knowledge of the hardware, but at least it’s (more or less) familiar hardware,
not requiring knowledge of graphics.
Moreover, unlike a multicore machine, with the ability to effectively run just
a few threads at one time, e.g. four threads on a quad core machine, GPUs
can run hundreds or thousands of threads well at once. There are various
restrictions that come with this, but you can see that there is fantastic
potential for speed here.
We will focus on NVIDIA’s line of GPUs here. (For brevity, the presentation
here will often refer to “GPUs” rather than “NVIDIA GPUs,” but the latter
is implied.) They are programmed in an extension of C/C++ called CUDA,
for which various R interfaces have been developed. So as with OpenMP
in Chapter 5, we again have an instance of the R+X notion introduced in
Section 1.1.
To run the examples here, you’ll need a CUDA-capable video card, and the
CUDA development kit. To check whether your GPU is CUDA-capable,
first determine what type you have (e.g. under Linux run /sbin/lspci
or possibly /usr/bin/lspci), and then check the NVIDIA website or the
CUDA Wikipedia entry.
The CUDA kit can be downloaded from the NVIDIA site. It is free of
charge (though requires about a gigabyte of disk space).
Recall the discussion in Section 3.3, in which it was pointed out that effi-
cient parallel programming often requires keen attention to detail. This is
especially true for GPUs, which have a complex hardware structure, to be
presented shortly. Thus, optimizing CUDA code is difficult.
Another issue is that NVIDIA is continuing to add more and more pow-
erful features to that hardware. Thus we have a “moving target,” making
optimization even more of a challenge (though backward compatibility has
been maintained).
On the other hand, it was noted at the start of Chapter 4 that in the parallel
processing world, there is always a tradeoff between speed and programming
effort. In many cases, one is happy to have just “good enough” speed,
attained without having to expend herculean efforts in programming.
In light of this latter point, the CUDA code presented here is meant to have
good speed while staying simple. But there is more:
An important approach to addressing the above issues is to make use of
libraries. For many applications, very efficient CUDA libraries have been
developed, such as the CUBLAS library for matrix operations. In addition,
some R packages that include good CUDA code have also been developed.
A big advantage of these, tying in with the “moving target” metaphor
above, is that as NVIDIA hardware evolves, the libraries will typically be
updated. This obviates the need for you to update your own CUDA code.
Thus, this chapter will first present examples written directly in CUDA,
relatively simple in form but with reasonably good speed. These will serve
the dual purposes of (a) introducing CUDA programming, and (b) illustrating
how the hardware works, which is vital even for those who only use libraries.
Let’s start with an easy one. Below is CUDA code that inputs a matrix
and outputs an array consisting of the sums of the rows of the matrix.
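For reference, the serial computation is a one-liner in R; the CUDA code below produces the same numbers, but with one GPU thread per row:

m <- matrix(1:9, nrow=3, byrow=TRUE)
rowSums(m)   # 6 15 24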
// RowSums.cu; simple illustration of CUDA

// this is the "kernel", which each thread on the
// GPU executes
__global__ void find1elt(int *m, int *rs, int n)
{
   // this thread will handle row # rownum
   int rownum = blockIdx.x;
   int sum = 0;
   for (int k = 0; k < n; k++)
      sum += m[rownum*n+k];
   rs[rownum] = sum;
}

// the remaining code is executed on the CPU
int main(int argc, char **argv)
{
   // number of matrix rows/cols
   int n = atoi(argv[1]);
   int *hm,   // host matrix
       *dm,   // device matrix
       *hrs,  // host rowsums
       *drs;  // device rowsums
   // size of matrix in bytes
   int msize = n * n * sizeof(int);
   // allocate space for host matrix
   hm = (int *) malloc(msize);
   // as a test, fill matrix with consec. integers
   int t = 0, i, j;
   for (i = 0; i < n; i++) {
      for (j = 0; j < n; j++) {
         hm[i*n+j] = t++;
      }
   }
   // allocate matrix space at device
   cudaMalloc((void **)&dm, msize);
   // copy host matrix to device matrix
   cudaMemcpy(dm, hm, msize, cudaMemcpyHostToDevice);
   // allocate host, device rowsum arrays
   int rssize = n * sizeof(int);
   hrs = (int *) malloc(rssize);
   cudaMalloc((void **)&drs, rssize);
   // set up threads structure parameters
   dim3 dimGrid(n,1);      // n blocks in the grid
   dim3 dimBlock(1,1,1);   // 1 thread per block
   // launch the kernel
   find1elt<<<dimGrid,dimBlock>>>(dm, drs, n);
   // wait until kernel finishes
   cudaThreadSynchronize();
   // copy row vector from device to host
   cudaMemcpy(hrs, drs, rssize, cudaMemcpyDeviceToHost);
   // check results
   if (n < 10)
      for (int i = 0; i < n; i++) printf("%d\n", hrs[i]);
   // clean up, very important
   free(hm);
   cudaFree(dm);
   free(hrs);
   cudaFree(drs);
}
• The kernel function, find1elt() in this case, runs on the GPU, and
is so denoted by the prefix __global__.
• The host code sets up space in the device memory via calls to
cudaMalloc(), and transfers data from host to device or vice versa by
calling cudaMemcpy(). The data on the device side is global to all
threads.
Other than the host/device distinction, the above description sounds very
much like ordinary threaded programming. There is a major departure
from the ordinary, though, in the structure of the threads.
Threads on the GPU are broken down into blocks, with the totality of all
blocks being called the grid. In a kernel launch, we must tell the hardware
how many blocks our grid is to have, and how many threads each block will
have. Our code
dim3 dimGrid(n,1);      // n blocks in the grid
dim3 dimBlock(1,1,1);   // 1 thread per block
specifies n blocks per grid and one thread per block. One can also impose
imaginary two- and three-dimensional structures on the grid and blocks,
to be explained below; in the above code, the 1 in dimGrid(n,1) and the
latter two 1s in dimBlock(1,1,1) here basically decline to use this feature.
The advantage of having more than one thread per block will be discussed
below. In this simple code, we have a separate thread for each row of the
matrix.
In the kernel code itself, the line
int rownum = blockIdx.x;
determines which matrix row this particular thread will handle, as follows.
Each block and thread has an ID, stored in programmer-accessible structs
blockIdx and threadIdx, consisting of the block ID within the grid, and
the thread ID within the block. Since in our case we’ve set up only one
thread per block, the block ID is effectively the thread ID.
The .x field refers to the first coordinate in the block ID. The "coordinates"
of a block within the grid, and of a thread within a block, are merely
abstractions. If for instance one is programming computation of heat flow
across a two-dimensional slab, the programmer may find it clearer to use
two-dimensional IDs for the threads. But this does not correspond to any
physical arrangement in the hardware.
Some other points to mention:
• One compiles such code using nvcc in the CUDA toolkit, e.g.

% nvcc -o rowsums RowSums.cu

Adding a flag such as -arch=sm_11 tells the compiler that we need
code for compute capability (an NVIDIA term) of at least 1.1, say
because we call the CUDA function atomicAdd().
• Kernels can only have return type void. Thus a kernel must return
its results through its arguments.
1 Installing CUDA is beyond the scope of this book.
• Functions other than the kernel that will run on the device are denoted
by the prefix __device__. These functions can have return values.
They are called only by kernels or by other device functions.
int *hm,   // host matrix
    *dm,   // device matrix
Scorecards, get your scorecards here! You can’t tell the players without a
scorecard—classic cry of vendors at baseball games
Know thy enemy—Sun Tzu, The Art of War
The enormous computational potential of GPUs cannot be truly unlocked
without an intimate understanding of the hardware. This of course is a
fundamental truism in the parallel processing world, but it is acutely im-
portant for GPU programming. This section presents an overview of the
hardware.3
6.4.2.1 Cores
6.4.2.2 Threads
As we have seen, when you write a CUDA application program, you parti-
tion the threads into groups called blocks. The salient points are:
• The hardware will assign an entire block to a single SM, though sev-
eral blocks can run in the same SM.
• All the threads in a warp run the code in lockstep. During the machine
instruction fetch cycle, the same instruction will be fetched for all of
the threads in the warp. Then in the execution cycle, each thread will
either execute that particular instruction or execute nothing. The
execute-nothing case occurs in the case of branches; see below.
This is the classical single instruction, multiple data (SIMD) pattern
used in some early special-purpose computers such as the ILLIAC;
here it is called single instruction, multiple thread (SIMT).
Knowing that the hardware works this way, the programmer controls the
block size and the number of blocks, and in general writes the code to take
advantage of how the hardware works.
The SIMT nature of thread execution has major implications for perfor-
mance. Consider what happens with if/then/else code. If some threads
in a warp take the “then” branch and others go in the “else” direction,
they cannot operate in lockstep. That means that some threads must wait
while others execute. This renders the code at that point serial rather than
parallel, a situation called thread divergence. As one CUDA Web tutorial
points out, this can be a “performance killer.” Threads in the same block
but in different warps can diverge with no problem.
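As a hedged illustration (a hypothetical kernel, not from the book), consider the following; threads within one warp may disagree at the inner if, so the two branches are serialized there.

__global__ void clampsqrt(float *x, int n)
{  int i = blockIdx.x * blockDim.x + threadIdx.x;
   if (i < n) {
      if (x[i] > 0.0f)        // some threads in a warp take this branch...
         x[i] = sqrtf(x[i]);
      else                    // ...while others execute nothing here, then run this
         x[i] = 0.0f;
   }
}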
The hardware support for threads is extremely good; a context switch from
one warp to another takes very little time, quite a contrast to the OS
case. Moreover, as noted above, the long latency of global memory may
be solvable by having a lot of threads that the hardware can timeshare to
hide that latency; while one warp is fetching data from memory, another
warp can be executing, thus not losing time due to the long fetch delay.
For these reasons, CUDA programmers typically employ a large number of
threads, each of which does only a small amount of work—again, quite a
contrast to something like OpenMP.
In choosing the number of blocks and the number of threads per block, one
typically knows the number of threads one wants (recall, this may be far
more than the device can physically run at one time, due to the desire to
ameliorate memory latency problems), so configuration mainly boils down
to choosing the block size. This is a delicate art, again beyond the scope
of this book, but here is an overview of the considerations (a small host-side sketch follows the list):
• The device will have limits on the block size, number of threads on
an SM, and so on. (See Section 6.4.2.9 below.)
• One wants to utilize all the SMs. If one sets the block size too large,
not all will be used, as a block cannot be split across SMs.
• On the other hand, if one is using shared memory, this can only be
done at the block level, and efficient use may indicate using a larger
block.
• Two threads doing unrelated work, or the same work but with many
if/elses, would cause a lot of thread divergence if they were in the same
block. In some cases, it may be known in advance which threads will
do the “ifs” and which will do the “elses,” in which case they should
be placed in different blocks if possible.
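A common host-side pattern, shown here only as a sketch (nthreads and the commented-out kernel launch are hypothetical), is to fix a block size and round the block count up so that the desired total number of threads is covered.

// choose a block size, ideally a multiple of the warp size, 32,
// then derive the number of blocks needed to cover nthreads threads
void launchconfig(int nthreads)
{  int blocksize = 192;
   int nblocks = (nthreads + blocksize - 1) / blocksize;  // round up
   dim3 dimGrid(nblocks, 1);
   dim3 dimBlock(blocksize, 1, 1);
   // somekernel<<<dimGrid, dimBlock>>>(...);  // launch with this configuration
}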
In our code example in Section 6.4.1, we had one thread per row of the
matrix. That degree of fine-grained parallelism may surprise those who are
used to classical shared-memory programming. Section 2.7 did note that
it may be beneficial to have more threads than cores, due to cache effects
and so on, but in the multicore setting this would mean just a few more.
By contrast, in the GPU world, it’s encouraged to have lots of threads,4 in
order to circumvent the memory latency problems in GPUs: If the GPU
operating system senses that there may be quite a delay in the memory
access needed by a given thread, that thread is suspended and another is
found to run; by having a large number of threads, we ensure that the OS
will succeed in finding a new thread for this. Here we are using latency
hiding (Section 2.5).
Note too that while the previous paragraph spoke of the OS sensing that a
thread faces a memory delay, that was an oversimplification. Since threads
are scheduled in warps, if just one thread faces a memory delay, then the
entire warp must wait.
On the other hand, this is actually a plus, as follows. The global memory in
GPUs uses low-order interleaving, which means that consecutive memory
addresses are physically stored in simultaneously accessible places. And
furthermore, the memory is capable of burst mode, meaning that one can
request accesses to several consecutive locations at once.
This means we can reap great benefits if we can design our code so that con-
secutive threads access consecutive locations in memory. In that case, you
can see why the NVIDIA designers were wise to schedule thread execution
in warp groups, giving us excellent latency hiding.
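A hedged sketch of the pattern (again a hypothetical kernel, not from the book): thread i touches x[i], so the 32 accesses of a warp fall in adjacent locations and can be served by a single burst.

__global__ void scaleall(float *x, float c, int n)
{  int i = blockIdx.x * blockDim.x + threadIdx.x;   // consecutive threads...
   if (i < n) x[i] *= c;                            // ...touch consecutive elements
}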
As noted earlier, the GPU has a small amount of shared memory, the
term shared meaning that threads in a block share that storage. Access
to shared memory is both low-latency and high-bandwidth, compared to
access of global memory, which is off-chip. As noted, the size is small, so the
programmer must anticipate what data, if any, is likely at any given time to
be accessed repeatedly by the code. If such data exists, the code can copy it from
global memory to shared memory, and access the latter, for a performance
win. In essence, shared memory is a programmer-managed cache.
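The following hypothetical kernel, a sketch only, shows the basic idiom; it assumes the block size equals BLOCKSIZE.

#define BLOCKSIZE 192

__global__ void workoncache(float *x, int n)
{  __shared__ float chunk[BLOCKSIZE];       // this block's private cache
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   if (i < n) chunk[threadIdx.x] = x[i];    // stage global -> shared
   __syncthreads();                         // wait until the whole block has copied
   // ... repeated, low-latency accesses to chunk[] would go here ...
   if (i < n) x[i] = chunk[threadIdx.x];    // write any results back
}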
4 There are limits, however, and a request to set up too many threads may fail.
Details of the use of shared memory are beyond the scope of this book.
Indeed, there is far more to GPU programming than could comfortably be
included here.
All this illustrates why our discussion in Section 6.2 recommended that
most users either (a) settle for writing “pretty fast but simple” CUDA
code, and/or (b) rely mainly on libraries of either preoptimized CUDA
code or R code that interfaces to high-quality CUDA code.
Any CUDA device has limits on the number of blocks, threads per block
and so on. For safety, calls to cudaMalloc() should be accompanied by
error checking, something like
if (cudaSuccess != cudaMalloc(...)) {...}
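A slightly fuller version of that idea, as a sketch only (the wrapper name checkcuda() is not part of CUDA, though cudaGetErrorString() is):

#include <stdio.h>
#include <cuda_runtime.h>

// report any error returned by a CUDA runtime call
void checkcuda(cudaError_t status, const char *what)
{  if (status != cudaSuccess)
      fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(status));
}

// e.g.  checkcuda(cudaMalloc((void **) &dm, msize), "cudaMalloc");

The limits themselves can be queried with cudaGetDeviceProperties(), as in the following code: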
#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
   cudaDeviceProp Props;
   cudaGetDeviceProperties(&Props, 0);
   printf("shared mem: %d\n", (int) Props.sharedMemPerBlock);
   printf("max threads/block: %d\n", Props.maxThreadsPerBlock);
   printf("max blocks: %d\n", Props.maxGridSize[0]);
   return 0;
}
Once again, our example involves our mutual Web outlinks problem, first
introduced in Section 1.4. Here is one approach to the calculation on a
GPU, but with the same change made in Section 5.9.1: We now look at
inbound links. Thus we are looking for matches of 1s between pairs of
columns of the adjacency matrix, as opposed to rows in the outlinks exam-
ples. (Or equivalently, suppose we are interested in outlinks but are storing
the matrix in transpose form.)
The reason for this change is that the example will illustrate the principle
of GPU latency hiding discussed in Section 6.4.2.6. Consider threads 3
and 4, say. In the “i” (i.e., outermost) loop in the code below, these two
threads will be processing consecutive columns of the matrix. Since the
GPU is running the two threads in lockstep, and because storage is in row-
major order, thread 4 will always be accessing an element of the matrix
in the same row as thread 3, but with the two accesses being to adjacent
elements. Thus we can really take advantage of burst mode in the memory
chips.
// usage:
//
//    mutin numvertices numblocks

// kernel: processes all pairs assigned to a given thread
__global__ void procpairs(int *m, int *tot, int n)
{
   // total number of threads = number of blocks * block size
   int totth = gridDim.x * BLOCKSIZE,
       // my thread number
       me = blockIdx.x * blockDim.x + threadIdx.x;
   int i, j, k, sum = 0;
   // various columns i
   for (i = me; i < n; i += totth) {
      for (j = i+1; j < n; j++) {  // all columns j > i
         for (k = 0; k < n; k++)
            sum += m[n*k+i] * m[n*k+j];
      }
   }
   atomicAdd(tot, sum);
}
hm = (int *) malloc(msize);
// as a test, fill matrix with random 1s and 0s
int i, j;
for (i = 0; i < n; i++) {
   hm[n*i+i] = 0;
   for (j = 0; j < n; j++) {
      if (j != i) hm[i*n+j] = rand() % 2;
   }
}
// more of the usual initializations
cudaMalloc((void **) &dm, msize);
// copy host matrix to device matrix
cudaMemcpy(dm, hm, msize, cudaMemcpyHostToDevice);
htot = 0;
// set up device total and initialize it
cudaMalloc((void **) &dtot, sizeof(int));
cudaMemcpy(dtot, &htot, sizeof(int), cudaMemcpyHostToDevice);
// OK, ready to launch kernel, so configure grid
dim3 dimGrid(nblk, 1);
dim3 dimBlock(BLOCKSIZE, 1, 1);
// launch the kernel
procpairs<<<dimGrid, dimBlock>>>(dm, dtot, n);
// wait for kernel to finish
cudaThreadSynchronize();
// copy total from device to host
cudaMemcpy(&htot, dtot, sizeof(int), cudaMemcpyDeviceToHost);
// check results
if (n <= 15) {
   for (i = 0; i < n; i++) {
      for (j = 0; j < n; j++)
         printf("%d ", hm[n*i+j]);
      printf("\n");
   }
}
printf("mean = %f\n", htot / (float) ((n*(n-1))/2));
// clean up
free(hm);
cudaFree(dm);
cudaFree(dtot);
}
By now the reader will immediately recognize that the reason for staggering
the columns that a given thread handles is to achieve load balance.
Also, we've continued to calculate the "dot product" between each pair
of columns, to avoid thread divergence. The only new material, a call to
atomicAdd(), is discussed in Section 6.6.
The code was run on a machine with a Geforce 285 GPU, for various num-
bers of blocks, with a comparison to ordinary CPU code. Here are the
results, with times in seconds:
blocks time
CPU 97.26
4 4.88
8 3.17
16 2.48
32 2.36
So, first of all, we see that using the GPU brought us a dramatic gain!
Note, though, that while GPUs work quite well for certain applications,
they do poorly on others. Any algorithm that necessarily requires a lot of
if-then-else operations, for instance, is a poor candidate for GPUs, and the
same holds if the algorithm needs a considerable number of synchronization
operations. Even in our Web link example here, the corresponding speedup
was far more modest for the outlinks version (not shown here), due to the
memory latency issues we’ve discussed.
Second, the number of blocks does matter. But since our block size was
192, we could only make use of about 1750/192 blocks, so there was no point in going
beyond 32 blocks.
The call

atomicAdd(tot, sum);

atomically adds sum to the running total pointed to by tot. Atomic operations
like this look good, but appearances can be deceiving, in this case
masking the fact that these operations are extremely slow. For example,
though a barrier could in principle be constructed from the atomic op-
erations, its overhead would be quite high. In earlier models that delay
was near a microsecond, and though that problem has been ameliorated in
more recent models, implementing a barrier in this manner would not be
much faster than attaining interblock synchronization by returning to
the host and calling cudaThreadSynchronize() there. The latter is a
possible way to implement a barrier, since global memory stays intact in
between kernel calls, but again, it would be slow.
NVIDIA does offer barrier synchronization at the block level, via a call to
__syncthreads(). This is reasonably efficient, but it still leaves us short of
an efficient way to do a global barrier operation, across the entire GPU.
The overall outcome from all of this is that algorithms that rely heavily on
barriers may not run well on GPUs.
Data in global memory persists through the life of the program. In other
words, if our program launches one kernel and then later a second one, with
both operating on the same device array dx, then in that second kernel execution,
dx will still contain whatever data it had at the end of the first kernel call.
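A sketch of that scenario (the kernel names and launch configuration here are purely illustrative):

// dx points to an array in device global memory
firstpass<<<dimGrid, dimBlock>>>(dx, n);
cudaThreadSynchronize();                 // wait for the first kernel to finish
// dx still holds whatever firstpass left in it
secondpass<<<dimGrid, dimBlock>>>(dx, n);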
The above scenario occurs in many GPU-based applications. For example,
consider iterative algorithms. As noted in Section 6.6 it is difficult to have
a barrier operation across blocks, and it may be easier just to return to the
CPU after each iteration, in essence using cudaThreadSynchronize() to
implement a barrier. Or, the algorithm itself may not be iterative, but
our data may be too large to fit in GPU memory, necessitating doing the
computation one chunk at a time, with a separate kernel call for each chunk.
In such scenarios, if we had to keep copying back and forth between hx and
dx, we may incur very significant delays. Thus we should exploit the fact
that data in global memory does persist across kernel calls.6
Some GPU libraries for R, such as gputools, do not exploit this persistence
of data in global memory. However, the gmatrix package on CRAN does
do this, as does RCUDA.
6 By contrast, for instance, data in shared memory lives only during the given kernel
call.
I filled n × n matrices with U(0,1) data, for various n, comparing run times.
Here are the results, in seconds:
n        dist()     gpuDist()
1000       3.671      0.258
2500     103.219      3.220
5000     609.271      error
The speedups in the first two cases are quite impressive, but an execution
error occurred in the last case, with a message, “the launch timed out and
was terminated.” The problem here was that the GPU was being used
both for computation and for the ordinary graphics screen of the computer
housing the GPU. One can disable the timeout if one owns the machine,
or better, purchase a second GPU for computation only. GPU work is not
easy...
GPUs are just one kind of accelerator chip. Others exist, and in fact go
back to the beginning of the PC era in the 1980s, when one could purchase
a floating-point hardware coprocessor.
The huge success of the NVIDIA GPU family did not go unnoticed by
Intel. In 2013 Intel released the Xeon Phi chip, which had been under
development for several years. At this writing (Spring 2014), NVIDIA has
7 If you have CUDA installed in a nonstandard location, you'll need to download the
gputools source and then build using R CMD INSTALL, specifying the locations
of the library and include files. See the INSTALL file that is included in the package.
During execution and use, be sure your environment variable LD_LIBRARY_PATH
includes the CUDA library.
a large head start in this market, so it is unclear how well the Intel chip
will do in the coming years.
The Xeon Phi features 60 cores, each 4-way hyperthreaded, thus with a
theoretical parallelism level of 240. This number is on the low end
of NVIDIA chips. But on the other hand, the Intel chip is much easier
to program, as it has much more of a classic multicore design. One can
run OpenMP, MPI and so on. As with GPUs, though, the bandwidth and
latency for data transfers between the CPU and the accelerator chip can
be a major issue.
It must again be kept in mind that while the NVIDIA chips can attain
exceptionally good performance on certain applications, they perform poorly
on others. A number of analysts have done timing tests, and there are
indeed applications for which the Intel chip seems to do better than, for
instance, the NVIDIA Tesla series, in spite of having fewer cores.
Chapter 7

Thrust and Rth
Thrust was developed by NVIDIA, maker of the graphics cards that run
CUDA. It consists of a collection of C++ templates, modeled after the
C++ Standard Template Library (STL). One big advantage of the template
approach is that no special compiler is needed; Thrust is simply a set of
“#include” files.
In one sense, Thrust may be regarded as a higher-level way to develop GPU
code, avoiding the tedium of details that arise when programming GPUs.
But more important, as noted above, Thrust does enable heterogeneous
computing, in the sense that it can produce different versions of machine
code to run on different platforms.
When one compiles code using Thrust, one can choose the backend, be
it GPU code or multicore. In the latter case one can currently choose
OpenMP or TBB for multicore machines.
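For instance, under typical installations one might compile the same source file along the following lines; this is only a sketch, and the exact flags depend on the Thrust version and where its headers live.

% nvcc -O2 myprog.cu                    # GPU backend
% g++ -O2 -x c++ myprog.cu -fopenmp \
      -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP \
      -I/path/to/thrust -lgomp          # OpenMP backend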
In other words, Thrust allows one to write a single piece of code that can
run either on a GPU or on a classic multicore machine. The code of course
will not be optimized, but in many cases one will attain reasonable speed
for this diversity of hardware types.
7.3 Rth
As will be seen below, even though Thrust is designed to ease the task
of GPU programming, coding in Thrust is somewhat difficult. C++ tem-
plate code can become quite intricate and abstract if you are not used to
something like the STL (or even if you do have STL experience).
Thus once again it is desirable to have R libraries available that interface
to a lower-level language, the “R+X” concept mentioned in Section 1.1.2.
Drew Schmidt and I have developed Rth (https://github.com/Rth-org/
Rth), an R package that builds on Thrust. Rth implements a number of
R-callable basic operations in Thrust.
The goal is to give the R programmer the advantages of Thrust, without
having to program in Thrust (let alone C++, CUDA or OpenMP) him or
herself.
For instance, Rth provides a parallel sort, which is really just an R wrapper
for the Thrust sort. (See Section 10.5.) Again, this is a versatile sort, in
that it can take advantage of both GPUs and multicore machines.
There are of course many ways to write code for a given application, and
that certainly is the case here. The approach shown below is mainly meant
to illustrate how Thrust works, rather than being claimed to be an optimal
implementation.
// Quantiles.cpp, Thrust example
// calculate every k-th element in given numbers, going from
// smallest to largest; k obtained from command line and fed
// into the ismultk() functor
// these are the i*k/n * 100 percentiles, i = 1, 2, ...
1 If one is only going to use multicore backends, it's better not to copy to a device vector, thus saving the cost of the copy.
// functor
struct ismultk {
   const int increm;  // k in above comments
   // get k from call
   ismultk(int increm) : increm(increm) {}
   __device__ bool operator()(const int i)
   {  return i != 0 && (i % increm) == 0;  }
};
In Thrust, one works with vectors rather than arrays, in the sense that
vectors are objects, very much like in R. They thus have built-in methods,
leading to expressions such as dx.begin(), a function call that returns the
location of the start of dx. Similarly, dx.size() tells us the length of dx,
dx.end() points to the location one element past the end of dx and so on.
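A small, self-contained sketch of these basics (hypothetical code, not from the book):

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>

void vecdemo()
{  thrust::host_vector<double> hx(1000, 1.0);  // 1000 elements, all 1.0
   thrust::device_vector<double> dx = hx;      // copy host -> device
   int n = dx.size();                          // length of dx, here 1000
   if (n > 1) thrust::sort(dx.begin(), dx.end());  // begin()/end() delimit the range
}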
The key here is the function copy_if(), which as the name implies copies
all elements of an input vector (dx here) that satisfy a certain predicate.
The latter role is played by ismultk, to be explained shortly, with the help
of a stencil, in this case the vector seq.

The output is stored here in out, but not all of that vector will be filled.
Thus the return value of copy_if(), assigned here to newend, is used to
inform us as to where the output actually ends.
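The call being discussed is roughly of the following form. This is only a sketch, under the assumptions that dx holds the data, out is a device vector of the same length, and incr holds k; the headers <thrust/sequence.h> and <thrust/copy.h> supply sequence() and copy_if().

// build the stencil 0,1,2,...,n-1
thrust::device_vector<int> seq(n);
thrust::sequence(seq.begin(), seq.end());
// copy dx[i] to out whenever ismultk accepts stencil value i
thrust::device_vector<double>::iterator newend =
   thrust::copy_if(dx.begin(), dx.end(), seq.begin(),
                   out.begin(), ismultk(incr));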
Now let’s look at ismultk (“is a multiple of k”):
struct ismultk {
   const int increm;  // k in above comments
   // get k from call
   ismultk(int increm) : increm(increm) {}
   __device__ bool operator()(const int i)
   {  return i != 0 && (i % increm) == 0;  }
};
The keyword operator here tells the compiler that ismultk will serve
as a function. At the same time, it is also a struct, containing the data
increm, which as the comments note, is "k" in our description "take every
k-th element of the array."
Now, how is that function called? Let's look at the expression in the copy_if() call examined
above:

ismultk(incr)

What that call does is instantiate the struct ismultk. In other words,
that call returns a struct of type ismultk, with the member variable increm
in that struct being set to incr. That assignment is done in the constructor line

ismultk(int increm) : increm(increm) {}

in the struct definition above.
What copy_if() does is apply that function to all values i in the stencil,
where the latter consists of 0,1,2,...,n-1.
So all of this is a roundabout way of copying dx[i] to out for i = k, 2k, 3k,
..., with ik not to exceed n-1. This is exactly what we want.
The word roundabout above is apt, and arguably is typical of Thrust (and,
for that matter, of the C++ STL). But we do get hardware generality from
that effect, with our code being applicable both to GPUs and multicore
platforms.
Important note: If you plan to write Thrust code on your own, or any other
code that uses functors, C++11 lambda functions can simplify things a lot.
See Section 11.6.4 for an example.
// single-pass, subject to increased roundoff error

typedef thrust::device_vector<int> intvec;
typedef thrust::device_vector<double> doublevec;

#if RTH_OMP
   omp_set_num_threads(INT(nthreads));
#elif RTH_TBB
   tbb::task_scheduler_init init(INT(nthreads));
#endif

double xy =
   thrust::inner_product(dx.begin(), dx.end(), dy.begin(), zero);
double x2 =
   thrust::inner_product(dx.begin(), dx.end(), dx.begin(), zero);
double y2 =
   thrust::inner_product(dy.begin(), dy.end(), dy.begin(), zero);
double xt =
   thrust::reduce(dx.begin(), dx.end());
double yt =
   thrust::reduce(dy.begin(), dy.end());
double xm = xt/n, ym = yt/n;
double xsd = sqrt(x2/n - xm*xm);
double ysd = sqrt(y2/n - ym*ym);
double cor = (xy/n - xm*ym) / (xsd*ysd);
return Rcpp::wrap(cor);
}
Here the Thrust function inner_product() performs the "dot product"
X1Y1 + X2Y2 + ... + XnYn, and Thrust's reduce() does a reduction as with constructs of
a similar name in OpenMP and Rmpi; the default operation is addition.
Chapter 8

Message Passing
The most popular C-level package for message passing is the Message Pass-
ing Interface (MPI), a collection of routines callable from C/C++.1 Pro-
fessor Hao Yu of the University of Western Ontario wrote an R package,
Rmpi, that interfaces R to MPI, as well as adding a number of R-specific
functions. Rmpi will be our focus in this chapter. (Two other popular
message-passing packages, PVM and 0MQ, also have had R interfaces de-
veloped for them, Rpvm and Rzmq, as well as a very promising new R
interface to MPI, pbdR.)
So with Rmpi, we might have, say, eight machines in our cluster. When we
run Rmpi from one machine, that will then start up R processes on each of
the other machines. This is the same as what happens when we use snow
on a physical cluster, where for example the call makeCluster(8) causes
8 R processes to be created on the manager's machine. The various
processes will occasionally exchange data, via calls to Rmpi functions, in
order to run the given application in parallel. Again, this is the same as for
snow, but here the workers can directly exchange data with each other.
We’ll cover a specific example shortly. But first, let’s follow up on the
discussion of Section 2.5, and note the special issues that arise with message
passing code. (The reader may wish to review the concepts of latency and
bandwidth in that section before continuing.)
http://heather.cs.ucdavis.edu/parprocbook.
8.4 Rmpi
Note that MPI also provides network services beyond simply sending and
receiving messages. An important point is that it enforces message order.
If say, messages A and B are sent from process 8 to process 3 in that order,
then the program at process 3 will receive them in that order. A call at
process 3 to receive from process 8 will receive A first, with B not being
processed until the second such call.2
This makes the logic in your application code much easier to write. Indeed,
if you are a beginner in the parallel processing world, keep this foremost
in mind. Code that makes things happen in the wrong order (among the
various processes) is one of the most common causes of bugs in parallel
programming.
In addition, MPI allows the programmer to define several different kinds of
messages. One might make a call, for instance, that says in essence, “read
the next message of type 2 from process 8,” or even “read the next message
of type 2 from any process.”
Rmpi provides the R programmer with access to such operations, and
also provides some new R-specific messaging operations. It is a very rich
package, and we can only provide a small introduction here.
With all that power comes complexity. Rmpi can be tricky to install—and
even to launch—with various platform dependencies to deal with, such as
those related to different flavors of MPI. Since there are too many possible
scenarios, I will simply discuss an example setup.
On a certain Linux machine, I had installed MPI in a directory
/home/matloff/Pub/MPICH, and then installed Rmpi as follows. After
downloading the source package from CRAN, I ran

$ R CMD INSTALL -l ~/Pub/Rlib Rm*z \
    --configure-args="--with-mpi=/home/matloff/Pub/MPICH \
    --with-Rmpi-type=MPICH"
Instead, I did the following setup to enable running Rmpi, making use of
a file Rprofile that comes with the Rmpi package for this purpose:
$ mkdir ~/MyRmpi
$ cd ~/MyRmpi
# make copy of R
$ cp /usr/bin/R Rmpi
# make it runnable
$ chmod u+x Rmpi
# edit my shell startup file (not shown):
#    place Rmpi file in my execution path
#    add /home/matloff/Pub/MPICH/lib to LD_LIBRARY_PATH
# edit Rmpi file (not shown):
#    after "export R_HOME", insert
#       R_PROFILE=/home/matloff/MyRmpi/Rprofile; export R_PROFILE
$ cp ~/Pub/Rlib/Rmpi/Rprofile .
# edit Rprofile (not shown):
#    insert at top
#       .libPaths('/home/matloff/Pub/Rlib')
# test:
# set up to run all processes on local machine
$ echo "localhost" > hosts
# run MPI on the Rmpi file, with 3 processes
$ mpiexec -f hosts -n 3 Rmpi --no-save -q
# R now running, with Rmpi loaded, mgr and 2 wrkrs
> mpi.comm.size()  # number of processes, should be 3
# have the 2 workers run sum(), one on 1:3 and the other on 4:5
> mpi.apply(list(1:3, 4:5), sum)  # should print 6, 9
8.5.1 Algorithm
• msgsize: the size of messages from the manager to the first worker
To do all this, I first got Rmpi running as in Section 8.4.1, and placed the
code in Section 8.5.2 in a file PrimePipe.R. I then ran as follows:
> source("PrimePipe.R")
> dvs <- serprime(ceiling(sqrt(1000)))
> dvs
[1] 2 3 5 7 11 13 17 19 23 29 31
> primepipe(1000,dvs,100)
[1] 2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61
[19] 67 71 73 79 83 89 97 101 103 107 109 113 127 131 137 139 149 151
[37] 157 163 167 173 179 181 191 193 197 199 211 223 227 229 233 239 241 251
[55] 257 263 269 271 277 281 283 293 307 311 313 317 331 337 347 349 353 359
[73] 367 373 379 383 389 397 401 409 419 421 431 433 439 443 449 457 461 463
[91] 467 479 487 491 499 503 509 521 523 541 547 557 563 569 571 577 587 593
[109] 599 601 607 613 617 619 631 641 643 647 653 659 661 673 677 683 691 701
[127] 709 719 727 733 739 743 751 757 761 769 773 787 797 809 811 821 823 827
[145] 829 839 853 857 859 863 877 881 883 887 907 911 919 929 937 941 947 953
[163] 967 971 977 983 991 997
# for illustration purposes, not intended to be optimal,
# e.g. needs better load balancing

# returns vector of all primes in 2..n; the vector "divisors" is
# used as a basis for a Sieve of Eratosthenes operation; must have
# n <= (max(divisors)^2) and n even; "divisors" could first be
# found, for instance, by applying a serial prime-finding method
# on 2..sqrt(n), say
# the argument "msgsize" controls the chunk size in
# communication from the manager to the first worker, node 1

# manager code
primepipe <- function(n, divisors, msgsize) {
   # supply the workers with the functions they need
   mpi.bcast.Robj2slave(dowork)
   mpi.bcast.Robj2slave(dosieve)
   # start workers, instructing them to each run dowork();
   # note nonblocking call
   mpi.bcast.cmd(dowork, n, divisors, msgsize)
   # remove the evens right away
   odds <- seq(from=3, to=n, by=2)
   nodd <- length(odds)
   # send odds to node 1, in chunks
   startmsg <- seq(from=1, to=nodd, by=msgsize)
   for (s in startmsg) {
      rng <- s:min(s+msgsize-1, nodd)
      # only one message type, 0
      mpi.send.Robj(odds[rng], tag=0, dest=1)
   }
   # send end-data sentinel, chosen to be NA
   mpi.send.Robj(NA, tag=0, dest=1)
   # wait for and receive results from last node,
   # and return the result;
   # don't forget the 2, the first prime
   # ID of last process
   lastnode <- mpi.comm.size() - 1
   c(2, mpi.recv.Robj(tag=0, source=lastnode))
}

# worker code
dowork <- function(n, divisors, msgsize) {
   me <- mpi.comm.rank()
   # which chunk of "divisors" is mine?
   lastnode <- mpi.comm.size() - 1
   ld <- length(divisors)
   tmp <- floor(ld / lastnode)
   mystart <- (me-1) * tmp + 1
# check divisibility of x by divs
dosieve <- function(x, divs) {
   for (d in divs) {
      x <- x[x %% d != 0 | x == d]
   }
   x
}
Let's try the case n = 10000000. The serial code took 424.592 seconds.
Let's try it in parallel on a network of PCs, first with two, then three and
then four workers, with various values of msgsize. The results are shown
in Table 8.1.
The parallel version was indeed faster than the serial one. This was partly
due to parallelism and partly to the fact that the parallel version is more
efficient, since the serial algorithm does more total crossouts. A fairer
comparison might be a recursive version of serprime(), which would reduce
the number of crossouts. But there are other important facets of the timing
numbers.
First, as expected, using more workers produced more speed, at least in
the range tried here. Note, though, that the speedup was not linear. The
best time for three workers was only 30% better than that for two workers,
compared to a “perfect” speedup of 50%. Using four workers instead of
two yields only a 53% gain. Second, we see that msgsize is an important
factor, explained in the next section.
Another salient aspect here is that msgsize matters. Recall Section 2.5,
especially Equation (2.1). Let's see how those factors affect things here.
In our timings above, setting the msgsize parameter to the lower value,
1000, results in having more chunks, thus more times that we incur the net-
work latency. On the other hand, a value of 50000 yields less parallelism—
there would be no parallelism at all with a chunk size of 10000000/2—and
thus impedes our ability to engage in latency hiding (Section 2.5), in which
we try to overlap computation and communication; this reduces speed.
There are a number of ways in which the code could be improved algo-
rithmically. Notably, we have a serious load balance problem (Section 2.1).
Here’s why:
Suppose for simplicity that each process handles only one element of divi-
sors. Process 0 then first removes all multiples of 2, leaving n/2 numbers.
Process 1 then removes all multiples of 3 from the latter, leaving n/3 num-
bers. It can be seen from this that Process 2 has much less work to do than
Process 1, and Process 3 has a much lighter load than Process 2, etc.
One possible solution might be to have the code do its partitioning of the
vector divisors in an uneven way, assigning larger chunks of the vector to
the later processes.
Note that the code sends data messages via the functions mpi.send.Robj() and mpi.recv.Robj().
So, let’s look at the code. First, a bit of housekeeping. Just as with snow,
we need to send the workers the functions they’ll use:
mpi.bcast.Robj2slave(dowork)
mpi.bcast.Robj2slave(dosieve)
As noted earlier, the parameter msgsize controls the tradeoff between the
computation/communication overlap, and the overhead of launching a mes-
sage. A larger value means fewer times we pay the latency price, but less
parallelism.
Note that each worker needs to know when there will be no further input
from its predecessor. Sending an NA value serves that purpose.
In the call to mpi.send.Robj() above,

mpi.send.Robj(odds[rng], tag=0, dest=1)

the manager sends a chunk of the odd numbers to worker 1, using the single
message type, tag 0. Recall too that mpi.bcast.cmd() is nonblocking; the
workers complete that earlier broadcast by executing the request from the
manager, in this case a command to run the dowork() function.
The command broadcast by the manager here tells the workers to execute
dowork(n,divisors,msgsize). They will thus now be doing useful work,
in the sense that they are now running the application, though they still
must wait to receive their data.
Eventually the last worker will send the final list of primes back to the
manager, which will receive it, and return the result to the caller:
lastnode <- mpi.comm.size() - 1
c(2, mpi.recv.Robj(tag=0, source=lastnode))
Note that since the manager had removed the multiples of 2 originally, the
number 2 won’t be in what is received here. Yet of course 2 is indeed a
prime, so we need to add it to the list.
The function mpi.comm.size() returns the communicator size, the total
number of processes, including the manager. Recalling that the latter is
Process 0, we see that the last worker’s process number will be the com-
municator size minus 1. In more advanced MPI applications, we can set
up several communicators, i.e., several groups of processes, rather than just
one, as in our case here. A broadcast then means transmitting a message to all
processes in that communicator.
So, what about dowork(), the function executed by the workers? First,
note that worker i must receive data from worker i-1 and send data to
worker i+1. Thus the worker needs to know its process ID number, or rank
in MPI parlance:
me <- mpi.comm.rank()
Now the worker must decide which of the divisors it will be responsible for.
This will be a standard chunking operation:
ld <- length(divisors)
tmp <- floor(ld / lastnode)
mystart <- (me-1) * tmp + 1
myend <- mystart + tmp - 1
if (me == lastnode) myend <- ld
mydivs <- divisors[mystart:myend]
In the case of the final worker, it accumulates its “crossing out” results in
a vector out,
sieveout <- dosieve(msg, mydivs)
out <- c(out, sieveout)
b <- double(100000)
b <- mpi.recv(x=b,type=2,source=5)
The call
mpi.send(x, type=2, tag=0, dest=8)
sends the data in x. But when does the call return? The answer depends on
the underlying MPI implementation. In some implementations, probably
most, the call returns as soon as the space x is reusable, as follows. Rmpi
will call MPI, which in turn will call network-send functions in the operating
system. That last step will involve copying the contents of x from one’s R
4 Even this may not be enough. R has a copy on write policy, meaning that if a vector
element is changed, the memory for the vector may be reallocated. The word “may” is
key here, and recent versions of R attempt to reduce the number of reallocations, but
there is never any guarantee on this.
program to space in the OS, after which x is reusable. The point is that
the call could return long before the data reaches the receiver.
Other implementations of MPI, though, wait until the destination process,
number 8 in the example above, has received the transmitted data. The
call to mpi.send() at the source process won’t return until this happens.
Due to network delays, there could be a large performance difference be-
tween the two MPI implementations. There are also possible implications
for deadlock (Section 8.7.2).
In fact, even with the first kind of implementation, there may be some delay.
For such reasons, MPI offers nonblocking send and receive functions, for
which Rmpi provides the interfaces such as mpi.isend() and mpi.irecv().
This way you can have your code get a send or receive started, do some
other useful work, and then later check back to see if the action has been
completed, using a function such as mpi.test().
If the MPI implementation has send operations block until the matching
receive is posted, then trouble arises whenever two processes each try to
send to the other before posting their receives. Say Processes 3 and 8 do
so. This would create a deadlock problem, meaning that the two processes
are stuck, waiting for each other: Process 3 would start the send, but then
wait for an acknowledgment from 8, while 8 would do the same and wait
for 3. They would wait forever.
This arises in various other ways as well. Suppose we have the manager
launch the workers via the call

mpi.bcast.cmd(dowork, n, divisors, msgsize)

which, as noted earlier, returns right away. A second option, mpi.remote.exec(),
would make the same call at the workers, but would wait until the workers
were done with their work before returning (and then assigning the results
to res). Now suppose the function dowork() does a receive from the man-
ager, and suppose we use that second call above, with mpi.remote.exec(),
after which the manager does a send operation, intended to be paired with
the receive ops at the workers. In this setting, we would have deadlock.
Deadlock can arise in shared-memory programming as well (Chapter 4),
but the message-passing paradigm is especially susceptible to it. One must
constantly beware of the possibility when writing message-passing code.
So, what are the solutions? In the example involving Processes 3 and 8
above, one could simply switch the ordering:
me <- mpi.comm.rank()
if (me == 3) {
   mpi.send(x, type=2, tag=0, dest=8)
   mpi.recv(y, type=2, tag=0, source=8)
} else if (me == 8) {
   mpi.recv(y, type=2, tag=0, source=3)
   mpi.send(x, type=2, tag=0, dest=3)
}
Chapter 9

MapReduce Computation
As the world emerged into an era of Big Data, demand grew for a comput-
ing paradigm that (a) is generally applicable and (b) works on distributed
data. The latter term means that data is physically distributed over many
chunks, possibly on different disks and maybe even different geographical
locations. Having the data stored in a distributed manner facilitates paral-
lel computation — different chunks can be read simultaneously — and also
enables us to work with data sets that are too large to fit into the memory
of a single machine. Demand for such computational capability led to the
development of various systems using the MapReduce paradigm.
MapReduce is really a form of the scatter-gather pattern we've seen fre-
quently in this book, with the added feature of a sorting operation in
the middle. In rough form, it works like this. The input is in a (distributed)
file, fed into the following process:
key \t data

(a key, then a tab character, then a data field)
The typical "Hello World" introductory example is word count for a text
file. The mapper program breaks a line into words, and emits (key,value)
pairs in the form of (word,1). (If a word appears several times in a line,
there would be several pairs emitted for that word.) In the Reduce stage,
all those 1s for a given word are summed, giving a frequency count for that
word. In this way, we get counts for all words.
Here is the mapper code:
#!/usr/bin/env Rscript
# wordmapper.R
si <- file("stdin", open="r")
while(length(inln <-
      scan(si, what="", nlines=1, quiet=TRUE,
           blank.lines.skip=FALSE))) {
   for (w in inln) cat(w, "\t1\n")
}
#!/usr/bin/env Rscript
# wordreducer.R
si <- file("stdin", open="r")
oldword <- ""
The above code is not very refined, for instance treating The as different
from the. The main point, though, is just to illustrate the principles.
I ran the following commands from the top level of the Hadoop directory tree,
with obvious modifications possible for other run points. First, I needed to
place my input file into the HDFS.
That first command specifies that it will be for the file system (fs), and
that I am placing a file in that system. My ordinary version of the file was
in my home directory.
Hadoop, being Java-based, runs Java archive, .jar files. So, the second
command above specifies that I want to run in streaming mode. It also
states that I want the output to go to a file wordcountsnyt in my HDFS
system. Finally, I specify my mapper and reducer code files in my HDFS
system, after which I ran the program. I allowed Hadoop to use its default
values for the number of mappers and reducers, and could have specified
them above if desired.
Recall that the final output comes in chunks in the HDFS. Here’s how to
check (some material not shown), and to view the actual file contents:
$ bin/hadoop fs -ls wordcountsnyt
Found 3 items
-rw-r--r--  1 ... /user/matloff/wordcountsnyt/_SUCCESS
drwxr-xr-x  - ... /user/matloff/wordcountsnyt/_logs
-rw-r--r--  1 ... /user/matloff/wordcountsnyt/part-00000
$ bin/hadoop fs -cat wordcountsnyt/part-00000
1          NA
$2         1
1,600      1
18th       1
1991,      1
1996,      1
2009       1
2009,      1
250,000    1
6,         1
7,         1
A          1
ASHLEE     1
According  1
America,   1
Analysts   2
Anne       1
Another    1
Apache,    1
Are        1
At         1
...

1 The file was the contents of the article "Data Analysts Captivated by R's Power,"
So, in HDFS, one distributed file is stored as a directory, with the chunks
in part-00000, part-00001 and so on. We only had enough data to fill
one chunk here.
The first thing to notice is that these two R files are not executed directly
by R, but instead under Rscript. This is standard for running R in batch,
i.e., noninteractive, mode.
Next, as noted earlier, input to the mappers is from STDIN, in this case
from the redirected file rnyt in my HDFS, seen here in the identifier si in
the mapper. Final output is to STDOUT, redirected to the specified HDFS
file, hence the call to cat() in the reducer. The mapper output goes to the
shuffle and then to the reducers, again using STDOUT, visible here in the
call to cat() in the mapper code.
The reader might be wondering here about the line
count <- count + as.integer(inln[2])
in the reducer. The way things have been described so far, it would seem
that the expression as.integer(inln[2]) — a word count output from a
mapper — should always be 1. However, there is more to the story, as
Hadoop also allows one to specify combiner code, as follows.
Remember, all the communication, e.g. from the mappers to the shuffler,
is via our network, with one (key,value) pair per network message. So,
we may have an enormous number of short messages, thus incurring the
network latency penalty many times, as well as huge network congestion
due to, for instance, many mappers trying to use the network at once. The
solution is to have each mapper try to coalesce its messages before sending
to the shuffler.
The coalescing is done by a combiner specified by the user. Often the
combiner will be the same as the reducer. So, what occurs is that each
mapper will run the reducer on its own mapper output, then send the
combiner output to the shuffler, after which it goes to the reducers as
usual.
Thus in our word count example here, when a line arrives at a reducer, its
count field may already have a value greater than 1. The combiner code,
by the way, is specified via the -combiner field in the run command, like
-mapper and -reducer.
As noted, Hadoop has its own file system, HDFS, which is built on top of
the native OS’ file system of the machines. It is replicated for the sake of
reliability, with each HDFS block existing in at least 3 copies, i.e., on at
least 3 separate disks. Very large files are possible, in some cases spanning
more than one disk/machine.
Disk files play a major role in Hadoop programs:
• The output of the mappers goes to temporary files in the native OS’
file system.
Note that by having the input and output files in HDFS, we minimize com-
munications costs in shipping the data between nodes of a cluster. The
slogan used is “Moving computation is cheaper than moving data.” How-
ever, all that disk activity can be quite costly in terms of run time.
As of late 2014, there has been increasing concern regarding Hadoop's per-
formance. One of the problems is that one cannot keep intermediate results
in memory; they must be written to disk and read back in, which can be
quite costly.
So, what does Hadoop really give us? The two main features are (a) dis-
tributed data access and (b) an efficient distributed file sort. Hadoop works
well for many applications, but a realization developed that Hadoop can be
very slow, and very limited in available data operations.
Both of those shortcomings are addressed to a large extent by the new
kid on the block, Spark. Spark is apparently much faster than Hadoop,
sometimes dramatically so, due to strong caching ability and a wider variety
of available operations. Recently distributedR has also been released,
again with the goal of using R on voluminous data sets, and there is also
the more established pbdR.
But even Spark suffers a very practical problem, shared by the others men-
tioned above. All of these systems are complicated. There is a considerable
amount of configuration to do, worsened by dependence on infrastructure
software such as Java or MPI, and in some cases by interface software such
as rJava. Some of this requires systems knowledge that many R users may
lack. And once they do get these systems set up, they may be required
to design algorithms with world views quite different from R, even though
they are coding in R.
So, do we really need all that complicated machinery? Hadoop and Spark
provide efficient distributed sort operations, but if one’s application does
not depend on sorting, we have a cost-benefit issue here.
Let’s use as our example word count, the “Hello World” of MapReduce.
As noted earlier, word count determines which words are in a text file, and
calculates frequency counts for each distinct word:
# each node executes this function
wordcensus <- function(basename, ndigs) {
   fname <- filechunkname(basename, ndigs)
   words <- scan(fname, what="")
   tapply(words, words, length, simplify=FALSE)
}

# manager
fullwordcount <- function(cls, basename, ndigs) {
   setclsinfo(cls)  # give workers ID numbers, etc.
   counts <-
      clusterCall(cls, wordcensus, basename, ndigs)
   addlistssum <- function(lst1, lst2)
      addlists(lst1, lst2, sum)
   Reduce(addlistssum, counts)
}
The above code makes use of the following routines, which are general and
are used in many “Snowdoop” applications. These and other Snowdoop
utilities are included in my partools package (Section 3.5). Here are the
call forms:
# give each cluster node an ID, in a global partoolsenv$myid;
# number of workers in partoolsenv$ncls; at each worker, load
# partools and set the R search path to that of the manager
setclsinfo(cls)

# "add" R lists lst1, lst2, applying the operation 'add' to
# elements in common, copying non-null others
addlists(lst1, lst2, add)
# arguments:
#
#    xname: name of the chunked data (typically read in
#           earlier from a chunked file)
#    niters: number of iterations to perform
#    nclus: number of clusters to form
#    ctrs: the matrix of initial centroids

# assumes:
#
#    setclsinfo() already called
#    a cluster will never become empty
2 Note that neither Hadoop, Spark nor Snowdoop will achieve full parallel reading if the file chunks all reside on the same disk.
test <- function(cls) {
   m <- matrix(c(4,1,4,6,3,2,6,6), ncol=2)
   formrowchunks(cls, m, "m")
   initc <- rbind(c(2,2), c(3,5))
   kmeans(cls, "m", 1, initc)
}
Chapter 10

Sorting
One of the most common types of computation is sorting, the main subject
of this chapter. A related topic is merging, meaning to combine two sorted
vectors into one large sorted vector.
Sorting is not an embarrassingly parallel operation (Section 2.11). Accord-
ingly, many different types of parallel sorts have been invented. We’ll cover
some introductory material here.
Each of these algorithms can be described simply, and outlines of them will
be shown here; details may be found on a myriad of websites, including
specific implementations in C or other languages.
Here we assume we have an R vector x of length n to be sorted. For
simplicity, assume the n values are distinct.
The recursive nature of the algorithm is seen in the fact that the
function qs() calls itself! At first, that might seem like magic, but
it really makes sense when one thinks about the way this is handled
internally. Interested readers may wish to review from Section 4.1.2,
and ponder how recursion is implemented internally. Note that re-
cursion actually is not very efficient in R, though it can be used well
in C/C++. Full C code will be presented in Section 10.4.
Quicksort is optimal in principle, but even a serial implementation
must be done very carefully to achieve good efficiency. In parallel,
all the horrors discussed in Chapter 2—memory contention, cache
coherency, network overhead and so on—arise with a vengeance.
A variant, Hyperquicksort, was developed for hypercube network topolo-
gies, but is applicable generally, especially for distributed data. It is
discussed in Section 10.7.1.
• Mergesort

Say n is 1000000 and we have four threads. We could assign x[1]
through x[250000] to the first thread, x[250001] through x[500000]
to the second thread, and so on. Each thread then does a local sort
of its chunk, and then all the sorted chunks are merged.
So far, this is embarrassingly parallel. But the merge phase isn't.
The latter is typically done in a tree-like manner. In our four-thread
example above, say, we could have threads 1 and 2 merge their re-
spective chunks, giving the result to thread 1, and at the same time
have threads 3 and 4 do the same, giving this result to thread 3. Then
threads 1 and 3 would merge their new chunks, and the vector would
now be sorted.
In this manner, the merge phase would take O(log2 n) steps, since
at each step we are halving the number of active threads. At each
step, all n elements of the vector get touched in some way, so as
with Quicksort, the total time complexity is O(n log n). By the way,
Thrust does include a merge function, thrust::merge(), which merges
two sorted vectors x and y (specified via their begin/end iterators) and
places the result in z.
The major drawback of mergesorts is that many threads become idle
at various points in the merge phase, thus robbing the process of
parallelism.
• Bucket Sort
This algorithm, sometimes called sample sort, is like a one-level Quick-
sort. Say we have three threads and 90000 numbers to sort. We could
first take a small subsample, say of size 1000, and then (if we are in
R) call quantile() to determine where the 0.33 and 0.67 quantiles
are, which I’ll call b and c. Thread 1 then handles all the numbers
less than b, thread 2 handles the ones between b and c, and thread
3 handles the ones bigger than c. Each sorts its own group locally,
then places the result back in the proper place in the vector. If for
instance, thread 1 has 29532 numbers to process, it places the result
of its sort in x[1] through x[29532], and so on.
   # (could also use quantile() here)
   k <- floor(nsamp / ncores)
   if (me > 1) mylo <- samp[(me-1) * k + 1]
   if (me < ncores) myhi <- samp[me * k]
   if (me == 1) myx <- x[x <= myhi] else
      if (me == ncores) myx <- x[x > mylo] else
         myx <- x[x > mylo & x <= myhi]
   # this thread now sorts its assigned group
   sort(myx)
   }
   res <- mclapply(1:ncores, dowork, mc.cores=ncores)
   # string the results together
   c(unlist(res))
}

test <- function(n, ncores) {
   x <- runif(n)
   mcbsort(x, ncores=ncores, nsamp=1000)
}
// exchange the elements pointed to by yi and yj
void swap(int *yi, int *yj)
{  int tmp = *yi;
   *yi = *yj;
   *yj = tmp;
}

// consider the section of x from x[low] to x[high], comparing
// each element to the pivot, x[low]; keep shuffling this section
// of x until, for some m, all the elements to the left of x[m]
// are <= the pivot, and all the ones to the right are >= the pivot
int separate(int *x, int low, int high)
{  int i, pivot, m;
   pivot = x[low];
   swap(x+low, x+high);
   m = low;
   for (i = low; i < high; i++) {
      if (x[i] <= pivot) {
         swap(x+m, x+i);
         m += 1;
      }
   }
   swap(x+m, x+high);
   return m;
}

// quicksort of the array z, elements zstart through zend; set the
// latter to 0 and n-1 in the first call, where n is the length of
// z; firstcall is 1 or 0, according to whether this is the first
// of the recursive calls
void qs(int *z, int zstart, int zend, int firstcall)
{
#pragma omp parallel
   {  int part;
      if (firstcall == 1) {
#pragma omp single nowait
         qs(z, 0, zend, 0);
      } else {
         if (zstart < zend) {
            part = separate(z, zstart, zend);
#pragma omp task
            qs(z, zstart, part-1, 0);
#pragma omp task
            qs(z, part+1, zend, 0);
         }
      }
   }
}

// test code
The code
if (firstcall == 1) {
#pragma omp single nowait
   qs(z, 0, zend, 0);
gets things going. We want only one thread to execute the root of the
recursion tree, hence the need for the single clause. The other threads will
have nothing to do this round, but the root call sets up two new calls, each
of which will again encounter the omp parallel pragma and the code
part = separate(z, zstart, zend);
#pragma omp task
qs(z, zstart, part-1, 0);
Here the task directive states, “OMP system, please make sure that this
subtree is handled by some thread eventually.” If there are idle threads
available, then this new task will be started immediately by one of them;
otherwise, it’s a promise to come back later.
Thus during execution, we first use one thread, then two, then three and so
on until all threads are busy. In other words, there will be something of a
load balance issue near the beginning of execution, just as we noted earlier
for Mergesort.
There are various possible refinements, such as the barrier-like taskwait
clause.
And remember, the Bucket Sort above was for multicore platforms. A major
bonus of Rth is that we can run our code on either multicore machines or
GPUs.
Let’s test it first:
> library(Rth)
Loading required package: Rcpp
> x <- runif(25)
> x
 [1] 0.90189818 0.68357514 0.93200351 0.41806736 0.40033254 0.09879424
 [7] 0.70001364 0.01025429 0.30682519 0.74398691 0.04592790 0.57226260
[13] 0.66428642 0.14953737 0.30014257 0.92142903 0.99587218 0.16254603
[19] 0.36737230 0.46898850 0.76138804 0.67405064 0.15926002 0.19043531
[25] 0.81125042
> rthsort(x)
 [1] 0.01025429 0.04592790 0.09879424 0.14953737 0.15926002 0.16254603
 [7] 0.19043531 0.30014257 0.30682519 0.36737230 0.40033254 0.41806736
[13] 0.46898850 0.57226260 0.66428642 0.67405064 0.68357514 0.70001364
[19] 0.74398691 0.76138804 0.81125042 0.90189818 0.92142903 0.93200351
[25] 0.99587218
// set up device vector and copy xa to it
thrust::device_vector<double> dx(xa.begin(), xa.end());
Note that R modes double and integer are treated separately, so the above
needs to be adapted for the latter mode.
An optional argument to rthsort(), decreasing, specifies a sort in descending order if TRUE. Connected with that is the use of the INTEGER() function. This is from the R internals API, not Rcpp, but it is similar to the latter.
We input a SEXP and form a C++ proxy for it. In this case, the input is
an integer vector (of length 1).1 The output of INTEGER() is a C++
integer vector, and we want the first (and only) element, which has index
0 in the C++ world.
This is then used in the statement
thrust::sort(dx.begin(), dx.end(), thrust::greater<int>());
The comparison functor thrust::greater<int>() returns true if x > y; given the way Thrust's sort function is set up, supplying it amounts to specifying a descending-order sort. But we could set up a custom sort, with our own comparison function tailored to our application.
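For instance, here is a small sketch (not part of the Rth code) of a custom comparison functor, with the name abs_less my own, that would sort doubles by absolute value:

#include <cmath>
#include <thrust/device_vector.h>
#include <thrust/sort.h>

// comparison functor: order by absolute value
struct abs_less {
   __host__ __device__
   bool operator()(double a, double b) const
   { return fabs(a) < fabs(b); }
};

void sort_by_abs(thrust::device_vector<double>& dx)
{
   thrust::sort(dx.begin(), dx.end(), abs_less());
}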
The fact that Rcpp and Thrust are both modeled after C++ STL makes
them rather similar to each other; learn one, and then learning the other is
much easier. Thus statements such as
thrust::copy(dx.begin(), dx.end(), xb.begin());
are straightforward.
called from R.
A run with n = 50000000 on a GeForce GTX 550 Ti GPU took only 1.243
seconds, less than a third of the 8-core run on the multicore machine. How-
ever, an attempt with n = 100000000 failed, due to the graphics card having
insufficient memory. For solutions, see Section 10.7.
Recall Chapter 9, in which it was explained that with really large data sets,
it may be desirable to store a file in chunks. Though it appears as a single
file to the user, the file is broken down into many separate files, under
different (typically numbered) file names, possibly on different disks and
maybe even in geographically separate locations. The Hadoop Distributed
File System exemplifies this approach.
In such a situation, our view of sorting is typically different from what we
have seen earlier in this chapter. Our input data may be in physically sep-
arate files, or possibly in memory but in separate machines, say in a cluster
— and we may wish our output to have the same distributed structure. In
other words, we would have our sorted array stored in chunked form, with
different chunks in different files or on different machines. In this section,
we take a look at how this might be done.
The issues of network latency and bandwidth, discussed in Section 2.5,
become especially acute in clusters and in distributed data sets. Since sorting is not an embarrassingly parallel operation, these problems are central to developing efficient parallel sorting in such contexts. This is very dependent on one's
specific communications context, and only some general suggestions will be
made here.
10.7.1 Hyperquicksort
When all the dust clears, the vector is in sorted form, though again in a
distributed manner across the processes.
Chapter 11

Prefix Scan
Prefix scan computes cumulative operations, like R’s cumsum() for cumu-
lative sums:
> x <- c(12, 5, 13)
> cumsum(x)
[1] 12 17 30
as we saw above.
In its general, abstract form, we have some associative operator, ⊗, and prefix scan inputs a sequence of objects (x0, ..., xn−1) and outputs (s0, ..., sn−1), where

s0 = x0,
s1 = x0 ⊗ x1,
s2 = x0 ⊗ x1 ⊗ x2,        (11.1)
...,
sn−1 = x0 ⊗ x1 ⊗ ... ⊗ xn−1
The operands xi need not be numbers. For instance, they could be matrices,
with ⊗ being matrix multiplication.
The form of scan used above is called inclusive scan, in which xi is included in si. The exclusive version omits xi. So for instance the exclusive cumulative sum vector in the little example above would be (0, 12, 17).
11.2 Applications
Prefix scan has become a popular tool with which to implement paral-
lel computing algorithms, applicable in a surprising variety of situations.
Consider for instance a parallel filter operation, like
> x
 [1] 19 24 22 47 27  8 28 39 23  4 43 11 49 45 43  2 13  8 50 41 24 13  7 14 38
> y <- x[x > 28]
> y
[1] 47 39 43 49 45 43 50 41 38
With an eye toward parallelizing this operation, let’s see how to cast it as
a prefix scan problem, as follows:
> b <- as.integer(x > 28)
> b
 [1] 0 0 0 1 0 0 0 1 0 0 1 0 1 1 1 0 0 0 1 1 0 0 0 0 1
> cumsum(b)
 [1] 0 0 0 1 1 1 1 2 2 2 3 3 4 5 6 6 6 6 7 8 8 8 8 8 9
Look at where the vector cumsum(b) changes values, namely at elements 4, 8, 11, 13, 14, 15, 19, 20 and 25. But these are precisely the elements of x that go into y, and the value of cumsum(b) at such an element tells us which slot of y that element occupies. So here cumsum(), a prefix scan operation, enabled the filtering operation; a sketch of the idea in code follows below. Thus, if we can find a way to parallelize prefix scan, we can parallelize filtering. (The creation of b above, and the operation of checking changed values in b, are embarrassingly parallel.)
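Here is a minimal serial C++ sketch of that idea (the threshold 28 mirrors the R example; std::inclusive_scan plays the role of the prefix scan, and in a parallel setting it would be replaced by a parallel scan):

#include <numeric>   // std::inclusive_scan (C++17)
#include <vector>

// keep the elements of x that exceed 28, using a prefix scan of the
// 0/1 indicator vector b to find each kept element's slot in y
std::vector<int> filter_via_scan(const std::vector<int>& x)
{
   int n = x.size();
   std::vector<int> b(n), pos(n);
   for (int i = 0; i < n; i++) b[i] = (x[i] > 28);        // EP step
   std::inclusive_scan(b.begin(), b.end(), pos.begin());  // the scan
   std::vector<int> y(n > 0 ? pos[n-1] : 0);
   for (int i = 0; i < n; i++)                            // EP scatter step
      if (b[i]) y[pos[i] - 1] = x[i];
   return y;
}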
And, surprisingly, that gives us an efficient way to parallelize quicksort.
The partitioning step — finding all elements less than the pivot and all
greater than it — is just two filters, after all. The first step in a bucket sort
is also a filter.
11.3 General Strategies
So, how can we parallelize prefix scan? Actually, there are pretty good
methods for this.
A common method for parallelizing prefix scan first works with adjacent
pairs of the xi , then pairs spaced two indices apart, then four, then eight,
and so on.
For the time being, we’ll assume we have n threads, i.e., one for each datum.
Clearly this condition will typically not hold, but we’ll extend things later.
Thread i will handle assignments to xi . Here’s the basic idea, say for n = 8:
Step 1:
x1 ← x0 + x1 (11.2)
x2 ← x1 + x2 (11.3)
x3 ← x2 + x3 (11.4)
x4 ← x3 + x4 (11.5)
x5 ← x4 + x5 (11.6)
x6 ← x5 + x6 (11.7)
x7 ← x6 + x7 (11.8)
Step 2:
x2 ← x0 + x2 (11.9)
x3 ← x1 + x3 (11.10)
x4 ← x2 + x4 (11.11)
x5 ← x3 + x5 (11.12)
x6 ← x4 + x6 (11.13)
x7 ← x5 + x7 (11.14)
Step 3:
x4 ← x0 + x4 (11.15)
x5 ← x1 + x5 (11.16)
x6 ← x2 + x6 (11.17)
x7 ← x3 + x7 (11.18)
In Step 1, we look at elements that are 1 apart, then Step 2 considers the
ones that are 2 apart, then 4 for Step 3.
Why does this work? Consider how the contents of x7 evolve over time.
Let ai be the original xi , i = 0,1,...,n-1. Then here is x7 after the various
steps:
Step contents
1 a6 + a7
2 a4 + a5 + a6 + a7
3 a0 + a1 + a2 + a3 + a4 + a5 + a6 + a7
So in the end x7 will indeed contain what it should. Working through the
case i = 2 shows that x2 eventually contains a0 +a1 +a2 , again as it should.
Moreover, “eventually” comes early in this case, at the end of Step 2; this
will be an important issue below.
For general n, the routing is as follows. At Step i, each xj is routed both to itself and to xj+2^(i−1), for j ≥ 2^(i−1).
There will be log2 n steps if n is a power of 2; otherwise the number of steps is ⌈log2 n⌉.
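As a minimal OpenMP sketch of this doubling idea for a sum scan (using a second buffer at each step, so that old values are read and new ones written without the read/write hazards a literal in-place update would have):

#include <algorithm>
#include <vector>

// inclusive sum scan of x by repeated doubling of the stride d
void scan_doubling(std::vector<double>& x)
{
   int n = x.size();
   std::vector<double> tmp(n);
   for (int d = 1; d < n; d *= 2) {         // d = 1, 2, 4, ...
      #pragma omp parallel for
      for (int j = 0; j < n; j++)
         tmp[j] = (j >= d) ? x[j-d] + x[j] : x[j];
      std::swap(x, tmp);                    // new values become current
   }
}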
Note these important points:
• Again note the fact that as time goes on, more and more threads become idle; xi will not change after Step i at the latest, and typically earlier. Thus thread i will become idle, and load balancing is poor.
Now, what if n is greater than p, our number of threads, the typical case?
One approach would be to make the assignment of threads to data be
dynamic, reconfigured at each step. If at any given step we have k nonidle
xi , then we assign each thread to handle about k/p of the xi positions.
(a) Break the data into p chunks, one per thread.
(b) Have each thread compute the prefix scan for its chunk.
(c) Compute the prefix scan of the right-hand endpoints of the chunks.
(Actually, we need only the first p − 1.)
(d) Have each thread adjust its own prefix scan according to the result of
step (c).
Here’s pseudocode for an approach along these lines. Let Ti denote thread
i.
To illustrate, suppose we have the data

2 25 26 8 50 3 1 11 7 9 29 10
and wish to compute a sum scan, i.e., cumulative sums. Suppose we have
three threads. We break the data into three sections,
2 25 26 8 50 3 1 11 7 9 29 10
and then apply a scan to each section:
2 27 53 61 50 53 54 65 7 16 45 55
But we still don’t have the scan of the array overall. That 50, for instance,
should be 61+50 = 111 and the 53 should be 61+53 = 114. In other words,
61 must be added to that second section, (50,53,54,65), and 61+65 = 126
must be added to the third section, (7,16,45,55). This then is the last step,
yielding
2 27 53 61 111 114 115 126 133 142 171 181
#include <omp.h>
#include <Rcpp.h>

// prefix (cumulative) sums, chunked across OpenMP threads;
// input vector x, number of desired threads nth
RcppExport SEXP ompcumsum(SEXP x, SEXP nth)
{
   Rcpp::NumericVector xa(x);
   int nx = xa.size();
   // cumulative sums vector
   double csms[nx];
   int nthreads = INTEGER(nth)[0];
   omp_set_num_threads(nthreads);
   // per-chunk adjustment values
   double adj[nthreads];
   int chunksize = nx / nthreads;
   // output vector
   Rcpp::NumericVector csout(nx);
   #pragma omp parallel
   {
      int me = omp_get_thread_num();
      int mystart = me * chunksize,
          myend = mystart + chunksize - 1;
      if (me == nthreads-1) myend = nx - 1;
      int i;
      // do cumulative sums for my chunk
      double sum = 0;
      for (i = mystart; i <= myend; i++) {
         sum += xa[i];
         csms[i] = sum;
      }
      // find adjustment values
      //
      // first, make sure all the chunk cumsums are ready
      #pragma omp barrier
      // only one thread need compute and
      // accumulate the right-hand endpoints
      #pragma omp single
      {
         adj[0] = csms[chunksize - 1];
         if (nthreads > 2)
            for (i = 1; i < nthreads-1; i++) {
               adj[i] = adj[i-1] + csms[(i+1)*chunksize - 1];
            }
      }
      // implied barrier at the end of any 'single' pragma
      // do my adjustments
      double myadj;
      if (me == 0) myadj = 0;
      else myadj = adj[me-1];
      for (i = mystart; i <= myend; i++)
         csout[i] = csms[i] + myadj;
   }
   // implied barrier at the end of any 'parallel' pragma
   return csout;
}
Recall the discussion in Section 4.1.2, regarding how local variables are
usually stored in memory: Each thread is assigned space in memory called
a stack, in which local variables for a thread are stored. In our example
above, csms and csout are such variables.
The significance of this is that the operating system typically places a limit
on the size of a stack. Since our cumulative sum code is typically going
to run on very large vectors (otherwise the serial version, R’s cumsum(),
is fast enough), we run the risk of running out of stack space, causing an
execution error.
Typical OSs allow you to change the default stack size, in various ways; this will be done in the next section. However, it brings up the issue of whether we want to follow the basic R philosophy of avoiding side effects. In our setting here, if we were willing to violate that informal rule,
we could write the above code so that csout is one of the arguments to
ompcumsum(), rather than being its return value. As long as our actual
csout is a top-level variable, i.e., set at the > command-line level, it would
not be on a stack, hence would not cause stack issues.
I ran this code from a C shell, and first set a large stack space of over 4 billion bytes, to accommodate calculation of cumulative sums on an array of length 500 million:

% limit stacksize 4000m
Or, on any system, I could have set the stack size as one of the arguments
to gcc.
I tried 2 through 16 cores, spaced 2 apart, on the 16-core machine described
in this book’s Preface. In order to reduce sampling variation, I performed
three runs at each value of the number of cores. The results are shown in Figure 11.1.
For this particular sample size, adding more cores increases speed up to about 8 cores, after which there is no improvement.
By comparison, the median time for R's cumsum() in three runs was 10.553 seconds. So, even 2 cores yielded a speedup of much more than 2, reflecting
the fact that we are working purely at the C++ level.
11.6 Example: Moving Average

Given time series data x1, ..., xn and a window width w, the moving average at time i is

$$a_i = \frac{x_{i-w+1} + \ldots + x_i}{w} \qquad (11.19)$$
The goal is to address the question, “What has the recent trend been?” at
time i, i = w, ..., n.
// update function, to calculate the current moving
// average value from the previous one
struct minus_and_divide :
   public thrust::binary_function<double, double, double>
{
   double w;
   minus_and_divide(double w) : w(w) {}
   __host__ __device__
   double operator()(const double& a, const double& b) const
   { return (a - b) / w; }
};
#if RTH_OMP
   omp_set_num_threads(INT(nthreads));
#elif RTH_TBB
   tbb::task_scheduler_init init(INT(nthreads));
#endif
   // set up device vector and copy xa to it
   thrust::device_vector<double> dx(xa.begin(), xa.end());
   int xas = xa.size();
   if (xas < wa)
      return 0;
   // allocate device storage for cumulative sums,
   // and compute them
   thrust::device_vector<double> csums(xa.size() + 1);
   thrust::exclusive_scan(dx.begin(), dx.end(), csums.begin());
   // need one more sum at (actually past) the end
   csums[xas] = xa[xas-1] + csums[xas-1];
   // compute the moving averages, in parallel
   Rcpp::NumericVector xb(xas - wa + 1);
   thrust::transform(csums.begin() + wa, csums.end(),
      csums.begin(), xb.begin(), minus_and_divide(wa));
   return xb;
}
11.6.2 Algorithm
Again with the xi as inputs, it first computes the cumulative sums ci, using the Thrust function exclusive_scan():

thrust::exclusive_scan(dx.begin(), dx.end(), csums.begin());

Since an exclusive scan gives ci = x0 + ... + xi−1, each moving average is then just the difference of two of these sums divided by w, which is what the transform step in the code computes.
11.6.3 Performance
I ran the code on the 16-core multicore machine again. Since this machine
has hyperthread degree 2 (Section 1.4.5.2), it may continue to give speedups
through 32 threads. For each number of threads, the experiment consisted
of first running the function runmean() (“running mean,” i.e., moving
average) from the R package caTools in order to establish a baseline run
time:
> n <- 1500000000
I then ran rthma() for various numbers of cores. Note that I had to
choose between OpenMP and TBB, and first selected the former. Look at
the results in Figure 11.2.
Again owing to using C++, we already get a performance gain even with
a single core, compared to using runmean().
As the number of cores increases, we attain a modest improvement, up
through about 12 cores. After that it is flat, possibly with some deterio-
ration as seen for instance in Figure 2.2. A more thorough study would
require multiple run times at each value of the number of cores.
Well, what about TBB? Running rthma() with a TBB backend, and al-
lowing TBB to choose the number of cores for us, I had a run time of 34.233
seconds, slightly better than the best OpenMP time.
This is about triple speedup, very nice, though not commensurate with
the large number of cores in the GPU. Presumably the communications
overhead here was very much a limiting factor.
If you have a compiler that allows C++11 lambda functions, these can make
life much simpler for you if you use Thrust, TBB or anything that uses
functors. Let’s see how this would work with our C++ function rthma()
above (scroll down to “changed code”).
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <thrust/sequence.h>
#include <Rcpp.h>
#include "backend.h"  // from Rth
   thrust::device_vector<double> dx(xa.begin(), xa.end());
   int xas = xa.size();
   if (xas < wa) return 0;
   thrust::device_vector<double> csums(xa.size() + 1);
   thrust::exclusive_scan(dx.begin(), dx.end(), csums.begin());
   csums[xas] = xa[xas-1] + csums[xas-1];
   Rcpp::NumericVector xb(xas - wa + 1);
   // changed code
   thrust::transform(csums.begin() + wa, csums.end(),
      csums.begin(), xb.begin(),
      // lambda function
      [=](double& a, double& b) { return ((a-b)/wa); });
   return xb;
}
We are creating a function object here, as we did in the earlier version with
an operator() notation within a struct. But here we are doing so right
on the spot, in one of the arguments to thrust::transform(). It’s similar
to the concept of anonymous functions in R. Here are the details:
• The brackets in [=] tell the compiler that a function object is about to begin. (The = sign indicates how variables from the surrounding scope are captured; see the small example below.)
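As a tiny standalone illustration (not from the book's code) of what [=] does: it captures variables from the enclosing scope by value, so the lambda below gets its own copy of w, much as the functor version stored w as a member:

#include <iostream>

int main()
{
   double w = 3.0;
   // [=] captures w by value: the lambda carries its own copy of w
   auto scale = [=](double a, double b) { return (a - b) / w; };
   std::cout << scale(10.0, 4.0) << "\n";   // prints 2
   return 0;
}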
Chapter 12

Parallel Matrix Operations
The matrix is one of the core types in the R language, so much so that there
is even an entire book titled Hands-on Matrix Algebra Using R (Hrishikesh
Vinod, World Scientific, 2011). And in recent years, the range of applica-
tions of matrices has expanded, from the traditional areas such as regression
analysis and principal components to image processing and the analysis of
random graphs in social networks.
In modern applications, matrices are often huge, with sizes of tens of thou-
sands of rows, and far more, being commonplace. Thus there is a large
demand for parallel matrix algorithms, the subject of this chapter.
There is a plethora of parallel matrix software available to R programmers.
For GPU, for instance, there are R packages gputools, gmatrix (this one
especially noteworthy because it enables us to follow the Principle of “Leave
It There,” Section 2.5.2), MAGMA and so on. Indeed, more and more
of them are being developed. See also PLASMA and HiPLAR, as well as pbdDMAT.
Thus this chapter cannot provide comprehensive coverage of everything
that is out there for parallel linear algebra in R. In addition, in some special
cases the libraries don’t quite do what we need. Thus, our coverage will
consist mainly of basic, general principles, with examples using some of the
libraries. Selection of libraries for the examples will be based mainly on
ease of installation and use.
$$A = \begin{pmatrix} 1 & 5 & 12 \\ 0 & 3 & 6 \\ 4 & 8 & 2 \end{pmatrix} \qquad (12.1)$$

and

$$B = \begin{pmatrix} 0 & 2 & 5 \\ 0 & 9 & 10 \\ 1 & 1 & 2 \end{pmatrix}, \qquad (12.2)$$

so that

$$C = AB = \begin{pmatrix} 12 & 59 & 79 \\ 6 & 33 & 42 \\ 2 & 82 & 104 \end{pmatrix}. \qquad (12.3)$$
We could partition A as

$$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}, \qquad (12.4)$$

where

$$A_{11} = \begin{pmatrix} 1 & 5 \\ 0 & 3 \end{pmatrix}, \qquad (12.5)$$

$$A_{12} = \begin{pmatrix} 12 \\ 6 \end{pmatrix}, \qquad (12.6)$$

$$A_{21} = \begin{pmatrix} 4 & 8 \end{pmatrix} \qquad (12.7)$$

and

$$A_{22} = \begin{pmatrix} 2 \end{pmatrix}. \qquad (12.8)$$
$$B = \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix} \qquad (12.9)$$

and

$$C = \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix}, \qquad (12.10)$$

with, for example,

$$B_{21} = \begin{pmatrix} 1 & 1 \end{pmatrix}. \qquad (12.11)$$
The key point is that multiplication still works if we pretend that those submatrices are numbers! The matrices A, B and C would all be thought of as being of "2 × 2" size, in which case we would have, for example,

$$C_{11} = A_{11} B_{11} + A_{12} B_{21}. \qquad (12.12)$$
(In that expression on the right, we are still treating A11 etc. as matrices,
even though we got the equation by pretending they were numbers.)
The reader should verify this really is correct as matrices, i.e., the compu-
tation on the right side really does yield a matrix equal to C11 . You can
see now what kind of compatibility is needed. For instance, the number of
rows in B11 must match the number of columns in A11 and so on.
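As a worked check with the numbers above (taking B to be partitioned the same way as A, so that B11 is its upper-left 2 × 2 block):

$$C_{11} = A_{11}B_{11} + A_{12}B_{21} = \begin{pmatrix} 1 & 5 \\ 0 & 3 \end{pmatrix}\begin{pmatrix} 0 & 2 \\ 0 & 9 \end{pmatrix} + \begin{pmatrix} 12 \\ 6 \end{pmatrix}\begin{pmatrix} 1 & 1 \end{pmatrix} = \begin{pmatrix} 0 & 47 \\ 0 & 27 \end{pmatrix} + \begin{pmatrix} 12 & 12 \\ 6 & 6 \end{pmatrix} = \begin{pmatrix} 12 & 59 \\ 6 & 33 \end{pmatrix},$$

which indeed equals the upper-left 2 × 2 block of C in (12.3).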
process A in chunked form. Note that this is a special case of tiling. Here
we will see how to deal with this, using the GPU case for illustration.
Consider computation of a product Ax, where A is a matrix and x is a
conformable vector. Suppose A is too large to fit in our GPU. Our strategy
will be simple: Break A into tiles of rows, multiply x by each tile of A, and
concatenate the results to yield Ax, as we did in Section 1.4.4. The code is
equally simple, say with gputools (Section 6.7.1):

# GPUTiling.R
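The script itself is not reproduced here. As a rough sketch of the same tiling logic, written in C++ rather than R, with gpu_matvec() standing in for whatever accelerated matrix-vector routine is available (here it is just a plain CPU loop so the sketch is self-contained):

#include <algorithm>
#include <vector>
using std::vector;

// stand-in for an accelerated (e.g. GPU) matrix-vector product on one
// tile; ablock is an nr x nc block of A, stored row-major
static vector<double> gpu_matvec(const double *ablock, int nr, int nc,
                                 const vector<double>& x)
{
   vector<double> y(nr, 0.0);
   for (int i = 0; i < nr; i++)
      for (int j = 0; j < nc; j++)
         y[i] += ablock[(size_t) i * nc + j] * x[j];
   return y;
}

// compute Ax by breaking A (n x nc, row-major) into tiles of rows,
// multiplying x by each tile, and concatenating the results
vector<double> tiled_matvec(const vector<double>& a, int n, int nc,
                            const vector<double>& x, int tilerows)
{
   vector<double> result;
   for (int r = 0; r < n; r += tilerows) {
      int nr = std::min(tilerows, n - r);   // rows in this tile
      vector<double> part = gpu_matvec(&a[(size_t) r * nc], nr, nc, x);
      result.insert(result.end(), part.begin(), part.end());
   }
   return result;
}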
Note that in this (single) GPU context, although the work within a tile
is done in parallel, it is still serial between tiles. Moreover, we incur the
overhead of communication between the CPU and GPU. This may hamper
our efforts to achieve speedup.
On the other hand, if we have multiple GPUs, or if we have a Snowdoop
setting, this partitioning could be a win.
Recall the concept of breaking files into chunks, such as with the Hadoop
Distributed File System. Typical algorithms for the message-passing setting
assume that the matrices A and B in the product AB are stored in a
distributed manner across the nodes, in a cluster, and that the product will
be distributed too.
In addition to formal MapReduce settings, this situation could arise for
several reasons:
• The application is such that it is natural for each node to possess only
part of A and B.
• One node, say node 0, originally contains all of A and B, but in order
to conserve network communication time, it sends each node only
parts of those matrices.
• An entire matrix would not fit in the available memory at the indi-
vidual nodes.
Consider the node that has the responsibility for calculating block (i,j) of
the product C, which it calculates as
Cij = Ai1 B1j + Ai2 B2j + ... + Aii Bij + ... + Ai,m Bm,j (12.13)
It will be convenient here (only in this section) to number rows and columns
starting at 0 instead of 1, as the code uses the mod operator based on that
setting. Now rearrange (12.13) with Aii first:
Aii Bij +Ai,i+1 Bi+1,j +...+Ai,m−1 Bm−1,j +Ai0 B0j +Ai1 B1j +...+Ai,i−1 Bi−1,j
(12.14)
The algorithm is then as follows. The node which is handling the compu-
tation of Cij does this (in parallel with the other nodes which are working
with their own values of i and j):
The main idea is to have the various computational nodes repeatedly ex-
change submatrices with each other, timed so that a node receives the
submatrix it needs for its computation “just in time.”
The algorithm can be adapted in the obvious way to nonsquare matrices,
etc.
This parallelizes the outer loop. Let p denote the number of threads. Since
each value of i in that loop handles one row of A, what we are doing here
is breaking the matrix into p sets of rows of A. A thread computes the
product of a row of A and all of B, which becomes the corresponding row
of C.
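The OpenMP loop being discussed is not shown above; a minimal sketch of that style of parallelization, for the product C = AB with n × n row-major matrices, might look like this:

#include <omp.h>

// c = a * b, all n x n and stored row-major;
// the outer loop over rows of a (and of c) is split among the threads
void matmul(const double *a, const double *b, double *c, int n)
{
   #pragma omp parallel for
   for (int i = 0; i < n; i++) {       // one row of A per iteration
      for (int j = 0; j < n; j++) {
         double sum = 0;
         for (int k = 0; k < n; k++)
            sum += a[i*n + k] * b[k*n + j];
         c[i*n + j] = sum;
      }
   }
}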
Note, though, that from the discussion on page 122 we know the sets of
rows are not necessarily “chunks;” the rows of A processed by a given thread
may not be contiguous. We will return to this point shortly.
First and foremost, cache effects must be considered. (The reader may wish
to review Sections 2.3.1 and 5.8 before continuing.) Suppose for the moment
that we are working purely in C/C++, which uses row-major storage.
This should produce a good speedup. But we can do even better, much
better, as discussed below.
Here A and B are in the GPU global memory, and we copy chunks of them
into shared-memory arrays As and Bs. Since the code has been designed
so that As and Bs are accessed frequently during a certain period of the
execution, it is worth incurring the copying delay to exploit the fast shared
memory.
Fortunately, the authors of the CUBLAS library have already done all the
worrying about such matters for you. They have written very finely hand-
tuned code for good performance, making good use of GPU shared memory
and so on.
12.4.1 Overview
Thus even in ordinary serial computation, e.g. R’s %*% operator, the speed
of matrix operations may vary according to the version of BLAS that your
implementation of R was built with.
In our context here of parallel computation, one version of BLAS of spe-
cial interest is OpenBLAS, a multithreaded version that can bring perfor-
mance gains on multicore machines, and also includes various efficiencies
that greatly improve performance even in the serial case. We’ll take a closer
look at it in Section 12.5.
As noted, there is also CUBLAS, a version of BLAS for NVIDIA GPUs,
highly tailored to that platform. A number of R packages, such as gputools
and gmatrix, make use of this library, and of course you can write your
own special-purpose R interfaces to it, say using Rcpp to interface it to R.
For message-passing systems (clusters or multicore), there is PBLAS, de-
signed to run on top of MPI. The MAGMA library, with the R interface
magma, aims to obtain good performance on hybrid platforms, meaning
multicore systems that also have GPUs.
BLAS libraries also form the basis of libraries that perform more advanced
matrix operations, such as matrix inversion and eigenvalue computation.
LAPACK is a widely-used package for this, as is ScaLAPACK for PBLAS.
So, now whenever I run R, it loads OpenBLAS instead of the default BLAS.
OpenBLAS is a threaded application. It doesn’t use OpenMP to man-
age threads, but it does allow one to set the number of threads using the
OpenMP environment variable, e.g. in the bash shell,
$ export OMP_NUM_THREADS=2
If the number of threads is not set, OpenBLAS will use all available cores.
Note, though, that this may not be optimal, as we will see later.
It is important to note that that is all I had to do. From that point on, if I
wanted to compute the matrix product AB in parallel, I just used ordinary
R syntax:
> c <- a %*% b
12.6 Example: Graph Connectedness

Let n denote the number of vertices in the graph. As before, define the
graph’s adjacency matrix A to be the n × n matrix whose element (i, j) is
equal to 1 if there is an edge connecting vertices i and j (i.e., i and j are
“adjacent”), and 0 otherwise. Let R(k) denote the matrix consisting of 1s
and 0s, with a 1 in element (i, j) signifying that we can reach j from i in k
steps. (Note that k is a superscript, not an exponent.)
Also, our main goal will be to compute the corresponding reachability ma-
trix R, whose (i, j) element is 1 or 0, depending on whether one can reach j
in some multistep path from i. In particular, we are interested in determin-
ing whether the graph is connected, meaning that every vertex eventually
leads to every other vertex, which will be true if and only if R consists of
all 1s. Let us consider the relationship between the R(k) and R.
12.6.1 Analysis
$$R = b\left( \sum_{k=1}^{n-1} R^{(k)} \right) \qquad (12.15)$$
output, in this case in the principal eigenvalue. In such a setting, roundoff error could
be quite serious.
So, if we calculate all the R(k) , then we have R. But there is a trick we can
employ to avoid finding them all.
To see this, suppose there exists a path of 16 steps from vertex 8 to vertex
3. If the graph is directed, meaning that A is not symmetric, then it may
be the case that there is no path from vertex 3 that eventually returns to
that node. If so, then
$$R^{(16)}_{83} = 1 \qquad (12.16)$$

but

$$R^{(k)}_{83} = 0, \quad k > 16. \qquad (12.17)$$
So it would seem that we do indeed need to calculate all the R(k) . But we
can actually avoid that, by first making a small modification to our graph:
We add artificial “edges” from every vertex to itself. In matrix terms, this
means setting the diagonal elements of A to 1.
Now, in our above example we have
$$R^{(k)}_{83} = 1, \quad k \ge 16, \qquad (12.18)$$
since, after step 16, we can keep going from vertex 3 to itself k − 16 times.
This saves us a lot of work: we need compute only R^(n−1). So, how can we do this? We need one last ingredient:

$$R^{(k)} = b(A^k), \qquad (12.19)$$

where b() is applied elementwise.
Why is this true? For instance, think of trying to go from vertex 2 to vertex
7 in two steps. Here are the possibilities:
• ...
$$R^{(2)}_{27} = b(a_{21} a_{17} + a_{22} a_{27} + a_{23} a_{37} + \ldots) \qquad (12.20)$$

The key point is that the argument to b() here is (A^2)_{27}, demonstrating (12.19).
So, the original graph connectivity problem reduces to a matrix power
problem. We simply compute An−1 (and then apply b()).
Moreover, we can save computation there as well, by using the “log trick.”
Say we want the 16th power of some matrix B. We can square it, yielding B^2, then square that, yielding B^4. Two more squarings then give us B^8 and B^16. That would be only four matrix-multiply operations, instead of 15. In general, for power 2^m, we need m steps.
In our graph connectivity setting, we need power n − 1, but we might as well go for power 2^⌈log2(n−1)⌉, the smallest power of 2 that is at least n − 1.
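Here is a minimal serial sketch of the log trick for square matrices; matmul() below is just a plain triple loop, and any of the parallel multiplication methods discussed in this chapter could be substituted for it:

#include <vector>
using Mat = std::vector<double>;   // n x n matrix, stored row-major

// c = a * b, both n x n and row-major
static Mat matmul(const Mat& a, const Mat& b, int n)
{
   Mat c((size_t) n * n, 0.0);
   for (int i = 0; i < n; i++)
      for (int k = 0; k < n; k++)
         for (int j = 0; j < n; j++)
            c[i*n + j] += a[i*n + k] * b[k*n + j];
   return c;
}

// compute b^(2^m) by repeated squaring: m multiplications
// instead of 2^m - 1
Mat matpow2m(Mat b, int n, int m)
{
   for (int s = 0; s < m; s++)
      b = matmul(b, b, n);
   return b;
}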
Any parallel vehicle for matrix multiplication could be used for computing
the matrix powers, e.g. OpenBLAS or GPUs. In addition, my CRAN
package matpow, described in the next section, can be used to facilitate
the process.
By the way, one of this book’s internal reviewers raised the following ques-
tion: The matrix A likely can be diagonalized, i.e., a matrix C can be found
such that
A = C^(−1) D C, (12.22)

with D diagonal, in which case

A^k = C^(−1) D^k C. (12.23)
The matrix D^k is trivial to compute. So, if one has a parallel method for finding the eigenvalues of a matrix (which also gives us C), then this would be another parallel method for computing A^k.
However, the finding of eigenvalues is not an embarrassingly parallel oper-
ation. And in our graph connectivity application, we need exact numbers,
without roundoff.
Actually, one method for computing eigenvalues does involve matrix powers.
12.6.4.1 Features
where the xi are the unknowns to be solved for. Such systems occur often in
data science, such as in linear regression analysis, computation of Maximum
Likelihood Estimates, and so on.
Ax = b, (12.25)
A = LU
At present, R uses this method for its solve() function, which (by the user’s
choice) either solves Ax = b or simply finds A−1 .
All in all, row echelon form should save us some work, as we don’t have
to repeatedly work on the upper rows of the matrix as in the reduced row
echelon form.
With pivoting, the numerical stability with either form is good. But what
about parallelizability? This is easy in the reduced row echelon form, as we
can statically assign a group of rows to each thread. But in the row echelon
form, we have fewer and fewer rows to process as time goes on, making it
harder to keep all the threads busy.
$$x_i = \frac{1}{a_{ii}}\left[\, b_i - (a_{i1} x_1 + \ldots + a_{i,i-1} x_{i-1} + a_{i,i+1} x_{i+1} + \ldots + a_{in} x_n) \right], \quad i = 1, \ldots, n. \qquad (12.27)$$
This suggests a natural iterative algorithm for solving the equations: We
start with our guess being, say, xi = bi for all i. At our kth iteration, we
find our (k+1)st guess by plugging in our kth guess into the right-hand side
of (12.27). We keep iterating until the difference between successive guesses
is small enough to indicate convergence.
This algorithm is guaranteed to converge if each diagonal element of A is
larger in absolute value than the sum of the absolute values of the other
elements in its row. But it should work well if our initial guess is near the
true value of x.
That last condition may seem unrealistic, but in data science we have many
iterative algorithms that require a matrix inversion (or equivalent) at each
iteration, such as generalized linear models (glm() in R), multiparameter
Maximum Likelihood Estimation and so on. Jacobi may work well here for
updating the matrix inverse from one iteration to the next.
12.7.2.1 Parallelization
Rewriting (12.27) in matrix terms (see the sketch below) reduces the problem to one of matrix multiplication, and thus we can parallelize the Jacobi algorithm by utilizing a method for doing parallel matrix multiplication.
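As a sketch of that matrix form (my own statement of the standard Jacobi update, with D denoting the diagonal part of A and R = A − D):

$$x^{(k+1)} = D^{-1}\left( b - R\, x^{(k)} \right),$$

so each iteration amounts to one matrix-vector multiplication by R, plus cheap elementwise operations, and it is that multiplication that we parallelize.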
12.7.4 QR Decomposition
A = QR
platform                time
built-in BLAS           107.408
GPU                      78.061
OpenBLAS, 1 thread        6.612
OpenBLAS, 2 threads       3.593
OpenBLAS, 4 threads       2.087
OpenBLAS, 6 threads       2.493
OpenBLAS, 8 threads       2.993
This is rather startling! Not only is OpenBLAS the clear champion here,
but in particular it bests the GPU. The latter is substantially better than
plain R, yes, but nowhere near what OpenBLAS gives us—even with just
one thread. GPUs are invaluable for embarrassingly parallel problems, but
obtaining excellent gains in other problems is difficult.
Also interesting is the pattern with respect to the number of threads. Af-
ter moving past four threads, performance deteriorated. This pattern was
confirmed in repeated trials, not shown here. This is not too surprising, as
there are only four actual cores on the machine, even though each one has
some limited ability to run two threads at once. To pursue this idea, these
experiments were run on the 16-core machine, again with a hyperthreading
degree of 2. Again performance appeared to saturate at about 16 cores.
$$\begin{pmatrix} 2 & 0 & 0 & 0 & 0 \\ 1 & 1 & 8 & 0 & 0 \\ 0 & 1 & 5 & 8 & 0 \\ 0 & 0 & 0 & 8 & 8 \\ 0 & 0 & 0 & 3 & 5 \end{pmatrix} \qquad (12.29)$$
Code to deal with such matrices can then access the nonzero elements based on this knowledge; a small sketch appears after the example below.
In the second category, each matrix that our code handles will typically
have its nonzero entries in different, “random,” positions, as in the market
basket example. A number of methods have been developed for storing
amorphous sparse matrices, such as the Compressed Sparse Row format,
which we’ll code in this C struct, representing an m × n matrix A, with k
nonzero entries, for concrete illustration:
struct {
int m,n; // numbers of rows and columns of A
// the nonzero values of A, in row-major order
float *avals;
int *cols; // avals[i] is in column cols[i] in A; length k
int *rowplaces; // rowplaces[i] is the index in avals for
// the 1st nonzero element of row i in A
// (but the last element is k)
}
Since we’re expressing matters in C, our row and column indices will begin
at 0.
For the matrix in (12.29) (if we were not to exploit its tridiagonal nature,
and just treat it as amorphous):
• m,n: 5,5
• avals: 2,1,1,8,1,5,8,8,8,3,5
• cols: 0,0,1,2,1,2,3,3,4,3,4
• rowplaces: 0,1,4,7,9,11
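For instance, here is a sketch of a matrix-vector product y = Ax using this layout; the type name csr_t is my own (the struct above was anonymous), and only the stored nonzeros are touched:

typedef struct {
   int m, n;        // numbers of rows and columns of A
   float *avals;    // the nonzero values of A, in row-major order
   int *cols;       // avals[i] is in column cols[i] in A
   int *rowplaces;  // index in avals of the 1st nonzero of each row;
                    // rowplaces[m] is k, the total number of nonzeros
} csr_t;

// y = A x, using only the stored nonzeros
void csr_matvec(const csr_t *a, const float *x, float *y)
{
   for (int i = 0; i < a->m; i++) {
      float sum = 0;
      // nonzeros of row i occupy avals[rowplaces[i] .. rowplaces[i+1]-1]
      for (int j = a->rowplaces[i]; j < a->rowplaces[i+1]; j++)
         sum += a->avals[j] * x[a->cols[j]];
      y[i] = sum;
   }
}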
Chapter 13

Inherently Statistical Approaches: Subset Methods
A recurrent theme in this book has been that it is easy to speed up embar-
rassingly parallel (EP) applications, but the rest can be quite a challenge.
Fortunately, there exist methods for converting many non-EP statistical
problems to EP ones that are equivalent or reasonable substitutes.
I call this software alchemy. Such methods will be presented in this chapter,
with the main focus being on one method in particular, Chunk Averaging
(CA). Brief overviews will also be given of two other methods.
Let’s set some notation. It will be helpful to have at hand the concept of a
“typical rectangular data matrix,” with n rows, i.e., n observations, and p
columns, that is, p variables. Also, suppose we are estimating a population
parameter θ, possibly vector-valued. Our estimator for the full data set is denoted by θ̂, which we will often call the "full estimator."
CA has been treated in specialized form by various authors since 1999. The general form presented here is adapted from my own research, Software Alchemy: Turning Complex Statistical Computations into Embarrassingly-Parallel Ones.
(a) Break the data into r chunks of rows. The first k = n/r observations
comprise the first chunk, the next k observations form the second
chunk, and so on.
(b) Apply g() to each chunk.
(c) Average the r results obtained in Step (b), thus producing our CA
estimator θe of θ.
Note that typically the result of g() is a vector, so that we are averaging
vectors in Step (c).
If n is not evenly divisible by r, we can take a weighted average, with weights proportional to the chunk sizes. Specifically, let ni denote the size of chunk i, and let θ̂i be the estimate of θ on that chunk. Then the CA estimator is

$$\tilde{\theta} = \sum_{i=1}^{r} \frac{n_i}{n}\, \hat{\theta}_i \qquad (13.1)$$
It is assumed that the observations are i.i.d. Then the chunks are i.i.d. as well, so one can also obtain the standard error, or more generally, an estimated covariance matrix. Let Vi be that matrix for chunk i (obtained from the output of g()). Then the estimated covariance matrix for θ̃ is

$$\sum_{i=1}^{r} \left( \frac{n_i}{n} \right)^2 V_i \qquad (13.2)$$
Like many procedures in statistics, this one has a tuning parameter: r, the
number of groups.
statistical in nature, and its usefulness derives from the fact that the CA estimator θ̃ is statistically equivalent to the full estimator θ̂.
It is easily proven that if the data are i.i.d. and the full estimator is
asymptotically multivariate normal, then the CA method again produces an
asymptotically multivariate normal estimator. And most importantly,
the chunked estimator has the same asymptotic covariance matrix as the
original nonchunked one.
This last statement, coupled with the fact that Step (b) is an embarrassingly
parallel operation, implies that CA does indeed perform software alchemy—
it turns a non-EP problem into a statistically equivalent EP one.
Suppose the work associated with computing θ̂ is O(n^c), such as the O(n^2) figure for our mutual outlinks example (Section 2.9). If the r chunks are handled in parallel, CA reduces the time complexity of an O(n^c) problem to roughly O(n^c/r^c) for a statistically equivalent problem, whereas a speedup that is linear in r would only reduce the time to O(n^c/r).
If c > 1, then the speedup obtained from CA is greater than linear in
r, which is called superlinear. This is a term from the general parallel
processing field. When this occurs generally, the size of the effect is usually
small, and is due to cache effects and the like. But in our statistical context,
superlinearity will be commonplace, often with very large effects.
By the way, a similar analysis shows that CA can yield speedup even in the serial case, in which the chunks are handled one at a time. The time here will be r · O(n^c/r^c) = O(n^c/r^(c−1)). So, for c > 1, CA may be faster than the full estimator even in uniprocessor settings.
Note too that CA can be helpful even in EP settings. Say our function g()
is part of an existing software package. Even if the underlying algorithm
is EP, recoding it for parallelization may be an elaborate undertaking. CA
then gives us a quick, simple way to exploit the EP nature of the estimator.
This will be illustrated in Section 13.1.4.3.
For some applications, though, our speedup may be more modest. As dis-
cussed in Section 3.4.1, linear (or nonlinear but still parametric) regression
models present challenges. We’ll see an example later in this section.
13.1.3 Code
The data set here is the famous forest groundcover data from the University
of California, Irvine Machine Learning Repository. Here we have seven
different kinds of ground cover, with covariates such as Hillside Shade at
Noon. The goal is to predict the type of ground cover based on remote
sensing of the covariates. There are 500,000 observations in the subset of the data analyzed here.
The timing experiments involved predicting Cover Type 1 from the first 10 covariates. (Since the raw data is ordered, a random permutation is applied first.) Here is the code and timing results for the full data set:
as seen in Figure 13.2. The speedup starts out linear, then becomes less
dramatic but still quite good, especially in light of the factors mentioned
earlier.
But what about the accuracy? The theory tells us that the CA estimator
is statistically equivalent to the full estimator, but this is based on asymp-
totics. (Though it should be noted that even the full estimator is based on
asymptotics, since glm() itself is so.) Let’s see how well it worked here.
Table 13.1 shows the values of the estimated coefficient for the first pre-
dictor variable, for the different numbers of cores (1 core again means the
full-estimator case):
cores    β̂₁
1        0.006424
2        0.006424
4        0.006424
8        0.006426
12       0.006427
16       0.006427

This is excellent agreement, even for the smallest chunk size.
For a continuous distribution with density f and cdf F , the hazard function
is defined as
$$h(t) = \frac{f(t)}{1 - F(t)}$$
5000 contain the data for the women. Suppose r = 2. Then the distribution
in the first chunk is different from that of the second, i.e., the chunks are
not identically distributed, and the CA theory doesn’t hold.
Thus, if the analyst knows or suspects that the arrangement of the data is
ordered in some way, he/she should first apply a random permutation to
the n records. If the matrix x contains the original data, then one might
run, say
x <- x[sample(1:n, n, replace=FALSE), ]
apply glm() to each pair of predictors. Then for each future data point,
we would generate k predictions for that point, and simply use a majority
rule to predict the new Y value. For instance, if the majority of our k
predictions guess the new Y to be 1, then that would be our overall guess.
Instead of using just pairs of predictors, we might use triples, or in general,
m predictors for each small model fitted.
The motivation generally put forth for this approach is Richard Bellman’s
notion of the curse of dimensionality, which asserts that prediction be-
comes inordinately difficult in very high dimensions, that is with a very
large number of predictors. We try to circumvent this by combining many
predictions, each one in a low-dimensional space.
However, in our context here, one can view the above boosting method
as a way to parallelize our operations. Consider linear regression analysis,
for example, with p predictors. As discussed in Section 3.4.1, for fixed
sample size, the work needed is O(p^3) or O(p^2), depending on the numerical
method used. This time complexity grows more than linearly with p, so
boosting may save us computation time. This is especially true since we fit
our k models in parallel, an embarrassingly parallel setting.
Note that unlike the CA and BLB methods, boosting does not yield a
statistically equivalent estimator. But it does save us time (and may reduce
the chance of overfitting, according to its proponents).
This method has two tuning parameters: k and m.
Appendix A

Matrix Review
This book assumes the reader has had a course in linear algebra (or has
self-studied it, always the better approach). This appendix is intended as a
review of basic matrix algebra, or a quick treatment for those lacking this
background.
If A is a square matrix, i.e., one with equal numbers n of rows and columns,
then its diagonal elements are aii , i = 1,...,n.
$$\| X \| = \sqrt{ \sum_{i=1}^{n} x_i^2 } \qquad (A.1)$$
• For two matrices having the same numbers of rows and columns, addition is defined elementwise, e.g.

$$\begin{pmatrix} 1 & 5 \\ 0 & 3 \\ 4 & 8 \end{pmatrix} + \begin{pmatrix} 6 & 2 \\ 0 & 1 \\ 4 & 0 \end{pmatrix} = \begin{pmatrix} 7 & 7 \\ 0 & 4 \\ 8 & 8 \end{pmatrix} \qquad (A.2)$$

• Scalar multiplication is also defined elementwise, e.g.

$$0.4 \begin{pmatrix} 7 & 7 \\ 0 & 4 \\ 8 & 8 \end{pmatrix} = \begin{pmatrix} 2.8 & 2.8 \\ 0 & 1.6 \\ 3.2 & 3.2 \end{pmatrix} \qquad (A.3)$$
$$\sum_{k=1}^{n} x_k y_k \qquad (A.4)$$

$$c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj} \qquad (A.5)$$
For instance,

$$\begin{pmatrix} 7 & 6 \\ 0 & 4 \\ 8 & 8 \end{pmatrix} \begin{pmatrix} 1 & 6 \\ 2 & 4 \end{pmatrix} = \begin{pmatrix} 19 & 66 \\ 8 & 16 \\ 24 & 80 \end{pmatrix} \qquad (A.6)$$
$$A(B + C) = AB + AC \qquad (A.9)$$

$$AB \ne BA \qquad (A.10)$$

• If A + B is defined, then

$$(A + B)' = A' + B' \qquad (A.12)$$

$$(AB)' = B' A' \qquad (A.13)$$
$$a_1 X_1 + \ldots + a_k X_k = 0 \qquad (A.14)$$
A.4 Determinants
$$\det(A) = \sum_{m=1}^{n} (-1)^{k+m}\, a_{km} \det(A_{-(k,m)}) \qquad (A.15)$$
where

$$\det \begin{pmatrix} s & t \\ u & v \end{pmatrix} = sv - tu \qquad (A.16)$$
• A−1 exists if and only if its rows (or columns) are linearly indepen-
dent.
Again, though, in some cases A is part of a more complex system, and the
inverse is not explicitly computed.
AX = λX (A.19)
value decomposition.
$$U' A U = D \qquad (A.20)$$
[1,]  474
> # transpose, inverse
> t(a)  # transpose
     [,1] [,2]
[1,]    1   10
[2,]    2   11
[3,]    3   12
> u <- matrix(runif(9), nrow=3)
> u
           [,1]       [,2]      [,3]
[1,] 0.08446154 0.86335270 0.6962092
[2,] 0.31174324 0.35352138 0.7310355
[3,] 0.56182226 0.02375487 0.2950227
> uinv
           [,1]      [,2]      [,3]
[1,]  0.5818482 -1.594123  2.576995
[2,]  2.1333965 -2.451237  1.039415
[3,] -1.2798127  3.233115 -1.601586
> u %*% uinv  # note roundoff error
             [,1]          [,2]          [,3]
[1,] 1.000000e+00 -1.680513e-16 -2.283330e-16
[2,] 6.651580e-17  1.000000e+00  4.412703e-17
[3,] 2.287667e-17 -3.539920e-17  1.000000e+00
> # eigenvalues and eigenvectors
> eigen(u)
$values
[1]  1.2456220+0.0000000i -0.2563082+0.2329172i -0.2563082-0.2329172i

$vectors
              [,1]                  [,2]                  [,3]
[1,] -0.6901599+0i -0.6537478+0.0000000i -0.6537478+0.0000000i
[2,] -0.5874584+0i -0.1989163-0.3827132i -0.1989163+0.3827132i
[3,] -0.4225778+0i  0.5666579+0.2558820i  0.5666579-0.2558820i

> # diagonal matrices (off-diagonals 0)
> diag(3)
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1
> diag((c(5,12,13)))
     [,1] [,2] [,3]
[1,]    5    0    0
[2,]    0   12    0
[3,]    0    0   13
Appendix B
R Quick Start
B.1 Correspondences
aspect C/C++ R
assignment = <- (or =)
array terminology array vector, matrix, array
subscripts start at 0 start at 1
array notation m[2][3] m[2,3]
2-D array storage row-major order column-major order
mixed container struct list
return mechanism return return() or last value comp.
logical values true, false TRUE, FALSE
combining modules include, link library()
run method batch interactive, batch
B.2 Starting R
# test oddcount(), but some traits of vectors first
> y <- c(5, 12, 13, 8, 88)  # c() is concatenate
> y
[1]  5 12 13  8 88
> y[2]  # R subscripts begin at 1, not 0
[1] 12
> y[2:4]  # extract elements 2, 3 and 4 of y
[1] 12 13  8
> y[c(1, 3:5)]  # elements 1, 3, 4 and 5
[1]  5 13  8 88
> oddcount(y)  # should report 2 odd numbers
[1] 2
# try it on subset of y, elements 2 through 3
> oddcount(y[2:3])
[1] 1
> # try it on subset of y, elements 2, 4 and 5
> oddcount(y[c(2,4,5)])
[1] 0
# further compactify the code
> source("odd.R")
> oddcount
function(x) {
   length(x[x %% 2 == 1])
   # last value computed is auto returned
}
> oddcount(y)  # test it
[1] 2
$odds
[1]  5 13

$numodds
[1] 2

# save the output in ocy, which will be a list
> ocy <- oddcount(y)
> ocy
$odds
[1]  5 13

$numodds
[1] 2

> ocy$odds
[1]  5 13
> ocy[[1]]
[1]  5 13
# can get list elements using [[ ]] instead of $
> ocy[[2]]
[1] 2
   user  system elapsed
  0.008   0.000   0.006
> system.time({s <- 0;
   for (i in 1:1000000) s <- s + x[i]})
   user  system elapsed
  2.776   0.004   2.859
     [,1] [,2]
[1,]    3    5
[2,]    4    6
> m1 * m3  # elementwise multiplication
     [,1] [,2]
[1,]    3   10
[2,]   20   48
> 2.5 * m3  # scalar multiplication (but see below)
     [,1] [,2]
[1,]  7.5 12.5
[2,] 10.0 15.0
> m1 %*% m3  # linear algebra matrix multiplication
     [,1] [,2]
[1,]   11   17
[2,]   47   73
The "scalar multiplication" above is not quite what you may think, even though the result may be. Here's why:
In R, scalars don't really exist; they are just one-element vectors. However, R usually uses recycling, i.e., replication, to make vector sizes match. In the example above in which we evaluated the expression 2.5 * m3, the number 2.5 was recycled to the matrix

$$\begin{pmatrix} 2.5 & 2.5 \\ 2.5 & 2.5 \end{pmatrix} \qquad (B.1)$$
All three vector expressions must be the same length, though R will lengthen
some via recycling. The action will be to return a vector of the same length
(and if matrices are involved, then the result also has the same shape).
Each element of the result will be set to its corresponding element in vec-
torexpression2 or vectorexpression3, depending on whether the corre-
sponding element in vectorexpression1 is TRUE or FALSE.
In our example above,

> ifelse(m2 %% 3 == 1, 0, m2)  # (see below)

$$\begin{pmatrix} T & F & F \\ F & T & F \end{pmatrix} \qquad (B.2)$$

$$\begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix} \qquad (B.3)$$
[1,]    1    2
[2,]    5   12
> rep(1,2)  # "repeat," make multiple copies
[1] 1 1
> ma %*% rep(1,2)  # matrix multiply
     [,1]
[1,]    3
[2,]   17
> solve(ma, c(3,17))  # solve linear system
[1] 1 1
> solve(ma)  # matrix inverse
     [,1] [,2]
[1,]  6.0 -1.0
[2,] -2.5  0.5
The R list type is, after vectors, the most important R construct. A list is
like a vector, except that the components are generally of mixed types.
$s
[1] "abc"

[1] "abc"
> for (i in 1:length(g)) print(g[[i]])
[1] 4 5 6
[1] "abc"
One often needs to combine elements of a list in some way. One approach
to this is to use Reduce():
> x <- list(4:6, c(1,6,8))
> x
[[1]]
[1] 4 5 6

[[2]]
[1] 1 6 8

> sum(x)
Error in sum(x) : invalid 'type' (list) of argument
> Reduce(sum, x)
[1] 30
B.6.3 S3 Classes
$salary
[1] 55000

$union
[1] TRUE

attr(,"class")
[1] "employee"
For example:

> z <- list(a = runif(50),
   b = list(u = sample(1:100, 25), v = "blue sky"))
> z
$a
 [1] 0.301676229 0.679918518 0.208713522 0.510032893 0.405027042 0.412388038
 [7] 0.900498062 0.119936222 0.154996457 0.251126218 0.928304164 0.979945937
[13] 0.902377363 0.941813898 0.027964137 0.992137908 0.207571134 0.049504986
[19] 0.092011899 0.564024424 0.247162004 0.730086786 0.530251779 0.562163986
[25] 0.360718988 0.392522242 0.830468427 0.883086752 0.009853107 0.148819125
[31] 0.381143870 0.027740959 0.173798926 0.338813042 0.371025885 0.417984331
[37] 0.777219084 0.588650413 0.916212011 0.181104510 0.377617399 0.856198893
[43] 0.629269146 0.921698394 0.878412398 0.771662408 0.595483477 0.940457376
[49] 0.228829858 0.700500359

$b
$b$u
 [1] 33 67 32 76 29  3 42 54 97 41 57 87 36 92 81 31 78 12 85 73 26 44 86 40 43

$b$v
[1] "blue sky"

> names(z)
[1] "a" "b"
> str(z)
List of 2
 $ a: num [1:50] 0.302 0.68 0.209 0.51 0.405 ...
 $ b: List of 2
  ..$ u: int [1:25] 33 67 32 76 29 3 42 54 97 41 ...
  ..$ v: chr "blue sky"
> names(z$b)
[1] "u" "v"
> summary(z)
  Length Class  Mode
a 50     -none- numeric
b  2     -none- list
B.7 Debugging in R
Appendix C

Introduction to C for R Programmers
The C language is quite complex, and C++ is even more so. The goal of
this appendix is to give a start at a reading ability in the C language for
those familiar with R.
// include definitions needed for standard I/O
#include <stdio.h>

// function to square the elements of an array x,
// of length n, in-place; both arguments are of
// integer type
void sqr(int *x, int n) {
   // allocate space for an integer i
   int i;
   // for loop, i = 0,1,2,...,n-1
   for (i = 0; i < n; i++)
      x[i] = x[i] * x[i];
}

int main() {
   // allocate space for an array y of 10
   // integers, and a single integer i
   int y[10], i;
   for (i = 0; i < 5; i++)
      // input y[i]
      scanf("%d", &y[i]);
   sqr(y, 10);  // call the function
   for (i = 0; i < 5; i++)
      printf("%d\n", y[i]);
}
Here is the compilation and sample run, using the gcc compiler:
$ gcc -g Learn.c
$ ./a.out
5 12 13 8 88
25
144
169
64
7744
C.0.2 Analysis
The comments explain most points, but a couple need detailed elaboration. First, consider the definition of main(). Every C program (or other executable binary program, for that matter) is required to have a main() function, where execution will start.
As you see in the declaration int y[10], i; and other places, every variable needs to be declared, meaning that we must request the compiler to make space for it. Since array indices start at 0 in C, this means we have to set up y[0] through y[9]. The compiler also needs to know the type of the variable, in this case integer.
The scanf() function has two arguments, the first here being the character
string ”%d”, which defines the format. Here we are specifying to read in
one integer (’d’ refers to “decimal number”).
Things get more subtle in the second argument, where we see a major philosophical difference between C and R. The latter prides itself on (usually)
not allowing side effects, meaning that in R one cannot change the value of
an argument. The R call sort(x), for instance, does NOT change x. What
about C?
Technically, C doesn’t allow direct changes to arguments either. But the key
is that C allows—and makes heavy use of—pointer variables. For example,
consider the code:
int u;
int *v = &u;
*v = 8;
The ampersand in &u means that the expression evaluates to the memory address of u. The asterisk in the declaration of v tells the compiler that we intend v to contain a memory address. Finally, the line
*v = 8;
says, “Put 8 in whatever memory location v points to.” This means that
u will now contain 8!
Now we can see what is happening in that scanf() call. Let’s rewrite it
this way:
int *z = &y[i];
scanf("%d", z);
We are telling the compiler to produce machine code that will place the value read in from the keyboard into whatever memory location is pointed to by z, i.e., to place the value in y[i]. Convoluted, but this is how things work. We don't need this in the printf() call, as we are not changing y[i].
The same principles are at work in the definition of sqr() and in the call sqr(y,10) in main(). In the former, we state that x is a pointer variable. But how does that jibe with the call? Why doesn't the latter write something like &y? The reason is that array variables are considered pointers. The simple expression y (without a subscript) actually means &y[0].
If you are having trouble with this, console yourself with the fact that point-
ers are by far the most difficult concept that beguile novice C programmers
(especially computer science students!). Just persist—you’ll quickly get
used to it.
C.1 C++
When the Object Oriented Programming wave came in, many in the C
world wanted OOP for C. Hence, C++ (originally called “C with Classes”).
The reader is referred to any of the excellent tutorials on C++ on the Web
and in books. But here is a very brief overview, just to give the reader an
inkling of what is involved.
C++ class structure is largely like R’s S4. We create a new class instance
using the keyword new, much like the call to new() we make in R to create
an S4 class object.
As with S4, C++ classes generally contain methods, i.e., functions defined
specific to that class. For instance, in our Rcpp code in this book, we
sometimes have the expression Rcpp::wrap, meaning the wrap() function
within the Rcpp class.
C++ continues C’s emphasis on pointers. One important keyword, for
example, is this. When invoked from some method of a class instance, it
is a pointer to that instance.
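As a tiny generic illustration of these points (my own example, not from the book's code):

#include <iostream>

class Counter {
public:
   Counter(int start) : count(start) {}
   // a method; inside it, 'this' points to the instance it was called on
   void bump() { this->count++; }
   int value() const { return count; }
private:
   int count;
};

int main()
{
   Counter *c = new Counter(5);      // create an instance with 'new'
   c->bump();
   std::cout << c->value() << "\n";  // prints 6
   delete c;
   return 0;
}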