GPUMap - A Transparently GPU-Accelerated Python Map Function

Ivan Pachev (ipachev@calpoly.edu) and Chris Lupo (clupo@calpoly.edu)
California Polytechnic State University, San Luis Obispo, California, USA
3.3 Numba

Numba is a just-in-time (JIT) compiler for Python targeted towards scientific computing. Numba provides limited GPU-acceleration, allowing programmers to operate on NumPy arrays in parallel using the GPU [4]. GPU-acceleration with Numba is much easier than writing CUDA code because Numba provides considerable simplifications to the traditional GPU programming model. Numba does not require programmers to be well-versed in CUDA programming. Instead, the code that is to be translated can be written in a subset of Python. Only a subset of Python is allowed because Numba must be able to infer all types in order to generate the proper GPU code [4]. This means that object-oriented code cannot be used.
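For orientation, here is a minimal sketch of the kind of kernel Numba's CUDA backend accepts (our own example, not from the paper; it requires the numba package and a CUDA-capable GPU):

```python
import numpy as np
from numba import cuda

@cuda.jit
def scale(out, arr, factor):
    i = cuda.grid(1)          # absolute index of this GPU thread
    if i < arr.size:          # guard threads past the end of the array
        out[i] = arr[i] * factor

arr = np.arange(1024, dtype=np.float32)
out = np.empty_like(arr)
scale[4, 256](out, arr, 2.0)  # launch 4 blocks of 256 threads
```

Note that the kernel body is plain Python over array elements; Numba infers all types at launch time, which is why dynamically typed or object-oriented code is out of scope.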
3.4 Multi-GPU MapReduce (GPMR)

Multi-GPU MapReduce is a flexible MapReduce implementation that works on clusters where the compute nodes have GPUs. Multi-GPU MapReduce outperforms the normal CPU implementation of MapReduce, as well as other single-GPU implementations of MapReduce [14]. Although the framework performs well and is flexible, Multi-GPU MapReduce does not provide a high-level abstraction for the underlying system, which may cause difficulty for users without GPU programming experience.

4.1 Requirements

The implementation of GPUMap imposes the following requirements to support GPU execution (a brief illustration follows the list):

• Objects must contain only integers, floating point numbers, booleans, or other objects.
• Objects of the same class must have the same fields and the same types in their corresponding fields, i.e. they must be homogeneous.
• Objects cannot contain members of their own class, either directly or indirectly.
• Lists must contain only one class of objects.
• Functions or methods must be passed arguments of the same type every time they are called.
• Functions or methods must return the same type every time they are called.
• When a function is called, the function must call the same functions or methods every time.
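As a concrete illustration (our own hypothetical classes, not from the paper), the following objects satisfy these requirements: every field is a primitive or a nested object, every instance of a class has the same field layout, and the list is homogeneous.

```python
class Point:
    def __init__(self, x, y):
        self.x = x   # always a float
        self.y = y   # always a float

class Body:
    def __init__(self, x, y, mass):
        self.position = Point(x, y)  # nested object, same class every time
        self.mass = mass             # always a float

# A homogeneous list: every element is a Body with identical field types.
bodies = [Body(float(i), float(i), 1.0) for i in range(1000)]
```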
4.2 Invocation

When the programmer calls gpumap(f, L), the following steps are taken in order to perform the desired map operation (a brief usage sketch follows the list):

(1) f is applied to the first element of L, L0, to produce L0′, and runtime inspection is performed to analyze every function call.
(2) The fields of L0 and L0′ are inspected to collect data about the classes of L0 and L0′.
(3) If f is a closure, any objects that are included in the closure bindings are also inspected and information is collected about their classes.
(4) CUDA C++ class definitions are created for the necessary classes by using the information collected during runtime inspection and object inspection.
(5) Any functions and methods, including constructors, that are called when applying f to L0 are translated into CUDA C++.
(6) All of the elements of L1...n are serialized to the GPU. Any of the objects or lists that have closure bindings in f are also serialized.
(7) The map kernel, which includes all class, function, and method translations, is compiled and executed on the GPU, applying the translation of f, f′, to each element in the serialized version of L1...n.
(8) The serialized input list, L1...n, and any closure objects or lists are deserialized and the data is re-incorporated into the original objects.
(9) The output list L′1...n is deserialized and is used to populate a list of objects based on the structure of L0′.
(10) L0′ is prepended to L′1...n to form L′ as desired, and L′ is returned.
4.3 Implementation Details

There are many details and important algorithms used in the implementation of GPUMap. Some of the most pertinent steps are described here, but a full discussion of the implementation is beyond the scope of this paper. For full details of the implementation, the reader is referred to Reference [8] and the source repository on GitHub.
4.4 Runtime Inspection

Prior to doing any code translation or serialization, some data must be collected about the functions and methods that will need to be called, as well as the objects that will need to be operated on. This data is acquired by inspecting the fields of objects and tracing function execution when the given function is applied to the first element in the list. Runtime inspection consists of call inspection and object inspection, which are performed to extract representations of functions, methods, and objects.

Several pieces of information need to be known about a class of objects in order to generate the appropriate CUDA class definitions and properly serialize objects of that class. This information is stored in a Class Representation.

The Class Representation is a recursive data structure that may contain multiple other Class Representations, one for each field. A Class Representation is extracted by examining all of the fields of a sample object, obj, by iterating through obj's fields as a normal Python dict, using obj.__dict__. This dict contains a mapping from obj's field names to the objects contained in those fields. For each entry in this dict, the field name is recorded, and a Class Representation is extracted and recorded for the object contained in the field. Figure 1 depicts a sample extraction of a Class Representation of an object of a class called classA.
Figure 1: Sample Extraction of a Class Representation
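The extraction can be pictured as a short recursive walk over an object's __dict__. The following is a simplified sketch of the idea (our own illustration; GPUMap's actual Class Representation records more detail, including methods and field ordering):

```python
def extract_class_repr(obj):
    """Recursively record field names and the representation of each field's value."""
    if isinstance(obj, (int, float, bool)):
        return type(obj).__name__                     # leaf: a primitive type name
    fields = {}
    for name, value in obj.__dict__.items():
        fields[name] = extract_class_repr(value)      # recurse into nested objects
    return {"class": type(obj).__name__, "fields": fields}
```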
4.5 Code Generation

In order to operate on a list L by applying a function f to each element in the list on the GPU, the necessary CUDA C++ class definitions and function/method definitions must be generated. This process is non-trivial, and requires the emulation of Python's pass-by-reference behavior by passing all objects as references to functions. Classes, methods, and functions are generated from the data structures extracted from runtime inspection.

Built-in Functions. Support was added for built-in functions such as math functions, len, print, and others. Some of these built-in functions have existing counterparts in CUDA C++, such as the math functions. Other functions that do not have existing counterparts, such as len, are implemented in C++ and supplied in a header during compilation. Because the names of built-in Python functions do not always match the names of built-in CUDA C++ functions, translating the names of built-in Python functions may be necessary.
4.6 Kernel Generation

The final step of code generation is finalizing the CUDA kernel function that will be executed on the GPU. The kernel function is a function that is executed by each GPU thread. In order to parallelize the map process, each GPU thread applies the top-level function to a different list item.
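The shape of the generated kernel can be sketched as a Python template string (our own illustration of the idea; the code GPUMap actually generates is more involved, and the names map_kernel and f_prime are hypothetical):

```python
# Hypothetical template: each thread applies the translated function f' to one element.
KERNEL_TEMPLATE = """
extern "C" __global__ void map_kernel({cls} *items, {out_cls} *out, int n) {{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per list item
    if (i < n) {{
        out[i] = {translated_f}(items[i]);
    }}
}}
"""

def generate_kernel(cls, out_cls, translated_f):
    return KERNEL_TEMPLATE.format(cls=cls, out_cls=out_cls, translated_f=translated_f)

print(generate_kernel("ClassA", "float", "f_prime"))
```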
4.7 Serialization

When calling gpumap(f, L) with a function f and a list L, L and any closure variables of f must be serialized and copied to the GPU. The list L and any closure variables are not cached on the GPU and must be serialized for every call to gpumap, although this may be addressed by future work to improve performance. After the translated code is executed, L and f's closure variables must be copied back to the host and deserialized.

4.7.1 Serializing Objects. Prior to copying an object to the GPU, the object must be serialized into the proper format so that the object can be processed by the translated CUDA code. Serializing a Python object involves collecting all of its non-contiguous parts into a contiguous section of memory as normal binary data, as depicted in Figure 2. In order to collect all the data in an object, including the data in its nested objects, the object's fields can be recursively examined. The fields must be accessed in the same order as in the class definition that is created during the class generation phase.
4.7.2 Serializing Lists of Objects. The process for serializing an entire list is similar to serializing a single object, so that objects, whether or not they originate from a list, can be accessed and manipulated the same way, using the same C++ class definition. The same is true for deserialization. However, in the case of deserializing the output list of the map operation, the objects do not yet exist in the Python code, so a slightly different approach must be taken that also involves object creation, so the data can be unpacked into objects. For each object to be instantiated in the output list, the first output object, which was created during runtime inspection, is deep copied and used as a skeleton for the new object. The object is then inserted into the output list. Once the correct number of objects have been created, populated, and inserted into the output list, the output list is returned.

4.8 Integration in Spark

GPUMap can be integrated into a variety of transformations and actions that can be performed on Spark RDDs. This section describes the implementation of GPURDD, a class that extends RDD and provides alternative implementations of map, filter, and foreach. The restrictions described at the beginning of this section regarding lists also extend to Spark RDDs.

4.8.1 Map. The map method on a Spark RDD allows the programmer to perform a transformation on the RDD by individually applying a function f to each element of the RDD to produce a new RDD containing transformed elements.

The existing implementation of RDD's map method defines a function g that takes an iterator of the input type of f and returns an iterator of the output type of f. Because f is stored in the closure bindings of g, f can be passed along with g. This closure g is then passed to mapPartitions so that f is applied to each element in a partition. The mapPartitions method accepts a function that takes an iterator of the input type and produces an iterator of the transformed type, making g an acceptable candidate.

The closure g that is implemented inside the body of map returns a map generator that applies the function passed to map, f, to each element returned by the iterator supplied to g. Map generators are created by using Python's built-in map function. Each time the map generator is iterated upon, the generator lazily applies f to a new element from the iterator passed to g and produces the output. In order to obtain an iterator to an entire partition, g is passed to mapPartitions, where g will be given the partition iterator. The call to mapPartitions returns a handle to an RDD that will eventually contain the items transformed using g, once the RDD is evaluated, and this is returned from the RDD's map method.

In order to incorporate GPUMap into GPURDD's map function, the implementation of g must be altered to create g′. The function g′ must still take an iterator of f's input type and return an iterator of f's output type. The application of f on many items from the iterator must be evaluated immediately in parallel, rather than producing a map generator that evaluates the application of f to each element from an iterator in sequence. This means that access to all the elements in a partition simultaneously is necessary, which can be achieved by exhausting the iterator into a list. This list can then be passed into GPUMap, along with f, in order to apply f in parallel and produce a transformed list. Because GPUMap outputs a list and not an iterator, g′ must return an iterator over the list, rather than just the list itself. The last step of GPURDD's map method is to return the return value of the call to mapPartitions, which is a handle to an RDD that will eventually contain the values that will have been transformed using g′.

Once an action is performed on the resulting GPURDD in order to evaluate all of the transformations, g′ will be called with an iterator to the partition elements in order to transform them. Although the partition iterator is exhausted by g′, Spark's lazy evaluation model is still preserved because g′ is passed to mapPartitions. The mapPartitions method will only call g′ when the time comes to evaluate the RDD. RDDs are evaluated when an action, such as collect, count, or foreach, is performed on them.
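A condensed sketch of this arrangement follows (our own illustration; the gpumap import path and the exact class structure are assumptions on our part):

```python
from pyspark.rdd import RDD   # Spark's RDD base class
from gpumap import gpumap     # assumed import path

class GPURDD(RDD):
    def map(self, f):
        def g_prime(iterator):
            items = list(iterator)           # exhaust the partition iterator
            transformed = gpumap(f, items)   # evaluate f on all items at once on the GPU
            return iter(transformed)         # mapPartitions expects an iterator back
        return self.mapPartitions(g_prime)
```

Because g_prime is only invoked by mapPartitions when an action forces evaluation, Spark's lazy evaluation model is preserved even though g_prime itself is eager within a partition.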
4.8.2 Foreach. The foreach method on a Spark RDD allows the programmer to apply a function f to each element of an RDD without transforming the RDD. Instead, the foreach method is used to produce side-effects in the elements of the RDD.

The existing implementation of foreach makes use of the RDD's mapPartitions method, similarly to the way the map method does. The foreach method simply defines a function g that iterates through a partition iterator, applying f to each item of the partition and ignoring the return value of the call to f. In order to comply with the fact that any function passed to mapPartitions must return an iterator, g returns an empty iterator. This function, g, is passed to mapPartitions, which creates a handle to an empty, dummy RDD. The reason an empty RDD is created is that g returns an empty iterator. When this dummy RDD is evaluated, f is applied to each element of the source RDD. In order to force evaluation of the dummy RDD, the foreach method calls the count method of this dummy RDD. The handle to the dummy RDD is not returned from foreach, as a handle to the dummy RDD is not useful.

GPUMap preserves side-effects, and can be effectively incorporated into foreach. The approach taken to incorporate GPUMap is similar to the approach taken with GPURDD's map method. A function g′ is defined that takes an iterator over a partition and returns an empty iterator. The body of g′ simply consists of exhausting the partition iterator into a list, and calling gpumap with f and the list created by exhausting the iterator. The return value of the call to gpumap can be discarded as it is not useful. Then, g′ is passed to mapPartitions to create a handle to a dummy RDD and, similarly to RDD's foreach method, evaluation of the dummy RDD is forced using the handle's count method in order to apply g′ to each partition.
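Continuing the same sketch (same assumptions as the map sketch above), GPURDD's foreach can be pictured as:

```python
# Additional method of the GPURDD sketch above.
class GPURDD(RDD):
    def foreach(self, f):
        def g_prime(iterator):
            items = list(iterator)   # exhaust the partition iterator
            gpumap(f, items)         # run f for its side-effects; the result is discarded
            return iter([])          # mapPartitions requires an iterator return value
        # count() forces evaluation of the otherwise-empty dummy RDD.
        self.mapPartitions(g_prime).count()
```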
4.8.3 Filter. The filter method on a Spark RDD allows the programmer to transform an RDD by removing elements from the RDD, by applying a function f to each element that returns a boolean indicating whether or not to keep the element.

The filter method is implemented very similarly to map, incorporating the use of mapPartitions. This method defines a closure g that takes an iterator that provides elements of the RDD and returns an iterator that provides the elements that did not get filtered out. The closure g calls Python's built-in filter function with f to create an iterator that produces the items from an iterable for which f returns true, and simply returns this iterator. Then, g is passed to mapPartitions, which provides g with an iterator over a partition, so that the elements of the partition can be filtered. An RDD handle is returned by the call to mapPartitions.

The purpose of incorporating GPUMap into GPURDD's filter method is to attempt to speed up the evaluation of f on each element of the partition. Because, when using GPUMap, the input list and output list must have a one-to-one correspondence, GPUMap cannot be directly used to filter the elements. However, the results of applying f to each item can be computed using GPUMap and can subsequently be used to remove elements.

In order to implement GPURDD's filter method, a function g′ must be created to be used with mapPartitions, similar to GPURDD's implementation of map. First, the iterator passed to g′ must be exhausted to produce a list of items that can be operated on in parallel. Then, gpumap is called with f and the list of items to produce a list of boolean values indicating whether or not to keep each entry in the list of items. Once the list of items and the list of booleans are available, Python's zip iterator can be used to provide tuples consisting of each item and its corresponding boolean. Then, Python's built-in filter function is used to create a filter iterator from the zip iterator. This filter iterator does not return tuples whose second field, the boolean, is false. The tuples returned by the filter iterator can then be converted back into items by using a map generator, created with Python's built-in map function, that maps a tuple yielded by the filter iterator to the tuple's first field, which is the item itself. This map generator serves as an iterator over the filtered items and is returned by g′. This process is illustrated in Figure 3.

Then g′ is passed to mapPartitions and the resulting RDD handle is returned. Once an action is performed on the resulting RDD, the RDD will be evaluated and g′ will be called on an iterator over each partition, as with GPURDD's implementation of map.
Figure 3: GPURDD filter method
5 EXPERIMENTS

In order to determine the types of workloads that can be accelerated using GPUMap, performance benchmarks are performed. The tests used for performance benchmarking GPUMap are the n-body test and the bubble sort test. The tests used for performance benchmarking GPURDD are the shell sort test and the pi estimation tests. These benchmarks use a variety of algorithms with different time complexities, allowing us to examine the viability of GPUMap or GPURDD in these different scenarios.

The experimental setup consists of machines fitted with:
• Intel Xeon E5-2695 v3 CPU @ 2.30GHz
• NVIDIA GeForce GTX 980
• 32 GB Memory
• CentOS 7

The NVIDIA GeForce GTX 980 has 4GB of memory and 2048 CUDA cores running at 1126 MHz. The GTX 980 supports CUDA Compute Capability 5.2, which allows up to 1024 threads per block and a maximum one-dimensional grid size of 2^31 - 1. Although the maximum number of threads per block is 1024, the benchmarks all use a block size of 512 x 1 x 1. The grid size is ceil(n/512) x 1 x 1, where n is the size of the input list. This configuration allows GPUMap to achieve an occupancy of 100%, meaning that all 2048 CUDA cores on the GTX 980 are able to be used concurrently.
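The launch-configuration arithmetic above can be expressed directly; a small sketch (our own illustration):

```python
import math

def launch_config(n, block_size=512):
    """Return (grid, block) dimensions giving one GPU thread per list element."""
    grid_x = math.ceil(n / block_size)       # ceil(n / 512) blocks in one dimension
    return (grid_x, 1, 1), (block_size, 1, 1)

grid, block = launch_config(100000)          # grid = (196, 1, 1), block = (512, 1, 1)
```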
Figure 10: Stage duration using GPURDD to sort lists of size 10000 using shell sort

6 CONCLUSION

GPUMap was created in order to attempt to provide the programmer with a GPU-accelerated map function that does not require any extra effort to use.

GPUMap achieves a considerable performance improvement for certain types of programs. For compatible algorithms that have considerably larger time complexity than O(n) and a large enough data set, GPUMap may provide performance improvements. During benchmarking of GPUMap using the O(n^2) n-body algorithm, GPUMap was able to produce a speedup of 249 times on the largest data set. However, for algorithms with O(n) time complexity or better, GPUMap will likely not yield any considerable speedups due to its O(n) serialization complexity.

GPURDD was created to incorporate the simplified GPU-acceleration provided by GPUMap into Apache Spark. GPURDD's map and foreach methods use GPUMap to apply a given function to each item in a partition. In the case of the filter method, GPUMap is used to apply the filtering function to each item in a partition to determine whether each element should be kept. The elements that should not be kept are then pruned outside of GPUMap.

6.1 Future Work

There are some Python language features, data structures, and built-in functions that are unsupported in code translated by GPUMap, primarily because GPUMap does not make use of CUDA's thread-level dynamic allocation, as it does not perform well when many threads attempt to allocate memory simultaneously. This means that variable-length data structures such as lists, dicts, and strings are unsupported by GPUMap. However, GPUMap supports limited usage of lists that are included as input list elements or closure variables. By using an alternative thread-level dynamic allocation scheme, it may be possible to incorporate dynamic allocation into GPUMap so that many more Python features can be implemented.

There are further performance improvements that can be made to GPUMap by caching input lists and closure variables and parallelizing the serialization process, which may help GPUMap perform better overall by decreasing the serialization time.

REFERENCES
[1] Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: Simplified data processing on large clusters. Commun. ACM 51, 1 (2008), 107-113.
[2] Gary Frost and Mohammed Ibrahim. 2015. Aparapi Documentation. Technical Report. AMD Open Source Zone. https://aparapi.github.io/ http://developer.amd.com/tools-and-sdks/open-source/
[3] Andreas Klöckner, Nicolas Pinto, Yunsup Lee, Bryan Catanzaro, Paul Ivanov, and Ahmed Fasih. 2012. PyCUDA and PyOpenCL: A Scripting-Based Approach to GPU Run-Time Code Generation. Parallel Comput. 38, 3 (2012), 157-174. https://doi.org/10.1016/j.parco.2011.09.001
[4] Siu Kwan Lam, Antoine Pitrou, and Stanley Seibert. 2015. Numba: A LLVM-based Python JIT Compiler. In Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC (LLVM '15). ACM, New York, NY, USA, Article 7, 6 pages. https://doi.org/10.1145/2833157.2833162
[5] Y. Lin, S. Okur, and C. Radoi. 2012. Hadoop+Aparapi: Making heterogeneous MapReduce programming easier. Technical Report. hgpu.org. http://www.semihokur.com/docs/okur2012-hadoop_aparapi.pdf
[6] NVIDIA Corporation. 2017. CUDA C Programming Guide. Technical Report. NVIDIA Corporation. http://docs.nvidia.com/cuda/cuda-c-programming-guide
[7] John D. Owens, Mike Houston, David Luebke, Simon Green, John E. Stone, and James C. Phillips. 2008. GPU computing. Proc. IEEE 96, 5 (2008), 879-899.
[8] Ivan Pachev. 2017. GPUMap: A Transparently GPU-Accelerated Map Function. Master's thesis. California Polytechnic State University.
[9] Philip C. Pratt-Szeliga, James W. Fawcett, and Roy D. Welch. 2012. Rootbeer: Seamlessly using GPUs from Java. In 2012 IEEE 14th International Conference on High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems (HPCC-ICESS). IEEE, Liverpool, United Kingdom, 375-380.
[10] Python Software Foundation. 2017. Data Model. Technical Report. Python Software Foundation. https://docs.python.org/3/reference/datamodel.html
[11] Robert Sedgewick. 1996. Analysis of Shellsort and Related Algorithms. In Proceedings of the Fourth Annual European Symposium on Algorithms (ESA '96). Springer-Verlag, London, UK, 1-11. http://dl.acm.org/citation.cfm?id=647906.739656
[12] Oren Segal, Philip Colangelo, Nasibeh Nasiri, Zhuo Qian, and Martin Margala. 2015. SparkCL: A Unified Programming Framework for Accelerators on Heterogeneous Clusters. CoRR abs/1505.01120 (2015). http://arxiv.org/abs/1505.01120
[13] Artjoms Šinkarovs, Sven-Bodo Scholz, Robert Bernecky, Roeland Douma, and Clemens Grelck. 2014. SaC/C formulations of the all-pairs N-body problem and their performance on SMPs and GPGPUs. Concurrency and Computation: Practice and Experience 26, 4 (2014), 952-971.
[14] J. A. Stuart and J. D. Owens. 2011. Multi-GPU MapReduce on GPU Clusters. In 2011 IEEE International Parallel & Distributed Processing Symposium. 1068-1079. https://doi.org/10.1109/IPDPS.2011.102
[15] Yonghong Yan, Max Grossman, and Vivek Sarkar. 2009. JCUDA: A programmer-friendly interface for accelerating Java programs with CUDA. In Euro-Par 2009 Parallel Processing: 15th International Euro-Par Conference, Delft, The Netherlands, August 25-28, 2009. Proceedings. Springer Berlin Heidelberg, Berlin, Heidelberg, 887-899. https://doi.org/10.1007/978-3-642-03869-3_82
[16] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. In Presented as part of the 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12). USENIX, San Jose, CA, 15-28. https://www.usenix.org/conference/nsdi12/technical-sessions/presentation/zaharia
[17] Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: Cluster Computing with Working Sets. HotCloud 10 (2010), 10-10.