Task Level Parallelization of All Pair Shortest Path Algorithm in OpenMP 3.0
Abstract—OpenMP is a standard parallel programming language for developing parallel applications on shared memory machines. OpenMP is very suitable for designing parallel algorithms for regular applications, where the amount of work is known a priori and the distribution of work among the threads can therefore be done at compile time. In irregular applications, the load changes dynamically at runtime, and the distribution of work among the threads can be done only at runtime. It has been shown in the literature that OpenMP produces poor performance for irregular applications. In 2008, the OpenMP 3.0 version introduced new features, such as "tasks", to handle irregular computations. Not much work has gone into studying irregular algorithms in OpenMP 3.0. In this paper, we consider one graph problem, the all pair shortest path problem, and its implementation in OpenMP 3.0. We show that for a large number of vertices, the algorithm running on OpenMP 3.0 surpasses the one on OpenMP 2.5 by a factor of 1.6.

Keywords—OpenMP 3.0; All Pair Shortest Path; Task Parallelization
I. INTRODUCTION

Homogeneous multicore architectures have been used widely in the past decade. This is due to the need for machines with high performance that are more computationally powerful than uniprocessor machines. In a homogeneous multicore architecture, many identical processors or cores work together to perform complex tasks. Many companies, such as Intel, have moved towards increasing the processor's power by adding more cores on a single chip. Most commodity homogeneous architectures have many duplicated CPUs on a single chip with a shared memory. The different CPUs interact with each other through shared variables.

Shared memory machines can be categorized as either Uniform Memory Access (UMA) or Non-Uniform Memory Access (NUMA) architectures. In UMA machines, the CPUs have the same access time to a shared primary memory. In NUMA machines, on the other hand, each CPU has its own memory, which can be accessed both by the CPU it belongs to and by other CPUs; the memory access time is, therefore, non-uniform. Modern homogeneous multicore architectures with a shared memory system are also multithreaded: the cores are capable of handling several threads concurrently. These architectures exploit both instruction level parallelism and thread level parallelism. There are many parallel programming languages or APIs that support a shared memory paradigm. One such API is OpenMP [11].

OpenMP contains a set of compiler directives and libraries to execute specific instructions in parallel and to divide the work among threads. OpenMP employs a fork-join paradigm. The program starts with one thread, called the master thread. Whenever a parallel region is encountered, the master thread invokes a set of slave threads and distributes the work among them; this operation is called fork. After forking, the threads are allocated to the processors by the runtime environment and work concurrently to solve the problem. Once the slave threads have completed their work, they are destroyed and the master thread continues until it encounters another parallel region; this operation is called join.
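For illustration, the following minimal C sketch (illustrative only; it is not the implementation evaluated in this paper) shows the fork-join pattern: a team of threads is forked at the parallel region and joined at its closing brace.

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        printf("serial part: master thread only\n");

        #pragma omp parallel              /* fork: a team of threads is created */
        {
            int id = omp_get_thread_num();
            printf("thread %d of %d working\n", id, omp_get_num_threads());
        }                                 /* join: slave threads finish here    */

        printf("serial part again: master thread continues\n");
        return 0;
    }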
OpenMP is very suitable for designing algorithms for regular applications. The data structures used in these problems are structured (such as an array), and the program flow and memory access patterns are also structured and known a priori. An example of a regular problem is matrix-vector multiplication b = Ax, where A is a dense matrix, x is a vector and b is the resultant vector. In this example, the computations required to produce the output and the data access patterns are known beforehand. On a multiprocessor system, each processor can be assigned the same vector x together with a certain number of data elements (a row, or a given number of rows, of A) to compute the corresponding elements of b. All processors perform the same computations to produce the resultant vector, but on different data sets. As a result, these problems can be optimized to run on any type of architecture relatively easily. They are also classified as data parallel applications.
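A minimal C sketch of this regular case (again illustrative, not code from this paper): since every row of A costs the same and the iteration space is known at compile time, a single work-sharing directive with a static division of rows suffices.

    /* Dense matrix-vector product b = A x; rows are divided
       statically among the threads, since the work per row is uniform. */
    void matvec(int n, const double A[n][n], const double x[n], double b[n]) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int j = 0; j < n; j++)
                sum += A[i][j] * x[j];
            b[i] = sum;
        }
    }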
The same is not true for irregular applications, which rely on pointer- or graph-based data structures. The algorithms used to solve such applications are referred to as irregular algorithms. Graph problems, list ranking and unstructured grid problems are examples of irregular computations. In these computations [15], [12], [6], [10], the data size changes dynamically at runtime, leading to non-uniform memory access and communication latencies; the load, or amount of work to be distributed among the threads, is not known a priori. Even matrix-vector multiplication becomes an irregular problem if A is a sparse matrix: since A is instance specific, its structure is unknown at compile time, and a dense matrix is not necessarily the correct data structure to use, as the many zeros in the matrix would waste memory resources. In such problems, accesses to data often have poor spatial and temporal locality, leading to ineffective use of the memory hierarchy.
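The task construct introduced in OpenMP 3.0 targets exactly these cases. As a minimal sketch (illustrative only; node_t and process() are hypothetical names, and this is not the implementation evaluated in this paper), a linked list, whose length is unknown at compile time, can be traversed by letting one thread walk the list while spawning one task per node:

    typedef struct node {
        int data;
        struct node *next;
    } node_t;

    void process(node_t *p);              /* hypothetical per-node work */

    void traverse(node_t *head) {
        #pragma omp parallel
        #pragma omp single                     /* one thread walks the list ...  */
        for (node_t *p = head; p != NULL; p = p->next) {
            #pragma omp task firstprivate(p)   /* ... spawning a task per node   */
            process(p);
        }
    }                                          /* implicit barrier: all tasks done */

The runtime scheduler assigns these tasks to idle threads, so the distribution of work is decided at runtime, matching the irregular setting described above.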
[Fig. 1: Execution Time for SSAC#2 — execution time in seconds versus number of vertices (0 to 4,000)]

[Fig. 2: Execution Time for R-MAT — execution time in seconds versus number of vertices (0 to 4,000), OpenMP 3.0 versus OpenMP 2.5]
V. CONCLUSION

In this paper we implemented one graph problem, the all pair shortest path problem, in OpenMP 3.0. We showed that the algorithm runs 1.6 times faster than the OpenMP 2.5 version for two different types of graphs.

REFERENCES

[5] Nadira Jasika, Naida Alispahic, Arslanagic Elma, Kurtovic Ilvana, Lagumdzija Elma, and Novica Nosovic. Dijkstra's shortest path algorithm serial and parallel execution performance analysis. In Proceedings of the 35th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO 2012), Opatija, Croatia, pages 1811–1815, 21–25 May 2012.

[6] Milind Kulkarni, Keshav Pingali, Bruce Walter, Ganesh Ramanarayanan, Kavita Bala, and L. Paul Chew. Optimistic parallelism requires abstractions. In Proceedings of the 2007 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2007), San Diego, CA, USA, pages 211–222, 2007.

[7] Jian Ma, Ke-ping Li, and Li-yan Zhang. A parallel Floyd-Warshall algorithm based on TBB. In Proceedings of the 2nd IEEE International Conference on Information Management and Engineering (ICIME 2010), Bangkok, Thailand, pages 429–433, 2010.

[8] Timothy G. Mattson. How good is OpenMP? Scientific Programming, 11(2):81–93, 2003.

[9] Aaftab Munshi. The OpenCL specification. Khronos OpenCL Working Group, 1:l1–15, 2009.

[10] Jarek Nieplocha, Andrés Márquez, John Feo, Daniel Chavarría-Miranda, George Chin, Chad Scherrer, and Nathaniel Beagley. Evaluating the potential of multithreaded platforms for irregular scientific computations. In Proceedings of the 4th International Conference on Computing Frontiers (CF 2007), Ischia, Italy, pages 47–58, 7–9 May 2007.

[11] OpenMP. The OpenMP API specification for parallel programming. http://openmp.org/wp/, 1998.

[12] Simone Secchi, Antonino Tumeo, and Oreste Villa. A bandwidth-optimized multi-core architecture for irregular applications. In Proceedings of the 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2012), Ottawa, ON, Canada, pages 580–587, 13–16 May 2012.

[13] Michael Süß and Claudia Leopold. Implementing irregular parallel algorithms with OpenMP. In Proceedings of the 12th International Conference on Parallel Processing (Euro-Par 2006), Dresden, Germany, pages 635–644, 2006.

[14] Gayathri Venkataraman, Sartaj Sahni, and Srabani Mukhopadhyaya. A blocked all-pairs shortest-paths algorithm. Journal of Experimental Algorithmics, 8, December 2003.

[15] Zheng Zhang and Josep Torrellas. Speeding up irregular applications in shared-memory multiprocessors: memory binding and group prefetching. In Proceedings of the 22nd Annual International Symposium on Computer Architecture (ISCA 1995), 1995.