
2nd International Conference on Advances in Computer Science and Engineering (CSE 2013)

Task Level Parallelization of All Pair Shortest Path Algorithm in OpenMP 3.0

Eid Albalawi, Parimala Thulasiraman and Ruppa Thulasiram
Department of Computer Science, University of Manitoba, Winnipeg, Manitoba
Email: albalawi@cs.umanitoba.ca, thulasir@cs.umanitoba.ca, tulsi@cs.umanitoba.ca

Abstract—OpenMP is a standard parallel programming language for developing parallel applications on shared memory machines. OpenMP is very suitable for designing parallel algorithms for regular applications, where the amount of work is known a priori and the distribution of work among the threads can therefore be done at compile time. In irregular applications, the load changes dynamically at runtime, and the distribution of work among the threads can be done only at runtime. In the literature, it has been shown that OpenMP produces poor performance for irregular applications. In 2008, the OpenMP 3.0 version introduced new features such as "tasks" to handle irregular computations. Not much work has gone into studying irregular algorithms in OpenMP 3.0. In this paper, we consider one graph problem, the all pair shortest path problem, and its implementation in OpenMP 3.0. We show that for a large number of vertices, the algorithm running on OpenMP 3.0 surpasses the one on OpenMP 2.5 by 1.6 times.

Keywords-OpenMP 3.0; All Pair Shortest Path; Task Parallelization

I. INTRODUCTION

Homogeneous multicore architectures have been widely used in the past decade, owing to the need for machines that are more computationally powerful than uniprocessor machines. In a homogeneous multicore architecture, many identical processors or cores work together to perform complex tasks. Many companies, such as Intel, have moved towards increasing processor power by adding more cores on a single chip. Most commodity homogeneous architectures have many duplicated CPUs on a single chip with a shared memory, and the CPUs interact with each other through shared variables.

Shared memory machines can be categorized as either Uniform Memory Access (UMA) or Non-Uniform Memory Access (NUMA) architectures. In UMA machines, the CPUs have the same access time to a shared primary memory. In NUMA machines, each CPU has its own memory, which can be accessed both by the CPU it belongs to and by the other CPUs; the memory access time is, therefore, non-uniform. Modern homogeneous multicore architectures with a shared memory system are also multithreaded: the cores are capable of handling several threads concurrently. These architectures exploit both instruction level parallelism and thread level parallelism. There are many parallel programming languages and APIs that support a shared memory paradigm. One such API is OpenMP [11].

OpenMP contains a set of compiler directives and libraries to execute specific instructions in parallel and to divide the work among threads. OpenMP employs a fork-join paradigm. The program starts with one thread, called the master thread. Whenever there is a parallel region in the program, the master thread invokes a set of slave threads and distributes the work among them. This operation is called fork. After forking, the threads are allocated to the processors by the runtime environment and work concurrently to solve the problem. Once the slave threads have completed their work, they are destroyed, and the master thread continues until it encounters another parallel region. This operation is called join.
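To make the fork-join behavior concrete, the following minimal C program is a sketch of our own (it does not appear in the original paper): the team of threads is forked at the parallel directive and joined at the end of the region.

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        printf("master thread, before the parallel region\n");

        /* fork: the master thread spawns a team of slave threads */
        #pragma omp parallel
        {
            int id = omp_get_thread_num();
            printf("thread %d working inside the parallel region\n", id);
        } /* join: the team synchronizes here and only the master continues */

        printf("master thread, after the parallel region\n");
        return 0;
    }

Compiled with gcc -fopenmp, the number of forked threads defaults to the number of available cores.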
OpenMP is very suitable for designing algorithms for regular applications. The data structures used in these problems are structured (such as an array). The program flow and memory access patterns are also very structured and are known a priori. An example of a regular problem is matrix-vector multiplication b = Ax, where A is a dense matrix, x is a vector and b is the resultant vector. In this example, the computations required to produce the output, as well as the data access patterns, are known beforehand. On a multiprocessor system, each processor can be assigned the same vector x together with a certain number of data elements (a row, or a given number of rows, of A) to compute one or more elements of b. All processors perform the same computations to produce the resultant vector, but on different data sets. As a result, these problems can be optimized to run on any type of architecture relatively easily. These problems are also classified as data parallel applications.
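This data parallel pattern maps directly onto an OpenMP work-sharing loop. The sketch below is ours (the function and parameter names are illustrative, not from the paper); each thread receives a block of rows, and since the work per row is identical and known at compile time, a static schedule fits.

    /* Dense matrix-vector multiplication b = A*x, with A stored
     * row-major in a flat array. The rows are divided evenly among
     * the threads because every row costs the same amount of work. */
    void matvec(int n, const double *A, const double *x, double *b) {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int j = 0; j < n; j++)
                sum += A[i * n + j] * x[j];
            b[i] = sum;
        }
    }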
The same is not true for irregular applications. Irregular applications rely on pointer or graph-based data structures, and the algorithms used to solve them are referred to as irregular algorithms. Graph problems, list ranking and unstructured grid problems are examples of irregular computations. In these computations [15], [12], [6], [10], the data size changes dynamically at runtime, leading to non-uniform memory access and communication latencies. The load, or amount of work to be distributed to the threads, is not known a priori. We could consider matrix-vector multiplication an irregular problem if A is a sparse matrix. Since A is instance specific, the structure of A is unknown at compile time. A matrix is not necessarily the correct data structure to use, since the many zeros in it would waste memory resources. In such problems, accesses to data often have poor spatial and temporal locality, leading to ineffective use of the memory hierarchy [15].
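For contrast, here is a sketch of the sparse case using the common compressed sparse row (CSR) layout; the layout and the dynamic schedule are our assumptions for illustration, not choices made in the paper. The work per row is now instance specific, and the indirect access x[col[k]] is what breaks spatial locality.

    /* Sparse matrix-vector multiplication b = A*x, with A in CSR form.
     * row_ptr[i] .. row_ptr[i+1] spans the nonzeros of row i, so the
     * amount of work per row is only known at runtime; a dynamic
     * schedule lets the runtime balance the load across threads. */
    void spmv_csr(int n, const int *row_ptr, const int *col,
                  const double *val, const double *x, double *b) {
        #pragma omp parallel for schedule(dynamic, 64)
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                sum += val[k] * x[col[k]]; /* indirect, irregular access */
            b[i] = sum;
        }
    }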

 
 

It is important to find efficient solutions to irregular problems. Irregular adaptive methods [1], [6], for example, have applications in many science and engineering problems. With multicores becoming very popular, having a standard programming language that addresses both irregular and regular applications is very important, and OpenMP is one such language. In the literature, some works [13], [3], [4] have shown that OpenMP produces reduced performance when dealing with irregular computations; the earlier versions of OpenMP were not meant to handle them [8]. In 2008, the OpenMP 3.0 version introduced a directive called "task" to help develop parallel algorithms for irregular applications. The "task" directive creates independent units of work to be executed. A task in OpenMP 3.0 is a unit of work that can be created and destroyed as needed and is executed by the threads of the team; a task can also spawn other tasks, which was not possible under the previous version of OpenMP. Spawning tasks allows dynamic creation of work, incorporating fine-grained parallelism and exploiting load balancing at runtime, which is important for performance improvement in irregular computations.
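A minimal sketch of the task directive in C (our illustration; the linked-list traversal is a stand-in for any irregular workload whose size is unknown at compile time): one thread discovers the work, while the other threads in the team execute the tasks as they are created.

    #include <stdio.h>

    typedef struct node { int value; struct node *next; } node;

    void process(node *p) { printf("processing %d\n", p->value); }

    /* Walk a linked list and create one task per node. The single
     * construct lets one thread generate the tasks; the rest of the
     * team picks them up, balancing the load at runtime. */
    void traverse(node *head) {
        #pragma omp parallel
        #pragma omp single
        for (node *p = head; p != NULL; p = p->next) {
            #pragma omp task firstprivate(p)
            process(p);
        }
    }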
In this paper we focus on one graph problem, the all pair shortest path (APSP) problem, and its implementation on OpenMP 3.0.

II. RELATED WORK

APSP can be solved using the Floyd-Warshall algorithm. Venkataraman et al. [14] proposed a blocked algorithm to find APSP. Their algorithm exploits cache locality to optimize cache performance: it divides the adjacency matrix into blocks of B×B, and each block is processed individually in B iterations. They tested their blocked algorithm on two different machines, a Sun Ultra Enterprise 4000/5000 and an SGI O2. Their blocked algorithm delivers a speedup between 1.6 and 1.9 for graphs of 480 to 3200 vertices on the Sun Ultra Enterprise 4000/5000, and between 1.6 and 2 on the SGI O2 for graphs of 240 to 1200 vertices. Likewise, Ma et al. [7] developed a parallel Floyd-Warshall algorithm for multicore architectures on Threading Building Blocks (TBB), a runtime based parallel programming model for C++ in which the programmer specifies tasks that are mapped to threads. However, unlike Venkataraman et al., Ma et al. use both the task and the data level parallelism available in the algorithm to find all pair shortest paths. Their results reveal that the parallel algorithm surpasses the serial and single threaded algorithms by 57.26% and 50.06%, respectively.

Recently, Jasika et al. [5] used Dijkstra's algorithm for APSP. They used OpenMP to parallelize Dijkstra's algorithm, running it to find the single source shortest path from every vertex. They compared the OpenMP implementation to an OpenCL [9] implementation and showed that there was no performance gain in either implementation. This, they showed, is due to the inherently sequential nature of Dijkstra's algorithm, which makes it very difficult to parallelize efficiently.

III. IMPLEMENTATION AND RESULTS

There are two algorithms to find APSP, the Floyd-Warshall and Dijkstra algorithms. As mentioned in Section II, Dijkstra's algorithm is not efficient to use in parallel. Therefore, in this work we consider the Floyd-Warshall algorithm. We use the new directive called "collapse", available in OpenMP 3.0, to handle nested loops. This directive deals efficiently with multi-dimensional loops; in other words, it combines multiple loops into a single loop. Thus, by using the "collapse" directive, we avoid the overhead of spawning the nested loop in the algorithm. We also create a task for each vertex and process the vertices in parallel, since the vertices are independent of each other. Algorithm 1 shows our proposed parallel APSP.

Algorithm 1: Parallel APSP Algorithm
Input: G = (V, E)
begin
    Cost(i, j) ← ∞
    Cost(i, j) ← Weight(i, j) for each edge (i, j) ∈ E
    for i ← 0 to n do in parallel
        collapse(2)
        for j ← 0 to n do
            for k ← 0 to n do
                Cost(j, k) ← min(Cost(j, k), Cost(j, i) + Cost(i, k))
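The following C sketch shows one way Algorithm 1 could map onto OpenMP 3.0 directives. It is our reconstruction, not the authors' code: the dense array layout is an assumption, and we keep the loop over the intermediate vertex i sequential, since round i depends on the results of round i-1, letting the collapse(2) clause merge the j and k loops and divide their combined iteration space among the threads.

    #define MIN(a, b) ((a) < (b) ? (a) : (b))

    /* cost is a dense n x n distance matrix, stored row-major and
     * initialized to the edge weights (a large value where no edge
     * exists). After the loop, cost[j*n + k] holds the shortest
     * distance from vertex j to vertex k. */
    void apsp_floyd_warshall(int n, double *cost) {
        for (int i = 0; i < n; i++) { /* intermediate vertex */
            #pragma omp parallel for collapse(2)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    cost[j * n + k] = MIN(cost[j * n + k],
                                          cost[j * n + i] + cost[i * n + k]);
        }
    }

With GCC 4.4, the compiler used in the paper, OpenMP 3.0 features such as collapse are enabled by the -fopenmp flag.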
IV. RESULTS

This section shows the results for our parallel APSP algorithm. We report results on an AMD A8 quad-core Accelerated Processing Unit (APU) machine; each core has a clock speed of 3.0 GHz, and the machine has 48 GB of RAM. We used the GCC 4.4 compiler to compile and run the algorithm. We implemented our algorithm on two types of graphs:

• R-MAT graphs: random graphs [2] allowing both high and low degree vertices.

• SSCA#2 graphs: graphs in this category contain highly connected cliques. The clique sizes are distributed uniformly, and inter-clique edges are then generated with a chosen probability.

We used undirected graphs for our experiments. We start from 16 vertices and increase the number of vertices to 4096. We compare the OpenMP 2.5 and the newer OpenMP 3.0 versions for both types of graphs.

TABLE I: Execution time (seconds) on SSCA#2

Number of vertices    OpenMP 3.0    OpenMP 2.5
16                       0.002         0.001
32                       0.003         0.001
64                       0.01          0.004
128                      0.03          0.01
256                      0.11          0.07
512                      0.53          0.50
1024                     3.06          4.06
2048                     19.59         31.81
4096                     158.85        257.47

 
 

TABLE II: Execution time (seconds) on R-MAT

Number of vertices    OpenMP 3.0    OpenMP 2.5
16                       0.002         0.001
32                       0.003         0.001
64                       0.01          0.004
128                      0.03          0.01
256                      0.11          0.07
512                      0.73          0.52
1024                     4.08          3.91
2048                     21.56         31.02
4096                     154.21        251.12

As shown in Tables I and II and the corresponding Figures 1 and 2, the algorithm runs a bit slower on OpenMP 3.0 for small numbers of vertices. However, for large numbers of vertices, the algorithm on OpenMP 3.0 surpasses the one on OpenMP 2.5 by 1.6 times: at 4096 vertices, 257.47/158.85 ≈ 1.6 on SSCA#2 and 251.12/154.21 ≈ 1.6 on R-MAT. The new directive allows effective use of the OpenMP 3.0 threads. By collapsing the loops we make efficient use of the resources and also eliminate any synchronization issues between the two for loops.
[Fig. 1: Execution time in seconds versus number of vertices on SSCA#2, comparing OpenMP 3.0 and OpenMP 2.5.]

[Fig. 2: Execution time in seconds versus number of vertices on R-MAT, comparing OpenMP 3.0 and OpenMP 2.5.]
V. CONCLUSION

In this paper we implemented one graph problem, the all pair shortest path problem, in OpenMP 3.0. We showed that the algorithm runs 1.6 times faster than the OpenMP 2.5 version for two different types of graphs.

ACKNOWLEDGMENTS

Special thanks to the Saudi Cultural Bureau in Canada, which facilitated everything for us and covered the cost of the equipment needed to complete this project.

REFERENCES

[1] R. Biswas and R. C. Strawn. A new procedure for dynamic adaption of three-dimensional unstructured grids. Applied Numerical Mathematics, 13:437–452, 1994.

[2] Deepayan Chakrabarti, Yiping Zhan, and Christos Faloutsos. R-MAT: A recursive model for graph mining. In Proceedings of the Fourth SIAM International Conference on Data Mining (SDM 2004), Lake Buena Vista, FL, USA, 22–24 April 2004.

[3] Eugen Dedu, Stéphane Vialle, and Claude Timsit. Comparison of OpenMP and classical multi-threading parallelization for regular and irregular algorithms. In Proceedings of Software Engineering Applied to Networking & Parallel/Distributed Computing (SNPD 2000), Champagne-Ardenne, France, pages 53–60, 19–21 May 2000.

[4] Dixie Hisley, Gagan Agrawal, Punyam Satya-narayana, and Lori Pollock. Porting and performance evaluation of irregular codes using OpenMP. In Proceedings of the First European Workshop on OpenMP (EWOMP 1999), Lund, Sweden, pages 47–59, 1999.

[5] Nadira Jasika, Naida Alispahic, Arslanagic Elma, Kurtovic Ilvana, Lagumdzija Elma, and Novica Nosovic. Dijkstra's shortest path algorithm serial and parallel execution performance analysis. In Proceedings of the 35th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO 2012), Opatija, Croatia, pages 1811–1815, 21–25 May 2012.

[6] Milind Kulkarni, Keshav Pingali, Bruce Walter, Ganesh Ramanarayanan, Kavita Bala, and L. Paul Chew. Optimistic parallelism requires abstractions. In Proceedings of the 2007 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2007), San Diego, CA, USA, pages 211–222, 2007.

[7] Jian Ma, Ke-ping Li, and Li-yan Zhang. A parallel Floyd-Warshall algorithm based on TBB. In Proceedings of the 2nd IEEE International Conference on Information Management and Engineering (ICIME 2010), Bangkok, Thailand, pages 429–433, 2010.

[8] Timothy G. Mattson. How good is OpenMP. Scientific Programming, 11(2):81–93, 2003.

[9] Aaftab Munshi. The OpenCL specification. Khronos OpenCL Working Group, 1:l1–15, 2009.

[10] Jarek Nieplocha, Andrés Márquez, John Feo, Daniel Chavarría-Miranda, George Chin, Chad Scherrer, and Nathaniel Beagley. Evaluating the potential of multithreaded platforms for irregular scientific computations. In Proceedings of the 4th International Conference on Computing Frontiers (CF 2007), Ischia, Italy, pages 47–58, 7–9 May 2007.

[11] OpenMP. The OpenMP API specification for parallel programming. http://openmp.org/wp/, 1998.

[12] Simone Secchi, Antonino Tumeo, and Oreste Villa. A bandwidth-optimized multi-core architecture for irregular applications. In Proceedings of the 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2012), Ottawa, ON, Canada, pages 580–587, 13–16 May 2012.

[13] Michael Süß and Claudia Leopold. Implementing irregular parallel algorithms with OpenMP. In Proceedings of the 12th International Conference on Parallel Processing (Euro-Par 2006), Dresden, Germany, pages 635–644, 2006.

[14] Gayathri Venkataraman, Sartaj Sahni, and Srabani Mukhopadhyaya. A blocked all-pairs shortest-paths algorithm. Journal of Experimental Algorithmics, 8, December 2003.

[15] Zheng Zhang and Josep Torrellas. Speeding up irregular applications in shared-memory multiprocessors: memory binding and group prefetching. In Proceedings of the 22nd Annual International Symposium on Computer Architecture (ISCA 1995), S. Margherita Ligure, Italy, pages 188–199, 1995.