Revisiting the Parallel Strategy for DOACROSS Loops

Liu, Song; Cui, Yuan-Zhen; Zou, Nian-Jun; Zhu, Wen-Hao; Zhang, Dong; Wu, Wei-Guo

doi:10.1007/s11390-019-1919-7

Revisiting the Parallel Strategy for DOACROSS Loops

Regular Paper
Published: 22 March 2019

Volume 34, pages 456–475, (2019)
Cite this article

Journal of Computer Science and Technology Aims and scope Submit manuscript

Song Liu¹,
Yuan-Zhen Cui¹,
Nian-Jun Zou¹,
Wen-Hao Zhu²,
Dong Zhang^3,4 &
…
Wei-Guo Wu¹

128 Accesses
3 Citations
Explore all metrics

Abstract

DOACROSS loops are significant parts in many important scientific and engineering applications, which are generally exploited pipeline/wave-front parallelism by loop transformations. However, previous work almost statically performs iterations in parallel threads, thus causing a waste of computing resources in thread synchronization. This paper proposes a brand-new parallel strategy for DOACROSS loops that provides a dynamic task assignment with reduced dependences to achieve wave-front parallelism through loop tiling. The proposed strategy uses a master-slave parallel mode and some customized structures to realize dynamic and flexible parallelization, which effectively avoids threads from waiting in communication. An efficient tile size selection (TSS) approach is also proposed to preserve data reuse in cache for tiled codes. The experimental results show that the proposed parallel strategy obtains good and stable speedups over six typical benchmarks with different problem sizes and different numbers of threads on an Intel^® Xeon^® 32-core server. And it outperforms two static strategies, a barrier-based strategy and a post/wait-based strategy, by 32% and 20% in average performance, respectively. This strategy also yields a better performance than a mutex-based dynamic strategy. Besides, it has been demonstrated that the proposed TSS approach can achieve a near-optimal performance and is comparable with a state-of-the-art TSS approach.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Artificial Intelligence

References

Cytron R. DOACROSS: Beyond vectorization for multiprocessors. In Proc. the 15th Int. Conf. Parallel Processing, August 1986, pp.836-844.
Hackbusch W. Iterative Solution of Large Sparse Systems of Equations (2nd edition). Springer, 2016.
Quarteroni A, Valli A. Numerical Approximation of Partial Differential Equations (1st edition). Springer, 1994.
Versteeg H K, Malalasekera W. An Introduction to Computational Fluid Dynamics: The Finite Volume Method. London: Longman Scientific and Technical, 1995.
Google Scholar
Midkiff S, Padua D. Compiler algorithms for synchronization. IEEE Trans. Computers, 1987, C-36(12): 1485-1495.
Article MATH Google Scholar
Wolfe M. Multiprocessor synchronization for concurrent loops. Software IEEE, 1988, 5(1): 34-42.
Article Google Scholar
Su H M, Yew P. On data synchronization for multiprocessors. In Proc. the 16th Annual Int. Symp. Computer Architecture, May 1989, pp.416-423.
Chen D, Torrellas J, Yew P. An efficient algorithm for the run-time parallelization of DOACROSS loops. In Proc. ACM/IEEE Supercomputing, November 1994, pp.518-527.
Xue J. Loop Tiling for Parallelism. Springer, 2000.
Wolf M, Lam S. A data locality optimizing algorithm. In Proc. the 12th ACM SIGPLAN Conf. Programming Language Design and Implementation, June 1991, pp.30-44.
Baghdadi R, Cohen A, Verdoolaege S, Trifunović K. Improved loop tiling based on the removal of spurious false dependences. ACM Trans. Architecture and Code Optimization, 2013, 9(4): Article No. 52.
Wonnacott D, Strout M. On the scalability of loop tiling techniques. In Proc. the 3rd Int. Workshop on Polyhedral Compilation Techniques, January 2013, pp.3-11.
Bondhugula U, Baskaran M, Krishnamoorthy S, Ramanujam J, Rountev A, Sadayappan P. Automatic transformations for communication-minimized parallelization and locality optimization in the polyhedral model. In Proc. the 17th Int. Conf. Compiler Construction, March 2008, pp.132-146.
Unnikrishnan P, Shirako J, Barton K, Chatterjee S, Silvera R, Sarkar V. A practical approach to DOACROSS parallelization. In Proc. the 18th Int. Conf. Parallel Processing, August 2012, pp.219-231.
Krothapalli V P, Sadayappan P. Removal of redundant dependences in DOACROSS loops with constant dependences. IEEE Trans. Parallel and Distributed Systems, 1991, 2(3): 281-289.
Article Google Scholar
Rajamony R, Cox A L. Optimally synchronizing DOACROSS loops on shared memory multiprocessors. In Proc. Int. Conf. Parallel Architectures and Compilation Techniques, November 1997, pp.214-224.
Chen D, Yew P. Statement re-ordering for DO-ACROSS loops. In Proc. Int. Conf. Parallel Processing, August 1994, pp.24-28.
Chen D, Yew P. On effective execution of nonuniform DOACROSS loops. IEEE Trans. Parallel and Distributed Systems, 1996, 7(5): 463-476.
Article Google Scholar
Chen D, Yew P. Redundant synchronization elimination for DOACROSS loops. In Proc. the 8th Int. Parallel Processing Symp., April 1994, pp.477-481.
Kwok Y K, Ahmad I. Dynamic critical-path scheduling: An effective technique for allocating task graphs to multiprocessors. IEEE Trans. Parallel and Distributed Systems, 1996, 7(5): 506-521.
Article Google Scholar
Chase D, Lev Y. Dynamic circular work-stealing deque. In Proc. the 17th Annual ACM Symp. Parallelism in Algorithms and Architectures, July 2005, pp.21-28.
Guo Y, Zhao J, Cave V, Sarkar V. SLAW: A scalable locality-aware adaptive work-stealing scheduler for multicore systems. In Proc. the 15th ACM SIGPLAN Symp. Principles and Practice of Parallel Programming, January 2010, pp.341-342.
Cui Y, Liu S, Zou N, Wu W. A dynamic parallel strategy for DOACROSS loops. In Proc. Int. Conf. High Performance Computing in Asia-Pacific Region, January 2018, pp.108-115.
Renganarayanan L, Kim D, Strout M M, Rajopadhye S. Parameterized loop tiling. ACM Trans. Programming Languages and Systems, 2012, 34(1): Article No. 3.
Chame J, Moon S. A tile selection algorithm for data locality and cache interference. In Proc. the 13th Int. Conf. Supercomputing, June 1999, pp.492-499.
Fraguela B B, Carmueja M G, Andrade D. Optimal tile size selection guided by analytical models. In Proc. Int. Conf. Parallel Computing, September 2005, pp.565-572.
Yuki T, Renganarayanan L, Rajopadhye S, Anderson C, Eichenberger A E, O’Brien K. Automatic creation of tile size selection models. In Proc. the 8th Annual IEEE/ACM Int. Symp. Code Generation and Optimization, April 2010, pp.190-199.
Mehta S, Beeraka G, Yew P. Tile size selection revisited. ACM Trans. Architecture and Code Optimization, 2013, 10(4): Article No. 35.
Mehta S, Garg R, Trivedi N, Yew P. Turbo tiling: Leveraging prefetching to boost performance of tiled codes. In Proc. the 30th Int. Conf. Supercomputing, June 2016, Article No. 38.
Rivera G, Tseng C W. Tiling optimizations for 3D scientific computations. In Proc. ACM/IEEE Conf. Supercomputing, November 2000, Article No. 32.
Fu H, Liao J, Yang J, Wang L, Song Z, Huang X, Yang C, Xue W, Liu F, Qiao F, Zhao W, Yin X, Hou C, Zhang C, Ge W, Zhang J, Wang Y, Zhou C, Yang G. The Sunway TaihuLight supercomputer: System and applications. Science China Information Sciences, 2016, 59(7): 072001.
Article Google Scholar

Download references

Author information

Authors and Affiliations

School of Electronic Information and Engineering, Xi’an Jiaotong University, Xi’an, 710049, China
Song Liu, Yuan-Zhen Cui, Nian-Jun Zou & Wei-Guo Wu
School of Computer Engineering and Science, Shanghai University, Shanghai, 200444, China
Wen-Hao Zhu
Xi’an Research Institute of Surveying and Mapping, Xi’an, 710054, China
Dong Zhang
State Key Laboratory of Geo-Information Engineering, Xi’an, 710054, China
Dong Zhang

Authors

Song Liu
View author publications
You can also search for this author in PubMed Google Scholar
Yuan-Zhen Cui
View author publications
You can also search for this author in PubMed Google Scholar
Nian-Jun Zou
View author publications
You can also search for this author in PubMed Google Scholar
Wen-Hao Zhu
View author publications
You can also search for this author in PubMed Google Scholar
Dong Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Wei-Guo Wu
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Wei-Guo Wu.

Electronic supplementary material

ESM 1

(PDF 134 kb)

Rights and permissions

Reprints and permissions

About this article

Cite this article

Liu, S., Cui, YZ., Zou, NJ. et al. Revisiting the Parallel Strategy for DOACROSS Loops. J. Comput. Sci. Technol. 34, 456–475 (2019). https://doi.org/10.1007/s11390-019-1919-7

Download citation

Received: 10 November 2017
Revised: 25 January 2019
Published: 22 March 2019
Issue Date: March 2019
DOI: https://doi.org/10.1007/s11390-019-1919-7

Keywords

Access this article

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Revisiting the Parallel Strategy for DOACROSS Loops

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Partition Scheduling on Heterogeneous Multicore Processors for Multi-dimensional Loops Applications

Just in Time Load Balancing

Polygonal Iteration Space Partitioning

References

Author information

Authors and Affiliations

Corresponding author

Electronic supplementary material

ESM 1

Rights and permissions

About this article

Cite this article

Keywords

Subscribe and save

Buy Now

Navigation

Revisiting the Parallel Strategy for DOACROSS Loops

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Partition Scheduling on Heterogeneous Multicore Processors for Multi-dimensional Loops Applications

Just in Time Load Balancing

Polygonal Iteration Space Partitioning

Explore related subjects

References

Author information

Authors and Affiliations

Corresponding author

Electronic supplementary material

ESM 1

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Subscribe and save

Buy Now

Search

Navigation