research-article
Open access

Efficient and Scalable Execution of Fine-Grained Dynamic Linear Pipelines

Published: 18 April 2019

Abstract

We present Pipelite, a dynamic scheduler that exploits the properties of dynamic linear pipelines to achieve high performance for fine-grained workloads. The flexibility of Pipelite allows the stages and their data dependences to be determined at runtime. Pipelite unifies communication, scheduling, and synchronization algorithms with suitable data structures. This unified design introduces the local suspension mechanism and a wait-free enqueue operation, which allow efficient dynamic scheduling. The evaluation on a 44-core machine, using programs from three widely used benchmark suites, shows that Pipelite incurs low overhead and significantly outperforms the state of the art in terms of speedup, scalability, and memory usage.
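To make the notion of a linear pipeline with runtime-determined stages concrete, the following minimal sketch computes when each stage could finish under one simple dependence model. It is not Pipelite's algorithm, and everything in it is an assumption made for illustration: the function name `pipeline_finish_times`, the per-stage costs, unlimited cores, and the rule that stage s of iteration i waits for stage s-1 of the same iteration and for stage s of the immediately preceding iteration (when that stage exists).

```python
from typing import List


def pipeline_finish_times(costs: List[List[float]]) -> List[List[float]]:
    """Return finish[i][s], the finish time of stage s in iteration i.

    costs[i][s] is a hypothetical running time. Rows may differ in length,
    which models iterations whose stage count is only known at runtime.
    Assumed dependence model (not taken from the paper): a stage starts once
    the previous stage of its own iteration and the same stage of the
    previous iteration have both finished.
    """
    finish: List[List[float]] = []
    for i, row in enumerate(costs):
        prev = finish[i - 1] if i > 0 else []
        cur: List[float] = []
        for s, cost in enumerate(row):
            # Dependence on the preceding stage of the same iteration.
            ready_same_iter = cur[s - 1] if s > 0 else 0.0
            # Dependence on the same stage of the previous iteration, if any.
            ready_prev_iter = prev[s] if s < len(prev) else 0.0
            cur.append(max(ready_same_iter, ready_prev_iter) + cost)
        finish.append(cur)
    return finish


if __name__ == "__main__":
    # Three iterations; the second one has only two stages.
    example = [[1, 2, 1], [1, 2], [1, 2, 1]]
    times = pipeline_finish_times(example)
    sequential = sum(sum(row) for row in example)
    print(times, "pipelined:", times[-1][-1], "sequential:", sequential)
```

Under these assumptions the example overlaps iterations, so the last stage finishes at time 8 instead of the 11 time units a sequential run would need. A real scheduler also has to pay for communication and synchronization between stages, which this model ignores.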


Published In

ACM Transactions on Architecture and Code Optimization, Volume 16, Issue 2
June 2019
317 pages
ISSN:1544-3566
EISSN:1544-3973
DOI:10.1145/3325131
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 April 2019
Accepted: 01 January 2019
Revised: 01 December 2018
Received: 01 August 2018
Published in TACO Volume 16, Issue 2

Author Tags

  1. Dynamic linear pipeline
  2. dynamic communication
  3. dynamic partitioning
  4. dynamic synchronization
  5. multi-threading: dynamic scheduling
  6. parallelization directives

Qualifiers

  • Research-article
  • Research
  • Refereed


  • Downloads (Last 12 months): 114
  • Downloads (Last 6 weeks): 19
Reflects downloads up to 03 Sep 2024

Cited By

  • (2023) Studying the expressiveness and performance of parallelization abstractions for linear pipelines. Proceedings of the 14th International Workshop on Programming Models and Applications for Multicores and Manycores, 29-38. DOI: 10.1145/3582514.3582522. Online publication date: 25-Feb-2023.
  • (2022) Design and Implementation for Nonblocking Execution in GraphBLAS: Tradeoffs and Performance. ACM Transactions on Architecture and Code Optimization 20:1, 1-23. DOI: 10.1145/3561652. Online publication date: 17-Nov-2022.
  • (2022) Nonblocking execution in GraphBLAS. 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 230-233. DOI: 10.1109/IPDPSW55747.2022.00051. Online publication date: May-2022.
  • (2019) Chunking for Dynamic Linear Pipelines. ACM Transactions on Architecture and Code Optimization 16:4, 1-25. DOI: 10.1145/3363815. Online publication date: 18-Nov-2019.
