Efficient and Scalable Execution of Fine-Grained Dynamic Linear Pipelines

Authors: Aristeidis Mastoras, Thomas R. Gross

ACM Transactions on Architecture and Code Optimization (TACO), Volume 16, Issue 2

Article No.: 8, Pages 1 - 26

https://doi.org/10.1145/3307411

Published: 18 April 2019

Abstract

We present Pipelite, a dynamic scheduler that exploits the properties of dynamic linear pipelines to achieve high performance for fine-grained workloads. The flexibility of Pipelite allows the stages and their data dependences to be determined at runtime. Pipelite unifies communication, scheduling, and synchronization algorithms with suitable data structures. This unified design introduces the local suspension mechanism and a wait-free enqueue operation, which allow efficient dynamic scheduling. The evaluation on a 44-core machine, using programs from three widely used benchmark suites, shows that Pipelite incurs low overhead and significantly outperforms the state of the art in terms of speedup, scalability, and memory usage.

Index Terms

Efficient and Scalable Execution of Fine-Grained Dynamic Linear Pipelines
1. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Runtime environments
  2. Software organization and properties
    1. Extra-functional properties
      1. Software performance

Published In

ACM Transactions on Architecture and Code Optimization Volume 16, Issue 2

June 2019

317 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/3325131

Editor:
Koen De Bosschere
Ghent University, Belgium

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 April 2019

Accepted: 01 January 2019

Revised: 01 December 2018

Received: 01 August 2018

Published in TACO Volume 16, Issue 2

Qualifiers

Research-article
Research
Refereed

Funding Sources

Schweizerischer Nationalfonds zur Förderung der Wissenschaftlichen Forschung

Bibliometrics

Total Citations: 4
Total Downloads: 724
Downloads (Last 12 months): 114
Downloads (Last 6 weeks): 19

Reflects downloads up to 03 Sep 2024

Citations

Cited By

A. Mastoras, A. Yzelman, Q. Chen, Z. Huang, and M. Si. 2023. Studying the expressiveness and performance of parallelization abstractions for linear pipelines. In Proceedings of the 14th International Workshop on Programming Models and Applications for Multicores and Manycores. 29--38. Online publication date: 25-Feb-2023.
https://dl.acm.org/doi/10.1145/3582514.3582522
A. Mastoras, S. Anagnostidis, and A. Yzelman. 2022. Design and Implementation for Nonblocking Execution in GraphBLAS: Tradeoffs and Performance. ACM Transactions on Architecture and Code Optimization 20, 1, 1--23. Online publication date: 17-Nov-2022.
https://dl.acm.org/doi/10.1145/3561652
A. Mastoras, S. Anagnostidis, and A. Yzelman. 2022. Nonblocking execution in GraphBLAS. In 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). 230--233. Online publication date: May-2022.
https://doi.org/10.1109/IPDPSW55747.2022.00051
A. Mastoras and T. Gross. 2019. Chunking for Dynamic Linear Pipelines. ACM Transactions on Architecture and Code Optimization 16, 4, 1--25. Online publication date: 18-Nov-2019.
https://dl.acm.org/doi/10.1145/3363815
