Given the ubiquity of multicore processors, there is an acute need to enable the development of scalable parallel applications without unduly burdening programmers. Currently, programmers are asked not only to expose parallelism explicitly but also to concern themselves with granularity, load balancing, synchronization, and communication. This thesis demonstrates that when algorithmic parallelism is expressed in the form of a stream program, a compiler can manage that parallelism effectively and automatically. Our compiler assumes responsibility for low-level architectural details, transforming implicit algorithmic parallelism into a mapping that achieves scalable parallel performance on a given multicore target.
Stream programming is characterized by the regular processing of sequences of data, and it is a natural way to express algorithms in audio, video, digital signal processing, networking, and encryption. A streaming computation is represented as a graph of independent computation nodes that communicate explicitly over data channels. Our techniques operate on contiguous regions of the stream graph in which the input and output rates of the nodes are statically determinable. Within such a static region, the compiler first adjusts the granularity automatically and then exploits data, task, and pipeline parallelism in a holistic fashion. We introduce techniques that data-parallelize nodes operating on overlapping sliding windows of their input, translating serializing state into minimal, parameterized inter-core communication. Finally, for nodes that cannot be data-parallelized because they are stateful, we are the first to apply software-pipelining techniques at a coarse granularity to exploit pipeline parallelism between stateful nodes.
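To make the sliding-window idea concrete, here is a minimal Python sketch, assuming a single node with fixed rates that reads a window of `peek` items, emits one output, and then consumes `pop` items per firing (a FIR filter is the classic case). The names `run_filter`, `core_slices`, and `fission` are hypothetical, cores are emulated sequentially, and this is an illustration of the general technique rather than the thesis's actual compiler transformation.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical work function type: maps one peek window to one output item.
Work = Callable[[Sequence[int]], int]


def run_filter(inp: Sequence[int], peek: int, pop: int, work: Work, firings: int) -> List[int]:
    """Fire a sliding-window node sequentially.

    Each firing reads `peek` items starting at its current position, produces
    one output, then advances the window by `pop` items."""
    return [work(inp[f * pop : f * pop + peek]) for f in range(firings)]


def core_slices(peek: int, pop: int, firings: int, cores: int) -> List[Tuple[int, int]]:
    """Input range [start, end) each core needs for its contiguous block of
    firings. Neighbouring ranges overlap by exactly peek - pop items."""
    assert firings % cores == 0, "sketch assumes firings divide evenly among cores"
    per_core = firings // cores
    slices = []
    for c in range(cores):
        start = c * per_core * pop
        end = start + (per_core - 1) * pop + peek
        slices.append((start, end))
    return slices


def fission(inp: Sequence[int], peek: int, pop: int, work: Work, firings: int, cores: int) -> List[int]:
    """Data-parallel version: each emulated core fires its block on its own
    input slice; concatenating outputs in core order matches the sequential run."""
    per_core = firings // cores
    out: List[int] = []
    for start, end in core_slices(peek, pop, firings, cores):
        out.extend(run_filter(inp[start:end], peek, pop, work, per_core))
    return out


if __name__ == "__main__":
    # 3-tap FIR filter: peek = 3, pop = 1.
    taps = [1, 2, 3]
    fir: Work = lambda w: sum(t * x for t, x in zip(taps, w))
    data = list(range(10))
    print(fission(data, 3, 1, fir, 8, 4) == run_filter(data, 3, 1, fir, 8))  # True
```

In this sketch the only extra data each core needs is the `peek - pop` items it shares with its neighbour, and that overlap does not grow with the number of firings assigned to a core, which is one way to read "minimal, parameterized" communication.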
Our framework is evaluated in the context of the StreamIt programming language. StreamIt is a high-level stream programming language that has been shown to improve programmer productivity in implementing streaming algorithms. We use the StreamIt Core benchmark suite of 12 real-world applications to demonstrate the effectiveness of our techniques across several multicore architectures. For a 16-core distributed-memory multicore, we achieve a 14.9x mean speedup. For benchmarks that include sliding-window computation, our sliding-window data-parallelization techniques are required to enable scalable performance on a 16-core SMP multicore (14x mean speedup) and a 64-core distributed shared-memory multicore (52x mean speedup). (Copies available exclusively from MIT Libraries, Rm. 14-0551, Cambridge, MA 02139-4307. Ph. 617-253-5668; Fax 617-253-1690.)
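As a rough reading of these results (a derived figure, not one stated in the abstract), dividing each mean speedup $S$ by the core count $P$ gives the parallel efficiency $E = S/P$ for the three configurations:

$$E = \frac{S}{P}:\qquad \frac{14.9}{16} \approx 0.93,\qquad \frac{14}{16} = 0.875,\qquad \frac{52}{64} = 0.8125$$

Because these are efficiencies of mean speedups, individual benchmarks may fall above or below them.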