research-article

HELIX-RC: an architecture-compiler co-design for automatic parallelization of irregular programs

Authors:

Simone Campanoni,

Kevin Brownell,

Timothy M. Jones,

David Brooks

ISCA '14: Proceeding of the 41st annual international symposium on Computer architecture

Pages 217 - 228

Published: 14 June 2014

Abstract

Data dependences in sequential programs limit parallelization because extracted threads cannot run independently. Although thread-level speculation can avoid the need for precise dependence analysis, communication overheads required to synchronize actual dependences counteract the benefits of parallelization. To address these challenges, we propose a lightweight architectural enhancement co-designed with a parallelizing compiler, which together can decouple communication from thread execution. Simulations of these approaches, applied to a processor with 16 Intel Atom-like cores, show an average of 6.85x performance speedup for six SPEC CINT2000 benchmarks.



Index Terms
1. Software and its engineering
  1. Software notations and tools



Published In

ISCA '14: Proceeding of the 41st annual international symposium on Computer architecture

June 2014

566 pages

ISBN:9781479943944

General Chairs: Pen-Chung Yew (University of Minnesota), Antonia Zhai (University of Minnesota)

Program Chair: Steve Keckler (NVIDIA/University of Texas at Austin)

ACM SIGARCH Computer Architecture News Volume 42, Issue 3
ISCA '14
June 2014
552 pages
ISSN:0163-5964
DOI:10.1145/2678373
Editor: Doug DeGroot

Sponsors

IEEE TCCA: IEEE Computer Society Technical Committee on Computer Architecture
SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

IEEE Press


Qualifiers

Research-article

Funding Sources

Division of Information and Intelligent Systems

Conference

ISCA'14

Sponsor:

IEEE TCCA
SIGARCH

ISCA'14: The 41st Annual International Symposium on Computer Architecture

June 14 - 18, 2014

Minneapolis, Minnesota, USA

Acceptance Rates

Overall Acceptance Rate 543 of 3,203 submissions, 17%



