Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
10.5555/2665671.2665705acmconferencesArticle/Chapter ViewAbstractPublication PagesiscaConference Proceedingsconference-collections
research-article

HELIX-RC: an architecture-compiler co-design for automatic parallelization of irregular programs

Published: 14 June 2014 Publication History

Abstract

Data dependences in sequential programs limit parallelization because extracted threads cannot run independently. Although thread-level speculation can avoid the need for precise dependence analysis, communication overheads required to synchronize actual dependences counteract the benefits of parallelization. To address these challenges, we propose a lightweight architectural enhancement co-designed with a parallelizing compiler, which together can decouple communication from thread execution. Simulations of these approaches, applied to a processor with 16 Intel Atom-like cores, show an average of 6.85x performance speedup for six SPEC CINT2000 benchmarks

References

[1]
Randy Allen and Ken Kennedy. Optimizing compilers for modern architectures. Morgan Kaufmann, 2002.
[2]
Shekhar Borkar, Robert Cohn, George Cox, Sha Gleason, Thomas Gross, H. T. Kung, Monica Lam, Brian Moore, Craig Peterson, John Pieper, Linda Rankin, P. S. Tseng, Jim Sutton, John Urbanski, and Jon Webb. iWarp: An integrated solution to high-speed parallel computing. In Supercomputing, 1988.
[3]
Matthew J. Bridges, Neil Vachharajani, Yun Zhang, Thomas Jablin, and David I. August. Revisiting the sequential programming model for multicore. In MICRO, 2007.
[4]
Doug Burger, James R. Goodman, and Alain Kägi. Memory bandwidth limitations of future microprocessors. In ISCA, 1996.
[5]
Simone Campanoni, Giovanni Agosta, Stefano Crespi Reghizzi, and Andrea Di Biagio. A Highly Flexible, Parallel Virtual Machine: Design and Experience of ILDJIT. In Software: Practice and Experience, 2010.
[6]
Simone Campanoni, Timothy M. Jones, Glenn Holloway, Vijay Janapa Reddi, Gu-Yeon Wei, and David Brooks. HELIX: Automatic Parallelization of Irregular Programs for Chip Multiprocessing. In CGO, 2012.
[7]
Simone Campanoni, Timothy M. Jones, Glenn Holloway, Gu-Yeon Wei, and David Brooks. HELIX: Making the Extraction of Thread-Level Parallelism Mainstream. In IEEE Micro, 2012.
[8]
Ramkrishna Chatterjee, Barbara G. Ryder, andWilliam A. Landi. Relevant Context Inference. In POPL, 1999.
[9]
Lynn Choi and Pen-Chung Yew. Compiler and hardware support for cache coherence in large-scale multiprocessors: Design considerations and performance study. In ISCA, 1996.
[10]
Ron Cytron. DOACROSS: Beyond vectorization for multiprocessors. In ICPP, 1986.
[11]
Alain Deutsch. A storeless model of aliasing and its abstractions using finite representations of right-regular equivalence relations. In ICCL, 1992.
[12]
Paul Gratz, Changkyu Kim, Karthikeyan Sankaralingam, Heather Hanson, Premkishore Shivakumar, Stephen W. Keckler, and Doug Burger. On-Chip Interconnection Networks of the TRIPS Chip. In IEEE Micro, 2007.
[13]
Bolei Guo, Matthew J. Bridges, Spyridon Triantafyllis, Guilherme Ottoni, Easwaran Raman, and David I. August. Practical and accurate low-level pointer analysis. In CGO, 2005.
[14]
Greg Hamerly, Erez Perelman, and Brad Calder. How to use simpoint to pick simulation points. In ACM SIGMETRICS Performance Evaluation Review, 2004.
[15]
Lance Hammond, Benedict A. Hubbert, Michael Siu, Manohar K. Prabhu, Michael K. Chen, and Kunle Olukotun. The Stanford Hydra CMP. In IEEE Micro, 2000.
[16]
Jialu Huang, Arun Raman, Thomas B. Jablin, Yun Zhang, Tzu-Han Hung, and David I. August. Decoupled software pipelining creates parallelization opportunities. In CGO, 2010.
[17]
Natalie Enright Jerger and Li-Shiuan Peh. On-Chip Networks. Synthesis Lectures on Computer Architecture. Morgan & Claypool, 2009.
[18]
Troy A. Johnson, Rudolf Eigenmann, and T. N. Vijaykumar. Speculative thread decomposition through empirical optimization. In PPoPP, 2007.
[19]
Svilen Kanev, Gu-Yeon Wei, and David Brooks. XIOSim: powerperformance modeling of mobile x86 cores. In ISLPED, 2012.
[20]
Wei Liu, James Tuck, Luis Ceze, Wonsun Ahn, Karin Strauss, Jose Renau, and Josep Torrellas. POSH: A TLS compiler that exploits program structure. In PPoPP, 2006.
[21]
Gabriel H Loh, Samantika Subramaniam, and Yuejian Xie. Zesto: A cycle-level simulator for highly detailed microarchitecture exploration. In ISPASS, 2009.
[22]
Stephen F. Lundstrom and George H. Barnes. A controllable MIMD architecture. In Advanced computer architecture, 1986.
[23]
Milo M. K. Martin. Token coherence. PhD thesis, University of Wisconsin- Madison, 2003.
[24]
Naveen Muralimanohar, Rajeev Balasubramonian, and Norman P. Jouppi. CACTI 6.0: A tool to model large caches. Technical Report 85, HP Laboratories, 2009.
[25]
Alexandru Nicolau, Guangqiang Li, and Arun Kejariwal. Techniques for efficient placement of synchronization primitives. In PPoPP, 2009.
[26]
Alexandru Nicolau, Guangqiang Li, Alexander V. Veidenbaum, and Arun Kejariwal. Synchronization optimizations for efficient execution on multicores. In ICS, 2009.
[27]
Guilherme Ottoni, Ram Rangan, Adam Stoler, and David I. August. Automatic thread extraction with decoupled software pipelining. In MICRO, 2005.
[28]
David K. Poulsen and Pen-Chung Yew. Data prefetching and data forwarding in shared memory multiprocessors. In ICPP, 1994.
[29]
Arun Raman, Hanjun Kim, Thomas R. Mason, Thomas B. Jablin, and David I. August. Speculative parallelization using software multi-threaded transactions. In ASPLOS, 2010.
[30]
Easwaran Raman, Guilherme Ottoni, Arun Raman, Matthew J. Bridges, and David I. August. Parallel-stage decoupled software pipelining. In CGO, 2008.
[31]
Ram Rangan, Neil Vachharajani, Guilherme Ottoni, and David I. August. Performance scalability of decoupled software pipelining. In ACM TACO, 2008.
[32]
Behnam Robatmil, Dong Li, Hadi Esmaeilzadeh, Sibi Govindan, Aaron Smith, Andrew Putnam, Doug Burger, and Stephen W. Keckler. How to Implement Effective Prediction and Forwarding for Fusable Dynamic Multicore Architectures. In HPCA, 2013.
[33]
Paul Rosenfeld, Elliott Cooper-Balis, and Bruce Jacob. DRAMSim2: A Cycle Accurate Memory System Simulator. In IEEE Computer Architecture Letters, 2011.
[34]
Daniel Sanchez, Richard M. Yoo, and Christos Kozyrakis. Flexible architectural support for fine-grain scheduling. In ASPLOS, 2010.
[35]
Karthikeyan Sankaralingam, Ramadass Nagarajan, Haiming Liu, Changkyu Kim, Jaehyuk Huh, Nitya Ranganathan, Doug Burger, Stephen W. Keckler, Robert G. McDonald, and Charles R. Moore. TRIPS: A polymorphous architecture for exploiting ILP, TLP, and DLP. In ACM TACO, 2004.
[36]
Steven L. Scott. Synchronization and Communication in the T3E Multiprocessor. In ASPLOS, 1996.
[37]
Larry Seiler, Doug Carmean, Eric Sprangle, Tom Forsyth, Michael Abrash, Pradeep Dubey, Stephen Junkins, Adam Lake, Jeremy Sugerman, Robert Cavin, Roger Espasa, Ed Grochowski, Toni Juan, and Pat Hanrahan. Larrabee: a many-core x86 architecture for visual computing. In ACM Transactions on Graphics, 2008.
[38]
Gurindar S. Sohi, Scott E. Breach, and T. N. Vijaykumar. Multiscalar processors. In ISCA, 1995.
[39]
J. Gregory Steffan, Christopher Colohan, Antonia Zhai, and Todd C. Mowry. The STAMPede approach to thread-level speculation. In ACM Transactions on Computer Systems, 2005.
[40]
J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry. Improving value communication for thread-level speculation. In HPCA, 2002.
[41]
Michael Bedford Taylor, Jason Kim, Jason Miller, David Wentzlaff, Fae Ghodrat, Ben Greenwald, Henry Hoffman, Paul Johnson, Jae-Wook Lee, Walter Lee, Albert Ma, Arvind Saraf, Mark Seneski, Nathan Shnidman, Volker Strumpen, Matt Frank, Saman Amarasinghe, and Anant Ararwal. The RAW microprocessor: A computational fabric for software circuits and general-purpose programs. In IEEE Micro, 2002.
[42]
Michael Bedford Taylor, Walter Lee, Saman P. Amarasinghe, and Anant Agarwal. Scalar Operand Networks. In IEEE Transactions on Parallel Distributed Systems, 2005.
[43]
Georgios Tournavitis, Zheng Wang, Björn Franke, and Michael F. P. O'Boyle. Towards a holistic approach to auto-parallelization. In PLDI, 2009.
[44]
Rob F. van der Wijngaart, Timothy G. Mattson, and Werner Haas. Lightweight communications on Intel's single-chip cloud computer processor. In SIGOPS Operating Systems Review, 2011.
[45]
Hans Vandierendonck, Sean Rul, and Koen De Bosschere. The paralax infrastructure: Automatic parallelization with a helping hand. In PACT, 2010.
[46]
David Wentzlaff, Patrick Griffin, Henry Hoffmann, Liewei Bao, Bruce Edwards, Carl Ramey, Matthew Mattina, Chyi-Chang Miao, John F. Brown, III, and Anant Agarwal. On-chip interconnection architecture of the tile processor. In IEEE Micro, 2007.
[47]
Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan, and Todd C. Mowry. Compiler optimization of scalar value communication between speculative threads. In ASPLOS, 2002.
[48]
Antonia Zhai, J. Gregory Steffan, Christopher B. Colohan, and Todd C. Mowry. Compiler and hardware support for reducing the synchronization of speculative threads. In ACM TACO, 2008.
[49]
Hongtao Zhong, Mojtaba Mehrara, Steve Lieberman, and Scott Mahlke. Uncovering hidden loop level parallelism in sequential applications. In HPCA, 2008.

Cited By

View all
  • (2020)PANDORAACM Transactions on Embedded Computing Systems10.1145/339189919:5(1-17)Online publication date: 11-Nov-2020
  • (2019)Predicting New Workload or CPU Performance by Analyzing Public DatasetsACM Transactions on Architecture and Code Optimization10.1145/328412715:4(1-21)Online publication date: 8-Jan-2019
  • (2018)Unconventional Parallelization of Nondeterministic ApplicationsACM SIGPLAN Notices10.1145/3296957.317318153:2(432-447)Online publication date: 19-Mar-2018
  • Show More Cited By
  1. HELIX-RC: an architecture-compiler co-design for automatic parallelization of irregular programs

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    ISCA '14: Proceeding of the 41st annual international symposium on Computer architecuture
    June 2014
    566 pages
    ISBN:9781479943944

    Sponsors

    Publisher

    IEEE Press

    Publication History

    Published: 14 June 2014

    Check for updates

    Qualifiers

    • Research-article

    Funding Sources

    Conference

    ISCA'14
    Sponsor:

    Acceptance Rates

    Overall Acceptance Rate 543 of 3,203 submissions, 17%

    Upcoming Conference

    ISCA '25

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)15
    • Downloads (Last 6 weeks)3
    Reflects downloads up to 09 Nov 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2020)PANDORAACM Transactions on Embedded Computing Systems10.1145/339189919:5(1-17)Online publication date: 11-Nov-2020
    • (2019)Predicting New Workload or CPU Performance by Analyzing Public DatasetsACM Transactions on Architecture and Code Optimization10.1145/328412715:4(1-21)Online publication date: 8-Jan-2019
    • (2018)Unconventional Parallelization of Nondeterministic ApplicationsACM SIGPLAN Notices10.1145/3296957.317318153:2(432-447)Online publication date: 19-Mar-2018
    • (2018)Unconventional Parallelization of Nondeterministic ApplicationsProceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems10.1145/3173162.3173181(432-447)Online publication date: 19-Mar-2018
    • (2018)A Case for a More Effective, Power-Efficient Turbo BoostingACM Transactions on Architecture and Code Optimization10.1145/317043315:1(1-22)Online publication date: 22-Mar-2018
    • (2018)Inter-thread communication in multithreaded, reconfigurable coarse-grain arraysProceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture10.1109/MICRO.2018.00013(42-54)Online publication date: 20-Oct-2018
    • (2017)Automatically accelerating non-numerical programs by architecture-compiler co-designCommunications of the ACM10.1145/313946160:12(88-97)Online publication date: 27-Nov-2017
    • (2017)Accelerating Mobile Audio Sensing Algorithms through On-Chip GPU OffloadingProceedings of the 15th Annual International Conference on Mobile Systems, Applications, and Services10.1145/3081333.3081358(306-318)Online publication date: 16-Jun-2017
    • (2016)Cooperative Caching for GPUsACM Transactions on Architecture and Code Optimization10.1145/300158913:4(1-25)Online publication date: 12-Dec-2016
    • (2016)Performance implications of transient loop-carried data dependences in automatically parallelized loopsProceedings of the 25th International Conference on Compiler Construction10.1145/2892208.2892214(23-33)Online publication date: 17-Mar-2016
    • Show More Cited By

    View Options

    Get Access

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media