research-article
Open access

Integrating profile-driven parallelism detection and machine-learning-based mapping

Published: 01 February 2014
    Abstract

    Compiler-based auto-parallelization is a much-studied area but has yet to find widespread application. This is largely due to the poor identification and exploitation of application parallelism, resulting in disappointing performance far below that which a skilled expert programmer could achieve. We have identified two weaknesses in traditional parallelizing compilers and propose a novel, integrated approach resulting in significant performance improvements of the generated parallel code. Using profile-driven parallelism detection, we overcome the limitations of static analysis, enabling the identification of more application parallelism, and only rely on the user for final approval. We then replace the traditional target-specific and inflexible mapping heuristics with a machine-learning-based prediction mechanism, resulting in better mapping decisions while automating adaptation to different target architectures. We have evaluated our parallelization strategy on the NAS and SPEC CPU2000 benchmarks and two different multicore platforms (dual quad-core Intel Xeon SMP and dual-socket QS20 Cell blade). We demonstrate that our approach not only yields significant improvements when compared with state-of-the-art parallelizing compilers but also comes close to and sometimes exceeds the performance of manually parallelized codes. On average, our methodology achieves 96% of the performance of the hand-tuned OpenMP NAS and SPEC parallel benchmarks on the Intel Xeon platform and gains a significant speedup for the IBM Cell platform, demonstrating the potential of profile-guided and machine-learning-based parallelization for complex multicore platforms.
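
    As an illustration only, the two stages named in the abstract can be sketched in Python. This is not the authors' implementation. The trace format (`Access`), the two-element feature vector, the mapping labels, the toy training set, and the 1-nearest-neighbour predictor are all assumptions made for this example; the abstract does not specify any of them. The first part flags a loop as a parallelization candidate when a profiled run shows no cross-iteration memory conflicts, and the second part picks a mapping from previously labelled examples instead of a fixed heuristic.

```python
from collections import namedtuple

# Hypothetical trace record: loop iteration index, memory address, 'R' or 'W'.
Access = namedtuple("Access", ["iteration", "address", "kind"])


def cross_iteration_dependences(trace):
    """Return {(address, earlier_iter, later_iter)} conflicts observed in the trace.

    A conflict is a pair of accesses to the same address from different
    iterations where at least one access is a write (RAW, WAR or WAW).
    """
    by_address = {}
    for acc in trace:
        by_address.setdefault(acc.address, []).append(acc)
    conflicts = set()
    for address, accesses in by_address.items():
        for i, first in enumerate(accesses):
            for second in accesses[i + 1:]:
                different_iterations = first.iteration != second.iteration
                involves_write = "W" in (first.kind, second.kind)
                if different_iterations and involves_write:
                    lo, hi = sorted((first.iteration, second.iteration))
                    conflicts.add((address, lo, hi))
    return conflicts


def is_parallel_candidate(trace):
    """True if the profiled run showed no cross-iteration conflict.

    The result holds only for the profiled input, which is why a user
    still has to give final approval.
    """
    return not cross_iteration_dependences(trace)


def predict_mapping(features, training_set):
    """Pick the label of the closest training example (1-nearest-neighbour).

    A simple stand-in for a learned mapping model, chosen for brevity.
    """
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    _, label = min(training_set, key=lambda ex: squared_distance(ex[0], features))
    return label


# Trace of `a[i] = b[i] + 1` for i in 0..3 (a at address 100, b at 200).
independent_trace = [
    acc for i in range(4)
    for acc in (Access(i, 200 + i, "R"), Access(i, 100 + i, "W"))
]

# Trace of `a[i] = a[i-1] + 1` for i in 1..3: each iteration reads the
# element written by the previous one.
carried_trace = [
    acc for i in range(1, 4)
    for acc in (Access(i, 100 + i - 1, "R"), Access(i, 100 + i, "W"))
]

# Made-up training data: (iterations in thousands, work per iteration) -> label.
TOY_TRAINING_SET = [
    ((0.1, 0.2), "sequential"),
    ((50.0, 5.0), "parallel-static"),
    ((40.0, 30.0), "parallel-dynamic"),
]
```

    In this sketch `is_parallel_candidate(independent_trace)` holds while `carried_trace` is rejected, and `predict_mapping` returns whichever toy label is nearest in feature space. Retraining on examples collected from a different machine changes the predictions without touching the code, which is the adaptation property the abstract attributes to a learned mapping.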

    Published In

    ACM Transactions on Architecture and Code Optimization  Volume 11, Issue 1
    February 2014
    373 pages
    ISSN:1544-3566
    EISSN:1544-3973
    DOI:10.1145/2591460
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 01 February 2014
    Accepted: 01 September 2013
    Revised: 01 July 2013
    Received: 01 June 2012
    Published in TACO Volume 11, Issue 1

    Author Tags

    1. Auto-parallelization
    2. OpenMP
    3. machine-learning-based parallelism mapping
    4. profile-driven parallelism detection

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months): 107
    • Downloads (Last 6 weeks): 9
    Reflects downloads up to 26 Jul 2024

    Citations

    Cited By

    • (2024) A Survey on Automatic Source Code Transformation for Green Software Generation. Encyclopedia of Sustainable Technologies, 765-779. DOI: 10.1016/B978-0-323-90386-8.00122-4. Online publication date: 2024
    • (2023) Program State Element Characterization. Proceedings of the 21st ACM/IEEE International Symposium on Code Generation and Optimization, 199-211. DOI: 10.1145/3579990.3580011. Online publication date: 17-Feb-2023
    • (2023) Investigating the superiority of Intel oneAPI IFX compiler on Intel CPUs using different optimization levels: A case study on a CFD system. 2023 4th IEEE Global Conference for Advancement in Technology (GCAT), 1-9. DOI: 10.1109/GCAT59970.2023.10353473. Online publication date: 6-Oct-2023
    • (2022) Compiler Optimization Parameter Selection Method Based on Ensemble Learning. Electronics 11:15, 2452. DOI: 10.3390/electronics11152452. Online publication date: 6-Aug-2022
    • (2022) Profile-Guided Parallel Task Extraction and Execution for Domain Specific Heterogeneous SoC. 2022 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom), 913-920. DOI: 10.1109/ISPA-BDCloud-SocialCom-SustainCom57177.2022.00121. Online publication date: Dec-2022
    • (2022) Pattern-based Autotuning of OpenMP Loops using Graph Neural Networks. 2022 IEEE/ACM International Workshop on Artificial Intelligence and Machine Learning for Scientific Applications (AI4S), 26-31. DOI: 10.1109/AI4S56813.2022.00010. Online publication date: Dec-2022
    • (2022) Method for Profile-Guided Optimization of Android Applications Using Random Forest. IEEE Access 10, 109652-109662. DOI: 10.1109/ACCESS.2022.3214971. Online publication date: 2022
    • (2021) Source Code Classification for Energy Efficiency in Parallel Ultra Low-Power Microcontrollers. 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE), 878-883. DOI: 10.23919/DATE51398.2021.9474085. Online publication date: 1-Feb-2021
    • (2021) Adaptive Computation Offloading for Mobile Augmented Reality. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 5:4, 1-30. DOI: 10.1145/3494958. Online publication date: 27-Dec-2021
    • (2021) Reverse engineering for reduction parallelization via semiring polynomials. Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, 820-834. DOI: 10.1145/3453483.3454079. Online publication date: 19-Jun-2021
