
Combining Phase Identification and Statistic Modeling for Automated Parallel Benchmark Generation

Published: 15 June 2015

Abstract

    Parallel application benchmarks are indispensable for evaluating and optimizing HPC software and hardware. However, obtaining high-fidelity benchmarks that reflect the scale and complexity of state-of-the-art parallel applications is challenging and costly. Hand-extracted synthetic benchmarks are time- and labor-intensive to create. Real applications themselves, while offering the most accurate performance evaluation, are expensive to compile, port, and reconfigure, and are often simply inaccessible due to security or ownership concerns. This work contributes APPrime, a novel tool for trace-based automatic parallel benchmark generation. Taking as input standard communication and I/O traces of an application's execution, it couples accurate automatic phase identification with statistical regeneration of event parameters to create compact, portable, and to some degree reconfigurable parallel application benchmarks. Experiments with four NAS Parallel Benchmarks (NPB) and three real scientific simulation codes confirm the fidelity of APPrime benchmarks: they retain the original applications' performance characteristics, in particular their relative performance across platforms. Moreover, the resulting benchmarks, already released online, are much more compact and easier to port than the original applications.
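    The statistical-regeneration idea described above can be sketched in miniature. The snippet below is a hypothetical illustration, not APPrime's actual implementation: it fits a first-order Markov chain over event types in a toy communication/I/O trace and then samples a synthetic event sequence from the fitted chain. The function names and the toy trace are invented for illustration only.

```python
import random
from collections import defaultdict

def fit_markov_chain(events):
    """Count first-order transitions between consecutive trace events
    and normalize them into transition probabilities."""
    counts = defaultdict(lambda: defaultdict(int))
    for prev, curr in zip(events, events[1:]):
        counts[prev][curr] += 1
    return {state: {nxt: c / sum(succ.values()) for nxt, c in succ.items()}
            for state, succ in counts.items()}

def generate(chain, start, length, rng):
    """Sample a synthetic event sequence from the fitted chain,
    stopping early if a terminal event (no outgoing edges) is reached."""
    seq, state = [start], start
    while len(seq) < length and state in chain:
        targets, probs = zip(*chain[state].items())
        state = rng.choices(targets, weights=probs)[0]
        seq.append(state)
    return seq

# Toy communication/I/O trace (event types only; real traces would
# also carry parameters such as message sizes and timestamps).
trace = ["MPI_Isend", "MPI_Irecv", "MPI_Wait", "compute",
         "MPI_Isend", "MPI_Irecv", "MPI_Wait", "MPI_File_write"]
chain = fit_markov_chain(trace)
synthetic = generate(chain, "MPI_Isend", 6, random.Random(0))
```

    A real trace-driven generator would additionally model per-event parameters (message sizes, targets, inter-event delays), typically with per-phase histograms rather than a single global chain.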




    Published In

    ACM SIGMETRICS Performance Evaluation Review, Volume 43, Issue 1
    June 2015
    468 pages
    ISSN: 0163-5999
    DOI: 10.1145/2796314
    • SIGMETRICS '15: Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems
      June 2015
      488 pages
      ISBN: 9781450334860
      DOI: 10.1145/2745844
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. asynchronous I/O
    2. benchmark generation
    3. HPC applications
    4. Markov chain model
    5. phase identification
    6. traces

    Qualifiers

    • Research-article


    Cited By

    • (2021) Lossy Compression of Communication Traces Using Recurrent Neural Networks. IEEE Transactions on Parallel and Distributed Systems. 10.1109/TPDS.2021.3132417
    • (2020) GIFT. Proceedings of the 18th USENIX Conference on File and Storage Technologies, 103-120. 10.5555/3386691.3386702
    • (2020) Uncovering access, reuse, and sharing characteristics of I/O-intensive files on large-scale production HPC systems. Proceedings of the 18th USENIX Conference on File and Storage Technologies, 91-102. 10.5555/3386691.3386701
    • (2020) Taming I/O variation on QoS-less HPC storage. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1-13. 10.5555/3433701.3433715
    • (2019) Automatic generation of benchmarks for I/O-intensive parallel applications. Journal of Parallel and Distributed Computing, 124, 1-13. 10.1016/j.jpdc.2018.10.004
    • (2018) BenchBox: A User-Driven Benchmarking Framework for Fat-Client Storage Systems. IEEE Transactions on Parallel and Distributed Systems, 29(10), 2191-2205. 10.1109/TPDS.2018.2819657
    • (2017) P4. Proceedings of the 36th International Conference on Computer-Aided Design, 683-690. 10.5555/3199700.3199791
    • (2017) P4: Phase-based power/performance prediction of heterogeneous systems via neural networks. 2017 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 683-690. 10.1109/ICCAD.2017.8203843
    • (2016) Replicating HPC I/O workloads with proxy applications. Proceedings of the 1st Joint International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems, 13-18. 10.5555/3019046.3019049
    • (2016) Replicating HPC I/O Workloads with Proxy Applications. 2016 1st Joint International Workshop on Parallel Data Storage and Data Intensive Scalable Computing Systems (PDSW-DISCS), 13-18. 10.1109/PDSW-DISCS.2016.007
