DOI: 10.1145/2807591.2807644

Scientific benchmarking of parallel computing systems: twelve ways to tell the masses when reporting performance results

Published: 15 November 2015
    Abstract

    Measuring and reporting performance of parallel computers constitutes the basis for scientific advancement of high-performance computing (HPC). Most scientific reports show performance improvements of new techniques and are thus obliged to ensure reproducibility or at least interpretability. Our investigation of a stratified sample of 120 papers across three top conferences in the field shows that the state of the practice is lacking. For example, it is often unclear if reported improvements are deterministic or observed by chance. In addition to distilling best practices from existing work, we propose statistically sound analysis and reporting techniques and simple guidelines for experimental design in parallel computing and codify them in a portable benchmarking library. We aim to improve the standards of reporting research results and initiate a discussion in the HPC field. A wide adoption of our minimal set of rules will lead to better interpretability of performance results and improve the scientific culture in HPC.
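
    As a concrete illustration of the kind of statistically sound reporting the abstract calls for, the sketch below summarizes repeated runtime measurements by their median together with a nonparametric ~95% confidence interval derived from order statistics, rather than a bare mean. It is a minimal, hypothetical Python example, not the paper's benchmarking library; the helper median_ci and the sample runtimes are invented for illustration.

```python
# Minimal sketch (illustrative only, not the paper's library): summarize a
# benchmark's repeated runtimes by the median plus a nonparametric ~95%
# confidence interval on the median, using the normal approximation to the
# binomial to pick the rank bounds.
import math
import statistics

def median_ci(samples, confidence=0.95):
    """Return (median, lower, upper); bounds are order statistics of the data."""
    data = sorted(samples)
    n = len(data)
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2.0)   # ~1.96 for 95%
    j = max(1, int(math.floor((n - z * math.sqrt(n)) / 2.0)))     # lower rank (1-based)
    k = min(n, int(math.ceil(1 + (n + z * math.sqrt(n)) / 2.0)))  # upper rank (1-based)
    return statistics.median(data), data[j - 1], data[k - 1]

# Hypothetical measurements (seconds) of one benchmark configuration.
runtimes = [1.02, 0.98, 1.10, 1.05, 0.97, 1.21, 1.00, 1.03, 0.99, 1.04,
            1.07, 1.01, 0.96, 1.15, 1.02, 1.00, 1.08, 0.98, 1.03, 1.06,
            1.01, 0.99, 1.12, 1.04, 1.02, 0.97, 1.09, 1.00, 1.05, 1.03]

med, lo, hi = median_ci(runtimes)
print(f"median = {med:.3f} s, ~95% CI = [{lo:.3f}, {hi:.3f}] s")
```

    Reported this way, a claimed improvement whose confidence interval overlaps heavily with the baseline's is immediately recognizable as one that may have been observed by chance.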




    Published In

    SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
    November 2015
    985 pages
    ISBN:9781450337236
    DOI:10.1145/2807591
    General Chair: Jackie Kern
    Program Chair: Jeffrey S. Vetter
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 15 November 2015


    Author Tags

    1. benchmarking
    2. data analysis
    3. parallel computing
    4. statistics

    Qualifiers

    • Research-article

    Conference

    SC15

    Acceptance Rates

    SC '15 Paper Acceptance Rate 79 of 358 submissions, 22%;
    Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
