DOI: 10.1145/2807591.2807644

Scientific benchmarking of parallel computing systems: twelve ways to tell the masses when reporting performance results

Published: 15 November 2015
    Abstract

    Measuring and reporting performance of parallel computers constitutes the basis for scientific advancement of high-performance computing (HPC). Most scientific reports show performance improvements of new techniques and are thus obliged to ensure reproducibility or at least interpretability. Our investigation of a stratified sample of 120 papers across three top conferences in the field shows that the state of the practice is lacking. For example, it is often unclear if reported improvements are deterministic or observed by chance. In addition to distilling best practices from existing work, we propose statistically sound analysis and reporting techniques and simple guidelines for experimental design in parallel computing and codify them in a portable benchmarking library. We aim to improve the standards of reporting research results and initiate a discussion in the HPC field. A wide adoption of our minimal set of rules will lead to better interpretability of performance results and improve the scientific culture in HPC.
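
    As a concrete illustration of the kind of statistically sound reporting the abstract calls for, the sketch below summarizes repeated runtime measurements by their median together with a nonparametric ~95% confidence interval derived from order statistics, rather than a bare mean. It is a minimal, hypothetical Python example, not the paper's benchmarking library; the helper median_ci and the sample runtimes are invented for illustration.

```python
# Minimal sketch (illustrative only, not the paper's library): summarize a
# benchmark's repeated runtimes by the median plus a nonparametric ~95%
# confidence interval on the median, using the normal approximation to the
# binomial to pick the rank bounds.
import math
import statistics

def median_ci(samples, confidence=0.95):
    """Return (median, lower, upper); bounds are order statistics of the data."""
    data = sorted(samples)
    n = len(data)
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2.0)   # ~1.96 for 95%
    j = max(1, int(math.floor((n - z * math.sqrt(n)) / 2.0)))     # lower rank (1-based)
    k = min(n, int(math.ceil(1 + (n + z * math.sqrt(n)) / 2.0)))  # upper rank (1-based)
    return statistics.median(data), data[j - 1], data[k - 1]

# Hypothetical measurements (seconds) of one benchmark configuration.
runtimes = [1.02, 0.98, 1.10, 1.05, 0.97, 1.21, 1.00, 1.03, 0.99, 1.04,
            1.07, 1.01, 0.96, 1.15, 1.02, 1.00, 1.08, 0.98, 1.03, 1.06,
            1.01, 0.99, 1.12, 1.04, 1.02, 0.97, 1.09, 1.00, 1.05, 1.03]

med, lo, hi = median_ci(runtimes)
print(f"median = {med:.3f} s, ~95% CI = [{lo:.3f}, {hi:.3f}] s")
```

    Reported this way, a claimed improvement whose confidence interval overlaps heavily with the baseline's is immediately recognizable as one that may have been observed by chance.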




    Published In

    SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
    November 2015
    985 pages
    ISBN:9781450337236
    DOI:10.1145/2807591
    General Chair: Jackie Kern
    Program Chair: Jeffrey S. Vetter
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 15 November 2015


    Author Tags

    1. benchmarking
    2. data analysis
    3. parallel computing
    4. statistics

    Qualifiers

    • Research-article

    Conference

    SC15

    Acceptance Rates

    SC '15 Paper Acceptance Rate 79 of 358 submissions, 22%;
    Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
