DOI: 10.1145/3178487.3178497

vSensor: leveraging fixed-workload snippets of programs for performance variance detection

Published: 10 February 2018
    Abstract

    Performance variance is an increasingly challenging problem on current large-scale HPC systems. Even with a fixed number of computing nodes, the execution time of repeated runs can vary significantly. Many parallel programs executing on supercomputers suffer from such variance. Performance variance not only causes unpredictable violations of performance requirements, but also makes program behavior harder to understand. Despite prior efforts, efficient on-line detection of performance variance remains an open problem.
    In this paper, we propose vSensor, a novel approach for light-weight and on-line performance variance detection. The key insight is that, instead of solely relying on an external detector, the source code of a program itself can reveal its runtime performance characteristics. Specifically, many parallel programs contain code snippets that are executed repeatedly with an invariant quantity of work. Based on this observation, we use compiler techniques to automatically identify these fixed-workload snippets and use them as performance variance sensors (v-sensors) that enable effective detection. We evaluate vSensor with a variety of parallel programs on the Tianhe-2 system. Results show that vSensor can effectively detect performance variance on HPC systems, with a performance overhead below 4% at up to 16,384 processes. In particular, with vSensor, we found a bad node with slow memory that degraded a program's performance by 21%. As a showcase, we also detected a severe network performance problem that caused a 3.37X slowdown for an HPC kernel program on the Tianhe-2 system.
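
    To make the v-sensor idea concrete, the following is a minimal, hand-written C sketch of the concept described in the abstract: a snippet whose workload never changes is timed on every invocation, and an invocation that runs much slower than the fastest observation so far is flagged as possible performance variance. This is only an illustration under assumed names and parameters (the fixed_workload kernel, the 1.5x slowdown threshold, and the serial timing loop are all hypothetical); the actual vSensor system identifies such snippets automatically with compiler analysis and performs the detection on-line inside parallel MPI programs.

    /*
     * Illustrative sketch only, not the vSensor implementation: time a
     * fixed-workload snippet on every invocation and flag invocations that
     * are much slower than the fastest run observed so far.
     */
    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <time.h>

    /* Hypothetical fixed-workload snippet: the amount of work never changes. */
    static double fixed_workload(void) {
        double sum = 0.0;
        for (int i = 0; i < 1000000; i++)
            sum += (double)i * 1e-9;
        return sum;
    }

    static double now_seconds(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
    }

    int main(void) {
        double baseline = -1.0;        /* fastest observation so far */
        const double threshold = 1.5;  /* assumed: flag runs 50% slower than baseline */

        for (int iter = 0; iter < 100; iter++) {
            double t0 = now_seconds();
            volatile double r = fixed_workload();  /* volatile keeps the work from being optimized away */
            double elapsed = now_seconds() - t0;
            (void)r;

            if (baseline < 0.0 || elapsed < baseline) {
                baseline = elapsed;    /* update baseline with the fastest run */
            } else if (elapsed > threshold * baseline) {
                printf("iter %d: possible variance, %.3f ms vs baseline %.3f ms\n",
                       iter, elapsed * 1e3, baseline * 1e3);
            }
        }
        return 0;
    }

    Because the snippet's workload is invariant, any large change in its elapsed time can only come from the system (e.g., a slow node or congested network) rather than from the program itself, which is what makes such snippets usable as sensors.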



    Published In

    PPoPP '18: Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
    February 2018, 442 pages
    ISBN: 9781450349826
    DOI: 10.1145/3178487

    Also published in: ACM SIGPLAN Notices, Volume 53, Issue 1 (PPoPP '18)
    January 2018, 426 pages
    ISSN: 0362-1340
    EISSN: 1558-1160
    DOI: 10.1145/3200691
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 10 February 2018


    Author Tags

    1. anomaly detection
    2. compiler analysis
    3. performance variance
    4. system noise

    Qualifiers

    • Research-article

    Funding Sources

    • National Key R&D Program of China

    Conference

    PPoPP '18

    Acceptance Rates

    Overall Acceptance Rate 230 of 1,014 submissions, 23%

    Article Metrics

    • Downloads (Last 12 months): 27
    • Downloads (Last 6 weeks): 5


    Cited By

    • (2021) Efficient detection of silent data corruption in HPC applications with synchronization-free message verification. The Journal of Supercomputing. DOI: 10.1007/s11227-021-03892-4. Online publication date: 9-Jun-2021.
    • (2018) A vision of post-exascale programming. Frontiers of Information Technology & Electronic Engineering, 19(10), 1261-1266. DOI: 10.1631/FITEE.1800442. Online publication date: 28-Nov-2018.
    • (2024) Analysis and prediction of performance variability in large-scale computing systems. The Journal of Supercomputing. DOI: 10.1007/s11227-024-06040-w. Online publication date: 28-Mar-2024.
    • (2023) Production-Run Noise Detection. In Performance Analysis of Parallel Applications for HPC, 199-224. DOI: 10.1007/978-981-99-4366-1_8. Online publication date: 19-Jun-2023.
    • (2023) Graph Analysis for Scalability Analysis. In Performance Analysis of Parallel Applications for HPC, 101-128. DOI: 10.1007/978-981-99-4366-1_5. Online publication date: 19-Jun-2023.
    • (2020) ScalAna. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1-14. DOI: 10.5555/3433701.3433738. Online publication date: 9-Nov-2020.
    • (2020) SCALANA: Automating Scaling Loss Detection with Graph Analysis. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, 1-14. DOI: 10.1109/SC41405.2020.00032. Online publication date: Nov-2020.
