
Debugging high-performance computing applications at massive scales

Published: 24 August 2015

Abstract

Dynamic analysis techniques help programmers find the root cause of bugs in large-scale parallel applications.


Published In

Communications of the ACM, Volume 58, Issue 9
September 2015
119 pages
ISSN:0001-0782
EISSN:1557-7317
DOI:10.1145/2817191
  • Editor: Moshe Y. Vardi
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

  • Research-article
  • Popular
  • Refereed

Funding Sources

  • National Science Foundation
  • U.S. Department of Energy by Lawrence Livermore National Laboratory
