Extended abstract
DOI: 10.1145/3615979.3662154

Accurate HPC Network Simulations Using Application-Level Approximation

Published: 24 June 2024

    Abstract

    To keep up with the times, supercomputers have to evolve, as do the applications they enable in science, national security, and artificial intelligence. To do so, the best strategy we have is to simulate how changes in their algorithms and architecture design would impact their performance. Yet simulation at high fidelity, which keeps track of all the interlocking parts in detail, itself requires large computing resources; just a couple of milliseconds of simulated time can take hours to run. Fortunately, we can exploit the tendency of massively parallel applications to follow an iterative pattern with clearly defined stages. Every iteration uses roughly the same amount of resources, including network utilization. We can record the resources utilized during a couple of iterations and from them estimate how long it will take to continue the simulation for dozens or hundreds of iterations longer. How long each iteration takes depends heavily on the placement, number of iterations, and kinds of applications running alongside. We implement a strategy to record the state of the network, train a statistical surrogate model to estimate the time each iteration will take, and switch the simulation into a low-fidelity, surrogate-enabled mode. In this mode we have seen speedups close to what the fraction of iterations skipped would predict, i.e., if we skip 80% of iterations, we see a speedup of 5×.
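The arithmetic behind the 80% and 5× figures can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the `surrogate_cost_ratio` parameter, and the use of a plain mean as the "surrogate" are all hypothetical stand-ins. The abstract describes a trained statistical model whose details are not given here.

```python
from statistics import mean


def estimated_speedup(skipped_fraction: float,
                      surrogate_cost_ratio: float = 0.0) -> float:
    """Idealized speedup when a fraction of iterations runs in surrogate mode.

    surrogate_cost_ratio is a hypothetical knob: the cost of one surrogate
    iteration relative to one high-fidelity iteration (0.0 means free).
    """
    if not 0.0 <= skipped_fraction < 1.0:
        raise ValueError("skipped_fraction must be in [0, 1)")
    if surrogate_cost_ratio < 0.0:
        raise ValueError("surrogate_cost_ratio must be non-negative")
    # Work that remains: high-fidelity iterations plus (cheap) surrogate ones.
    remaining = (1.0 - skipped_fraction) + skipped_fraction * surrogate_cost_ratio
    return 1.0 / remaining


def predict_total_time(recorded_times: list[float], total_iterations: int) -> float:
    """Extrapolate total application time from a few recorded iterations.

    Stand-in surrogate (assumption): every skipped iteration takes the mean
    of the recorded iteration times.
    """
    if not recorded_times or total_iterations < len(recorded_times):
        raise ValueError("need at least one recorded iteration")
    per_iteration = mean(recorded_times)
    skipped = total_iterations - len(recorded_times)
    return sum(recorded_times) + per_iteration * skipped


if __name__ == "__main__":
    # Skipping 80% of iterations with a free surrogate gives roughly 5x.
    print(round(estimated_speedup(0.8), 3))
    # A surrogate that costs 5% of a real iteration lowers the gain.
    print(round(estimated_speedup(0.8, surrogate_cost_ratio=0.05), 3))
```

With a non-zero `surrogate_cost_ratio` the speedup falls below the ideal, which is consistent with the abstract reporting gains "close to" the ideal rather than equal to it.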

      Published In

      SIGSIM-PADS '24: Proceedings of the 38th ACM SIGSIM Conference on Principles of Advanced Discrete Simulation
      June 2024
      155 pages
      ISBN: 9798400703638
      DOI: 10.1145/3615979
      Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • Extended-abstract
      • Research
      • Refereed limited

      Conference

      SIGSIM-PADS '24
      Acceptance Rates

      Overall Acceptance Rate 398 of 779 submissions, 51%
