Energy and performance improvements in stencil computations on multi-node HPC systems with different network and communication topologies

Published: 09 July 2024

Abstract

Energy and performance improvements in stencil computations are relevant to both application developers and data center administrators, as stencils form the fundamental computational scheme in many large-scale scientific simulations and workloads. Many research efforts have focused on techniques for estimating the energy usage of HPC systems based on specific characteristics of parallel applications. In the case of stencils, we have previously concentrated on detailed estimation of energy consumption and on the energy-aware distribution of stencil computations across heterogeneous processors. However, those comprehensive studies were restricted to a single heterogeneous computing node. In this paper, we show how scheduling and optimization techniques can be applied to improve the energy usage and performance of stencil computations on multi-node HPC systems with different network topologies. We formulate a scheduling model together with a new Tabu Search algorithm, called Task Movement (TM), which takes communication hierarchies into account to minimize the overall energy usage and execution time of stencil computations. Experimental studies show that this algorithm solves the considered problem more efficiently than other, simpler heuristics. We present computational experiments for a reference 7-point stencil computation pattern on three commonly used low-diameter network topologies: Fat-tree, Dragonfly, and Torus. According to our studies, the most promising multi-node HPC architecture for stencil computations is based on the Torus network concept. Finally, we argue that the proposed scheduling model and TM algorithm can be easily adopted within existing high-level parallel execution environments for automatic performance tuning of stencils.
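For readers new to the pattern, the 7-point stencil studied here updates every cell of a 3-D grid from its own value and its six face neighbors; each sweep over the grid is what gets distributed across nodes. Below is a minimal illustrative sketch in Python/NumPy, not the authors' code: the uniform 1/7 weights and the Jacobi-style sweep are assumptions made only to show the memory-access pattern.

    import numpy as np

    def stencil_7pt(grid):
        # One Jacobi-style sweep: each interior cell becomes the average of
        # itself and its six face neighbors (+/-x, +/-y, +/-z).
        out = grid.copy()
        out[1:-1, 1:-1, 1:-1] = (
            grid[1:-1, 1:-1, 1:-1]
            + grid[2:, 1:-1, 1:-1] + grid[:-2, 1:-1, 1:-1]
            + grid[1:-1, 2:, 1:-1] + grid[1:-1, :-2, 1:-1]
            + grid[1:-1, 1:-1, 2:] + grid[1:-1, 1:-1, :-2]
        ) / 7.0
        return out

When the grid is partitioned across nodes, each sweep requires exchanging one-cell-thick halo faces with up to six neighboring subdomains, which is why network topology and task placement affect both runtime and energy.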

Highlights

Discussion of multi-node communication topologies for stencil computations.
Modeling energy usage for a stencil pattern on supercomputer architectures.
Formulation of a topology-aware scheduling model for heterogeneous processors.
Presentation of a Tabu Search algorithm for solving the problem (see the sketch after this list).
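To make the Tabu Search highlight concrete, here is a minimal, self-contained sketch of a task-movement neighborhood search over task-to-node placements. It illustrates the general scheme only and is not the paper's TM algorithm: the cost callable, tabu tenure, and randomized move generation are placeholder assumptions, whereas the actual TM algorithm additionally models communication hierarchies and energy usage.

    import random
    from collections import deque

    def tabu_search_placement(num_tasks, num_nodes, cost,
                              iters=500, tenure=10, seed=0):
        # cost: maps a placement (one node id per task) to a scalar,
        # e.g. a weighted sum of estimated execution time and energy.
        rng = random.Random(seed)
        placement = [rng.randrange(num_nodes) for _ in range(num_tasks)]
        best, best_cost = placement[:], cost(placement)
        tabu = deque(maxlen=tenure)  # recently moved tasks are forbidden
        for _ in range(iters):
            # Neighborhood: move one non-tabu task to a different node.
            moves = []
            for t in range(num_tasks):
                if t in tabu:
                    continue
                n = rng.randrange(num_nodes)
                if n != placement[t]:
                    trial = placement[:]
                    trial[t] = n
                    moves.append((cost(trial), t, trial))
            if not moves:
                continue
            # Take the best admissible move, even if it worsens the cost.
            c, t, placement = min(moves, key=lambda m: m[0])
            tabu.append(t)
            if c < best_cost:  # keep the best placement seen so far
                best, best_cost = placement[:], c
        return best, best_cost

A topology-aware cost function would, for instance, charge halo exchanges between neighboring stencil tasks in proportion to the hop distance between their nodes in the chosen Fat-tree, Dragonfly, or Torus topology.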

Published In

Future Generation Computer Systems, Volume 115, Issue C (Feb 2021), 880 pages

Publisher

Elsevier Science Publishers B. V., Netherlands

Author Tags

  1. Stencil computations
  2. Performance analysis
  3. Topology-aware scheduling
  4. Energy modeling
  5. GPUs
  6. HPC

Qualifiers

  • Research-article
