Optimizing Wilson-Dirac Operator and Linear Solvers for Intel® KNL

Joó, Bálint; Kalamkar, Dhiraj D.; Kurth, Thorsten; Vaidyanathan, Karthikeyan; Walden, Aaron

doi:10.1007/978-3-319-46079-6_30


Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9945)

Included in the following conference series:

International Conference on High Performance Computing


Abstract

Lattice Quantum Chromodynamics (QCD) is a powerful tool for numerically accessing the low-energy regime of QCD in a straightforward way with quantifiable uncertainties. In this approach, QCD is discretized on a four-dimensional Euclidean space-time grid with millions of degrees of freedom. In modern lattice calculations, most of the work is still spent solving large, sparse linear systems. This task poses two challenges: optimizing the sparse matrix application and optimizing the BLAS-like kernels used in the linear solver. We present performance optimizations of the Dirac operator (dslash), with and without the clover term, for recent Intel® architectures, namely Haswell and Knights Landing (KNL). We were able to achieve a good fraction of peak performance for the Wilson-Dslash kernel and for the Conjugate Gradients and Stabilized BiConjugate Gradients solvers. We also present a series of experiments performed on KNL: running MCDRAM in different modes, enabling or disabling hardware prefetching, and using different SoA lengths. Furthermore, we present a weak scaling study on up to 16 KNL nodes.
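To illustrate the two cost centers named in the abstract (the sparse operator application and the BLAS-like vector kernels), here is a minimal Conjugate Gradients sketch in plain Python. It is not the authors' implementation: the function names (`conjugate_gradient`, `apply_laplacian_1d`, `dot`, `axpy`) are hypothetical, and a shifted one-dimensional periodic Laplacian stands in for the Dirac operator purely as an assumption, chosen because it is symmetric positive definite and stencil-like.

```python
# Minimal Conjugate Gradients sketch (illustrative only, not the paper's code).
# The operator is passed as a function, mirroring a matrix-free stencil
# application; the vector updates are the "BLAS-like" kernels.

def dot(x, y):
    """Inner product of two real vectors."""
    return sum(a * b for a, b in zip(x, y))


def axpy(alpha, x, y):
    """Return alpha * x + y as a new list."""
    return [alpha * a + b for a, b in zip(x, y)]


def conjugate_gradient(apply_A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A.

    Returns (x, iterations). Stops when ||r|| <= tol * ||b||.
    """
    x = [0.0] * len(b)
    r = list(b)              # residual b - A x for the initial guess x = 0
    p = list(r)              # first search direction
    rr = dot(r, r)
    b_norm2 = dot(b, b)
    if b_norm2 == 0.0:
        return x, 0
    for it in range(1, max_iter + 1):
        Ap = apply_A(p)                   # operator application (stencil)
        alpha = rr / dot(p, Ap)
        x = axpy(alpha, p, x)             # x <- x + alpha p
        r = axpy(-alpha, Ap, r)           # r <- r - alpha A p
        rr_new = dot(r, r)
        if rr_new <= tol * tol * b_norm2:
            return x, it
        p = axpy(rr_new / rr, p, r)       # p <- r + beta p
        rr = rr_new
    return x, max_iter


def apply_laplacian_1d(v, shift=0.1):
    """Hypothetical stand-in operator: shifted periodic 1D Laplacian stencil."""
    n = len(v)
    return [(2.0 + shift) * v[i] - v[(i - 1) % n] - v[(i + 1) % n]
            for i in range(n)]


if __name__ == "__main__":
    rhs = [float(i % 3) for i in range(16)]
    solution, iterations = conjugate_gradient(apply_laplacian_1d, rhs)
    print("iterations:", iterations)
```

Each iteration performs one operator application and a handful of vector updates and reductions, which is why both kinds of kernel need attention when tuning a solver of this type.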

B. Joó—Notice: Authored by Jefferson Science Associates, LLC under U.S. DOE Contract No. DE-AC05-06OR23177. The U.S. Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce this manuscript for U.S. Government purposes.


Notes

1. Due to the page limitation, we cannot include numbers from SNC studies in this presentation.
2. We have also tested a beta of ICC 2017, but we found no significant differences in performance.
3. Similar measurements for DDR yield a similar result, i.e. $\sim 70\,\mathrm{GB/s}$, which corresponds to about $77\,\%$ of DDR bandwidth peak performance. Remarkably, this value is lower than the computed effective bandwidth for the Dslash kernel.


Acknowledgments

The majority of this work was carried out during a NERSC Exa-scale Scientific Application Program (NESAP) deep dive known as a Dungeon Session at the offices of Intel in Portland, Oregon. We thank NERSC and Intel for organizing this session. We would also like to thank Jack Deslippe for insightful discussions. Performance results were measured on the Intel Endeavor cluster and on the Cori Phase I system at NERSC, with additional development work carried out on systems at Jefferson Lab. B. Joó gratefully acknowledges funding from the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Office of Nuclear Physics and Office of High Energy Physics under the SciDAC program (USQCD). This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Nuclear Physics under contract DE-AC05-06OR23177. The National Energy Research Scientific Computing Center (NERSC) is a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.

Author information

Authors and Affiliations

US DOE Jefferson Lab, Newport News, VA, USA
Bálint Joó
Intel Parallel Computing Labs, Bangalore, India
Dhiraj D. Kalamkar & Karthikeyan Vaidyanathan
National Energy Research Scientific Computing Center, Berkeley, CA, USA
Thorsten Kurth & Aaron Walden
Old Dominion University, Norfolk, VA, USA
Aaron Walden


Corresponding authors

Correspondence to Bálint Joó, Thorsten Kurth or Karthikeyan Vaidyanathan.

Editor information

Editors and Affiliations

University of Delaware, Newark, Delaware, USA
Michela Taufer
Forschungszentrum Jülich, Jülich, Germany
Bernd Mohr
DKRZ, Hamburg, Germany
Julian M. Kunkel

About this paper

Cite this paper

Joó, B., Kalamkar, D.D., Kurth, T., Vaidyanathan, K., Walden, A. (2016). Optimizing Wilson-Dirac Operator and Linear Solvers for Intel® KNL. In: Taufer, M., Mohr, B., Kunkel, J. (eds) High Performance Computing. ISC High Performance 2016. Lecture Notes in Computer Science, vol 9945. Springer, Cham. https://doi.org/10.1007/978-3-319-46079-6_30

DOI: https://doi.org/10.1007/978-3-319-46079-6_30
Published: 06 October 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-46078-9
Online ISBN: 978-3-319-46079-6
eBook Packages: Computer Science, Computer Science (R0)
