Abstract
Lattice Quantum Chromodynamics (QCD) is a powerful tool for numerically accessing the low-energy regime of QCD in a straightforward way with quantifiable uncertainties. In this approach, QCD is discretized on a four-dimensional Euclidean space-time grid with millions of degrees of freedom. In modern lattice calculations, most of the time is still spent solving large, sparse linear systems. This task poses two challenges: optimizing the sparse matrix application and optimizing the BLAS-like kernels used in the linear solver. We present performance optimizations of the Dirac operator (dslash), with and without the clover term, for recent Intel® architectures, namely Haswell and Knights Landing (KNL). We were able to achieve a good fraction of peak performance for the Wilson-Dslash kernel as well as for the Conjugate Gradient and Stabilized Bi-Conjugate Gradient solvers. We also present a series of experiments performed on KNL, including running MCDRAM in different modes, enabling or disabling hardware prefetching, and using different structure-of-arrays (SoA) lengths. Furthermore, we present a weak scaling study on up to 16 KNL nodes.
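To make the structure of such a solver concrete, the following is a minimal sketch of a matrix-free Conjugate Gradient iteration in plain Python. It is not the implementation described in this paper: the operator `apply_op` is a hypothetical stand-in (a one-dimensional periodic Laplacian with a mass shift, chosen because it is symmetric positive definite), and the helper names `dot` and `axpy` are illustrative. The point is only to show the two ingredients named above, a sparse stencil application and a few BLAS-like vector kernels.

```python
import math


def apply_op(x, mass=0.5):
    """Hypothetical stand-in operator, applied matrix-free as a stencil.

    Computes y_i = (2 + mass) * x_i - x_{i-1} - x_{i+1} with periodic
    boundaries. This is NOT the Wilson-Dirac operator; it is only a small
    symmetric positive definite example.
    """
    n = len(x)
    return [(2.0 + mass) * x[i] - x[(i - 1) % n] - x[(i + 1) % n]
            for i in range(n)]


def dot(a, b):
    """BLAS-like inner product."""
    return sum(ai * bi for ai, bi in zip(a, b))


def axpy(alpha, x, y):
    """BLAS-like update: returns alpha * x + y."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def conjugate_gradient(apply_a, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite operator A.

    Returns the approximate solution and the number of iterations used.
    """
    x = [0.0] * len(b)
    r = list(b)            # residual b - A x for the initial guess x = 0
    p = list(r)            # first search direction
    rr = dot(r, r)
    b_norm = math.sqrt(dot(b, b))
    if b_norm == 0.0:
        return x, 0
    for it in range(1, max_iter + 1):
        ap = apply_a(p)                 # one operator application per iteration
        alpha = rr / dot(p, ap)
        x = axpy(alpha, p, x)           # x <- x + alpha * p
        r = axpy(-alpha, ap, r)         # r <- r - alpha * A p
        rr_new = dot(r, r)
        if math.sqrt(rr_new) <= tol * b_norm:
            return x, it
        p = axpy(rr_new / rr, p, r)     # p <- r + beta * p
        rr = rr_new
    return x, max_iter


# Usage: solve on a 16-site periodic lattice with an arbitrary right-hand side.
b = [float(i % 3) for i in range(16)]
x, iters = conjugate_gradient(apply_op, b)
```

Each iteration costs one operator application plus a handful of vector updates and reductions, which is why both the stencil kernel and the BLAS-like kernels matter for overall solver performance.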
B. Joó—Notice: Authored by Jefferson Science Associates, LLC under U.S. DOE Contract No. DE-AC05-06OR23177. The U.S. Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce this manuscript for U.S. Government purposes.
Notes
1. Due to the page limit, we cannot include numbers from the SNC studies in this presentation.
2. We also tested a beta version of ICC 2017 but found no significant differences in performance.
3. Similar measurements for DDR yield a similar result, i.e. \({\sim }70\,\mathrm {GB/s}\), which corresponds to about \(77\,\%\) of the DDR peak bandwidth. Remarkably, this value is lower than the computed effective bandwidth for the Dslash kernel.
Acknowledgments
The majority of this work was carried out during a NERSC Exa-scale Scientific Application Program (NESAP) deep dive known as a Dungeon Session at the offices of Intel in Portland, Oregon. We thank NERSC and Intel for organizing this session. We would also like to thank Jack Deslippe for insightful discussions. Performance results were measured on the Intel Endeavor cluster and on the Cori Phase I system at NERSC, with additional development work carried out on systems at Jefferson Lab. B. Joó gratefully acknowledges funding from the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Office of Nuclear Physics and Office of High Energy Physics under the SciDAC program (USQCD). This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Nuclear Physics under contract DE-AC05-06OR23177. The National Energy Research Scientific Computing Center (NERSC) is a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.
Copyright information
© 2016 Springer International Publishing AG
About this paper
Cite this paper
Joó, B., Kalamkar, D.D., Kurth, T., Vaidyanathan, K., Walden, A. (2016). Optimizing Wilson-Dirac Operator and Linear Solvers for Intel® KNL. In: Taufer, M., Mohr, B., Kunkel, J. (eds) High Performance Computing. ISC High Performance 2016. Lecture Notes in Computer Science, vol 9945. Springer, Cham. https://doi.org/10.1007/978-3-319-46079-6_30
DOI: https://doi.org/10.1007/978-3-319-46079-6_30
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-46078-9
Online ISBN: 978-3-319-46079-6
eBook Packages: Computer Science (R0)