Lattice QCD on Intel® Xeon Phi™ Coprocessors

Joó, Bálint; Kalamkar, Dhiraj D.; Vaidyanathan, Karthikeyan; Smelyanskiy, Mikhail; Pamnany, Kiran; Lee, Victor W.; Dubey, Pradeep; Watson, William

doi:10.1007/978-3-642-38750-0_4

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7905)

Included in the following conference series:

International Supercomputing Conference

Abstract

Lattice Quantum Chromodynamics (LQCD) is currently the only known model-independent, non-perturbative computational method for calculations in the theory of the strong interactions, and is of importance in studies of nuclear and high-energy physics. LQCD codes use large fractions of supercomputing cycles worldwide and are often among the first to be ported to new high-performance computing architectures. The recently released Intel Xeon Phi architecture from Intel Corporation features parallelism at the level of many x86-based cores, multiple threads per core, and vector processing units. In this contribution, we describe our experiences with optimizing a key LQCD kernel for the Xeon Phi architecture. On a single node, using single precision, our Dslash kernel sustains a performance of up to 320 GFLOPS, while our Conjugate Gradients solver sustains up to 237 GFLOPS. Furthermore, we demonstrate a fully 'native' multi-node LQCD implementation running entirely on Knights Corner (KNC) nodes with minimal involvement of the host CPU. Our multi-node implementation of the solver has been strong-scaled to 3.9 TFLOPS on 32 KNCs.
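
As a rough reading of the solver figures quoted in the abstract, the sketch below (not taken from the paper) converts the 32-node aggregate into a per-node rate and compares it with the single-node rate. It assumes the two runs are directly comparable, which the abstract does not state: both numbers are "up to" figures and may come from different problem sizes. The function names are illustrative only.

```python
# Figures quoted in the abstract (single precision).
SINGLE_NODE_CG_GFLOPS = 237.0  # Conjugate Gradients solver on one node
MULTI_NODE_CG_TFLOPS = 3.9     # solver strong-scaled across 32 KNCs
NUM_NODES = 32


def per_node_gflops(total_tflops: float, nodes: int) -> float:
    """Average sustained GFLOPS per node for an aggregate TFLOPS figure."""
    return total_tflops * 1000.0 / nodes


def relative_efficiency(total_tflops: float, nodes: int,
                        single_node_gflops: float) -> float:
    """Per-node rate at scale divided by the single-node rate.

    ASSUMPTION: the single-node and multi-node measurements are
    comparable; the abstract alone does not confirm this.
    """
    return per_node_gflops(total_tflops, nodes) / single_node_gflops


if __name__ == "__main__":
    rate = per_node_gflops(MULTI_NODE_CG_TFLOPS, NUM_NODES)
    eff = relative_efficiency(MULTI_NODE_CG_TFLOPS, NUM_NODES,
                              SINGLE_NODE_CG_GFLOPS)
    print(f"per-node rate at {NUM_NODES} nodes: {rate:.1f} GFLOPS")
    print(f"relative to single node: {eff:.0%}")
```

Under that assumption, each KNC sustains roughly 122 GFLOPS in the 32-node run, about half of the best single-node solver rate, which is the kind of drop one expects when a fixed problem is strong-scaled and communication takes a larger share of the run time.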

Author information

Authors and Affiliations

Thomas Jefferson National Accelerator Facility, Newport News, VA, USA
Bálint Joó & William Watson III
Parallel Computing Lab., Intel Corporation, Bangalore, India
Dhiraj D. Kalamkar, Karthikeyan Vaidyanathan & Kiran Pamnany
Parallel Computing Lab., Intel Corporation, Santa Clara, CA, USA
Mikhail Smelyanskiy, Victor W. Lee & Pradeep Dubey

Editor information

Editors and Affiliations

University of Hamburg, Department of Informatics, Bundesstraße 45a, 20146, Hamburg, Germany
Julian Martin Kunkel
Deutsches Klimarechenzentrum, Bundesstraße 45a, 20146, Hamburg, Germany
Thomas Ludwig
University of Mannheim, Germany and Prometeus GmbH, Fliederstraße 2, 74915, Waibstadt, Germany
Hans Werner Meuer

Cite this paper

Joó, B. et al. (2013). Lattice QCD on Intel® Xeon Phi™ Coprocessors. In: Kunkel, J.M., Ludwig, T., Meuer, H.W. (eds) Supercomputing. ISC 2013. Lecture Notes in Computer Science, vol 7905. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38750-0_4

DOI: https://doi.org/10.1007/978-3-642-38750-0_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-38749-4
Online ISBN: 978-3-642-38750-0
eBook Packages: Computer Science, Computer Science (R0)
