Lattice QCD with domain decomposition on Intel® Xeon Phi™ co-processors

Authors:

Simon Heybrock,

Dhiraj D. Kalamkar,

Mikhail Smelyanskiy,

Karthikeyan Vaidyanathan,

Pradeep Dubey

SC '14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

Pages 69 - 80

https://doi.org/10.1109/SC.2014.11

Published: 16 November 2014

Abstract

The gap between the cost of moving data and the cost of computing continues to grow, making it ever harder to design iterative solvers on extreme-scale architectures. This problem can be alleviated by alternative algorithms that reduce the amount of data movement. We investigate this in the context of Lattice Quantum Chromodynamics and implement such an alternative solver algorithm, based on domain decomposition, on Intel® Xeon Phi™ co-processor (KNC) clusters. We demonstrate close-to-linear on-chip scaling to all 60 cores of the KNC. With a mix of single- and half-precision the domain-decomposition method sustains 400-500 Gflop/s per chip. Compared to an optimized KNC implementation of a standard solver [1], our full multi-node domain-decomposition solver strong-scales to more nodes and reduces the time-to-solution by a factor of 5.
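
As a rough illustration of the domain-decomposition idea in the abstract, the following sketch applies a non-overlapping multiplicative Schwarz iteration (alternating even- and odd-numbered blocks) to a toy one-dimensional operator. It is not the paper's implementation: the operator, the `mass` shift, the block size, and all function names are assumptions chosen for illustration. The blocks are solved exactly, the method is run as a stand-alone iteration rather than as a preconditioner inside an outer solver, and the mixed single/half precision mentioned above is omitted. What it is meant to show is that each block update reads only block-local data plus one neighbouring site per side, which is how such methods can trade extra local work for less data movement.

```python
import math


def residual_norm(x, b, mass):
    """Euclidean norm of b - A x for the toy operator A = tridiag(-1, 2 + mass, -1)."""
    n = len(x)
    total = 0.0
    for i in range(n):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        r = b[i] - ((2.0 + mass) * x[i] - left - right)
        total += r * r
    return math.sqrt(total)


def solve_block(rhs, mass):
    """Solve the block-restricted tridiagonal system exactly (Thomas algorithm)."""
    m = len(rhs)
    diag = 2.0 + mass
    c = [0.0] * m  # modified super-diagonal coefficients
    d = [0.0] * m  # modified right-hand side
    c[0] = -1.0 / diag
    d[0] = rhs[0] / diag
    for i in range(1, m):
        denom = diag + c[i - 1]  # diag - (sub-diagonal = -1) * c[i - 1]
        c[i] = -1.0 / denom
        d[i] = (rhs[i] + d[i - 1]) / denom
    sol = [0.0] * m
    sol[m - 1] = d[m - 1]
    for i in range(m - 2, -1, -1):
        sol[i] = d[i] - c[i] * sol[i + 1]
    return sol


def schwarz_sweep(x, b, mass, block_size):
    """One multiplicative Schwarz sweep: even-numbered blocks first, then odd ones."""
    n = len(x)
    starts = range(0, n, block_size)
    for color in (0, 1):
        for k, start in enumerate(starts):
            if k % 2 != color:
                continue
            stop = min(start + block_size, n)
            # Local residual: needs x inside the block and one site on each side.
            local_r = []
            for i in range(start, stop):
                left = x[i - 1] if i > 0 else 0.0
                right = x[i + 1] if i < n - 1 else 0.0
                local_r.append(b[i] - ((2.0 + mass) * x[i] - left - right))
            correction = solve_block(local_r, mass)
            for i in range(start, stop):
                x[i] += correction[i - start]


def schwarz_solve(b, mass=0.1, block_size=8, tol=1e-10, max_sweeps=200):
    """Repeat sweeps until the relative residual drops below tol (hypothetical driver)."""
    x = [0.0] * len(b)
    b_norm = math.sqrt(sum(v * v for v in b)) or 1.0
    rel = residual_norm(x, b, mass) / b_norm
    for sweep in range(1, max_sweeps + 1):
        schwarz_sweep(x, b, mass, block_size)
        rel = residual_norm(x, b, mass) / b_norm
        if rel < tol:
            return x, sweep, rel
    return x, max_sweeps, rel


if __name__ == "__main__":
    solution, sweeps, rel = schwarz_solve([1.0] * 32)
    print("sweeps:", sweeps, "relative residual:", rel)
```

In this toy setting, blocks of the same parity do not couple to each other, so all even-numbered (or all odd-numbered) blocks could be updated concurrently. Larger blocks need fewer sweeps but more local work per block solve.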

References

[1]

B. Joó, D. D. Kalamkar, K. Vaidyanathan, M. Smelyanskiy, K. Pamnany, V. W. Lee, P. Dubey, and W. Watson III, "Lattice QCD on Intel Xeon Phi Coprocessors," in Supercomputing, ser. Lecture Notes in Computer Science, J. M. Kunkel, T. Ludwig, and H. W. Meuer, Eds. Springer Berlin Heidelberg, 2013, vol. 7905, pp. 40--54. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-38750-0_4

[2]

K. Bergman et al., "ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems," P. Kogge, Ed. and Study Lead, 2008.

[3]

K. Vaidyanathan, K. Pamnany, D. D. Kalamkar, A. Heinecke, M. Smelyanskiy, J. Park, D. Kim, A. Shet, B. Kaul, B. Joo, and P. Dubey, "Improving Communication Performance and Scalability of Native Applications on Intel Xeon Phi Coprocessor Clusters," in IPDPS 2014 (28th IEEE International Parallel & Distributed Processing Symposium). to be published.

[4]

A. Nguyen, N. Satish, J. Chhugani, C. Kim, and P. Dubey, "3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs," in Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '10. Washington, DC, USA: IEEE Computer Society, 2010, pp. 1--13. [Online]. Available: http://dx.doi.org/10.1109/SC.2010.2

[5]

K. Wilson, "Quarks and strings on a lattice," in New Phenomena in Subnuclear Physics, ser. The Subnuclear Series, A. Zichichi, Ed. Springer US, 1977, vol. 13, pp. 69--142. [Online]. Available: http://dx.doi.org/10.1007/978-1-4613-4208-3_6

[6]

B. Sheikholeslami and R. Wohlert, "Improved Continuum Limit Lattice Action for QCD with Wilson Fermions," Nucl. Phys., vol. B259, p. 572, 1985.

[7]

M. R. Hestenes and E. Stiefel, "Methods of conjugate gradients for solving linear systems," Journal of research of the National Bureau of Standards, vol. 49, pp. 409--436, 1952.

[8]

Y. Saad and M. Schultz, "GMRES: A Generalized Minimal Residual Algorithm for Solving Nonsymmetric Linear Systems," SIAM Journal on Scientific and Statistical Computing, vol. 7, no. 3, pp. 856--869, 1986. [Online]. Available: http://epubs.siam.org/doi/abs/10.1137/0907058

[9]

H. A. van der Vorst, "BI-CGSTAB: A Fast and Smoothly Converging Variant of BI-CG for the Solution of Nonsymmetric Linear Systems," SIAM J. Sci. Stat. Comput., vol. 13, no. 2, pp. 631--644, Mar. 1992. [Online]. Available: http://dx.doi.org/10.1137/0913035

[10]

A. Frommer, A. Nobile, and P. Zingler, "Deflation and Flexible SAP-Preconditioning of GMRES in Lattice QCD Simulation," 2012. [Online]. Available: http://arxiv.org/abs/1204.5463

[11]

H. A. Schwarz, "Über einen Grenzübergang durch alternierendes Verfahren," in Vierteljahrsschrift der Naturforschenden Gesellschaft in Zürich, 1870, vol. 15, pp. 272--286.

[12]

M. Lüscher, "Solution of the Dirac equation in lattice QCD using a domain decomposition method," Comput. Phys. Commun., vol. 156, pp. 209--220, 2004.

[13]

Y. Saad, Iterative Methods for Sparse Linear Systems, 2nd ed. Society for Industrial and Applied Mathematics, 2003.

[14]

M. Lüscher, "Computational strategies in lattice QCD," in Modern Perspectives in Lattice QCD: Quantum Field Theory and High Performance Computing, L. Lellouch, R. Sommer, B. Svetitsky, A. Vladikas, and L. F. Cugliandolo, Eds., vol. Les Houches 2009, Session XCIII, 2009, pp. 331--399.

[15]

M. Lüscher, "Schwarz-preconditioned HMC algorithm for two-flavour lattice QCD," Comput. Phys. Commun., vol. 165, pp. 199--220, 2005.

[16]

G. Bali, P. Bruns, S. Collins, M. Deka, B. Gläßle et al., "Nucleon mass and sigma term from lattice QCD with two light fermion flavors," Nucl. Phys., vol. B866, pp. 1--25, 2013.

[17]

S. Dürr, Z. Fodor, C. Hölbling, R. Hoffmann, S. Katz et al., "Scaling study of dynamical smeared-link clover fermions," Phys. Rev., vol. D79, p. 014501, 2009.

[18]

S. Duane, A. D. Kennedy, B. Pendleton, and D. Roweth, "Hybrid Monte Carlo," Phys. Lett., vol. B195, pp. 216--222, 1987.

[19]

R. G. Edwards and B. Joo, "The Chroma software system for lattice QCD," Nucl. Phys. Proc. Suppl., vol. 140, p. 832, 2005.

[20]

R. Babich, J. Brannick, R. Brower, M. Clark, T. Manteuffel et al., "Adaptive multigrid algorithm for the lattice Wilson-Dirac operator," Phys. Rev. Lett., vol. 105, p. 201602, 2010.

[21]

J. Osborn, R. Babich, J. Brannick, R. Brower, M. Clark et al., "Multigrid solver for clover fermions," PoS, vol. LATTICE2010, p. 037, 2010.

[22]

A. Frommer, K. Kahl, S. Krieg, B. Leder, and M. Rottmann, "Adaptive Aggregation Based Domain Decomposition Multigrid for the Lattice Wilson Dirac Operator," 2013. [Online]. Available: http://arxiv.org/abs/1303.1377

[23]

A. Frommer, K. Kahl, S. Krieg, B. Leder, and M. Rottmann, "An Adaptive Aggregation Based Domain Decomposition Multilevel Method for the Lattice Wilson Dirac Operator: Multilevel Results," 2013. [Online]. Available: http://arxiv.org/abs/1307.6101

[24]

M. Lüscher, "Local coherence and deflation of the low quark modes in lattice QCD," JHEP, vol. 0707, p. 081, 2007.

[25]

R. Babich, M. A. Clark, B. Joó, G. Shi, R. C. Brower, and S. Gottlieb, "Scaling Lattice QCD Beyond 100 GPUs," in Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '11. New York, NY, USA: ACM, 2011, pp. 70:1--70:11. [Online]. Available: http://doi.acm.org/10.1145/2063384.2063478

[26]

Y. Osaki and K.-I. Ishikawa, "Domain Decomposition method on GPU cluster," PoS, vol. LATTICE2010, p. 036, 2010.

Published In

SC '14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

November 2014

1054 pages

ISBN:9781479955008

General Chair: Trish Damkroger, Lawrence Livermore National Laboratory, Livermore, California

Program Chair: Jack Dongarra, University of Tennessee, Knoxville, Tennessee

Publisher

IEEE Press

Conference

SC '14

Sponsor:

SIGHPC
SIGARCH
IEEE-CS

SC '14: International Conference for High Performance Computing, Networking, Storage and Analysis

November 16 - 21, 2014

New Orleans, Louisiana

Acceptance Rates

SC '14 Paper Acceptance Rate 83 of 394 submissions, 21%;

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
