A Hardware Pipeline with High Energy and Resource Efficiency for FMM Acceleration

Published: 30 January 2018

Abstract

The fast multipole method (FMM) is a mathematical technique that accelerates the calculation of long-ranged forces in large-scale n-body problems. Existing implementations of the FMM on general-purpose processors are energy- and resource-inefficient. To mitigate these issues, we propose a hardware pipeline that accelerates three key FMM steps. The pipeline improves energy efficiency by exploiting the fine-grained parallelism of the FMM, and we reuse the same pipeline across the different FMM steps to reduce resource usage by 66%. Compared with state-of-the-art implementations on CPUs and GPUs, our implementation requires 15% less energy and delivers 2.61 times more floating-point operations.
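The paper's contribution is a hardware pipeline, but the mathematical idea the FMM builds on can be illustrated in software. The sketch below (illustrative only; not the paper's implementation) shows the lowest-order far-field approximation at the heart of the FMM: a distant cluster of particles is replaced by a single equivalent mass at its center of mass, turning many pairwise interactions into one.

```python
import math

def direct_potential(target, sources):
    # Exact O(N) sum of 1/r contributions from every source particle.
    return sum(m / math.dist(target, (x, y)) for x, y, m in sources)

def monopole_potential(target, sources):
    # FMM-style far-field approximation: collapse the whole cluster into
    # its total mass placed at the center of mass (lowest-order multipole).
    total = sum(m for _, _, m in sources)
    cx = sum(x * m for x, _, m in sources) / total
    cy = sum(y * m for _, y, m in sources) / total
    return total / math.dist(target, (cx, cy))

# A tight cluster of unit-mass sources far away from the target point.
cluster = [(100.0 + 0.1 * i, 100.0 - 0.05 * i, 1.0) for i in range(10)]
target = (0.0, 0.0)

exact = direct_potential(target, cluster)
approx = monopole_potential(target, cluster)
rel_err = abs(exact - approx) / exact
```

Because the cluster's extent is tiny relative to its distance from the target, the relative error of the monopole term is small; the full FMM refines this with higher-order multipole and local expansions arranged in a spatial tree, which is what makes hierarchical n-body evaluation fast.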



    Published In

ACM Transactions on Embedded Computing Systems, Volume 17, Issue 2
    Special Issue on MEMCODE 2015 and Regular Papers (Diamonds)
    March 2018, 640 pages
    ISSN: 1539-9087
    EISSN: 1558-3465
    DOI: 10.1145/3160927

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 30 January 2018
    Accepted: 01 October 2017
    Revised: 01 February 2017
    Received: 01 September 2015
    Published in TECS Volume 17, Issue 2

    Author Tags

    1. Fast multipole method (FMM)
    2. energy efficiency
    3. field programmable gate arrays (FPGAs)
    4. pipeline
    5. resource efficiency

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Cited By

    • (2024) Solving Multi-connected BVPs with Uncertainly Defined Complex Shapes. Computational Science – ICCS 2024, 147--158. DOI: 10.1007/978-3-031-63751-3_10
    • (2022) Ultra-Fast FPGA Implementation of Graph Cut Algorithm With Ripple Push and Early Termination. IEEE Transactions on Circuits and Systems I: Regular Papers 69, 4, 1532--1545. DOI: 10.1109/TCSI.2021.3137590
    • (2021) Hardware Accelerator Integration Tradeoffs for High-Performance Computing: A Case Study of GEMM Acceleration in N-Body Methods. IEEE Transactions on Parallel and Distributed Systems 32, 8, 2035--2048. DOI: 10.1109/TPDS.2021.3056045
    • (2021) Decrease Iteration Time Deterministic Cyclic Scheduling for Real-time Periodic Tasks. 2021 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom), 1248--1254. DOI: 10.1109/ISPA-BDCloud-SocialCom-SustainCom52081.2021.00172
    • (2021) Accuracy vs. Efficiency: Achieving both Through Hardware-Aware Quantization and Reconfigurable Architecture with Mixed Precision. ISPA/BDCloud/SocialCom/SustainCom 2021, 151--158. DOI: 10.1109/ISPA-BDCloud-SocialCom-SustainCom52081.2021.00033
    • (2020) The fast parametric integral equations system in an acceleration of solving polygonal potential boundary value problems. Advances in Engineering Software 141, 102770. DOI: 10.1016/j.advengsoft.2020.102770
