A Hardware Pipeline with High Energy and Resource Efficiency for FMM Acceleration

Published: 30 January 2018

Abstract

The fast multipole method (FMM) is a mathematical technique that accelerates the calculation of long-ranged forces in large-scale n-body problems. Existing implementations of the FMM on general-purpose processors are energy- and resource-inefficient. To mitigate these issues, we propose a hardware pipeline that accelerates three key FMM steps. The pipeline improves energy efficiency by exploiting the fine-grained parallelism of the FMM, and we reuse the same pipeline across the different FMM steps to reduce resource usage by 66%. Compared with state-of-the-art implementations on CPUs and GPUs, our implementation requires 15% less energy and delivers 2.61 times more floating-point operations.
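The paper's contribution is a hardware pipeline, but the mathematical idea the FMM builds on can be illustrated in software. The sketch below (illustrative only; not the paper's implementation) shows the lowest-order far-field approximation at the heart of the FMM: a distant cluster of particles is replaced by a single equivalent mass at its center of mass, turning many pairwise interactions into one.

```python
import math

def direct_potential(target, sources):
    # Exact O(N) sum of 1/r contributions from every source particle.
    return sum(m / math.dist(target, (x, y)) for x, y, m in sources)

def monopole_potential(target, sources):
    # FMM-style far-field approximation: collapse the whole cluster into
    # its total mass placed at the center of mass (lowest-order multipole).
    total = sum(m for _, _, m in sources)
    cx = sum(x * m for x, _, m in sources) / total
    cy = sum(y * m for _, y, m in sources) / total
    return total / math.dist(target, (cx, cy))

# A tight cluster of unit-mass sources far away from the target point.
cluster = [(100.0 + 0.1 * i, 100.0 - 0.05 * i, 1.0) for i in range(10)]
target = (0.0, 0.0)

exact = direct_potential(target, cluster)
approx = monopole_potential(target, cluster)
rel_err = abs(exact - approx) / exact
```

Because the cluster's extent is tiny relative to its distance from the target, the relative error of the monopole term is small; the full FMM refines this with higher-order multipole and local expansions arranged in a spatial tree, which is what makes hierarchical n-body evaluation fast.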



    Published In

ACM Transactions on Embedded Computing Systems, Volume 17, Issue 2
    Special Issue on MEMCODE 2015 and Regular Papers (Diamonds)
    March 2018, 640 pages
    ISSN: 1539-9087
    EISSN: 1558-3465
    DOI: 10.1145/3160927

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 30 January 2018
    Accepted: 01 October 2017
    Revised: 01 February 2017
    Received: 01 September 2015
    Published in TECS Volume 17, Issue 2

    Author Tags

    1. Fast multipole method (FMM)
    2. energy efficiency
    3. field programmable gate arrays (FPGAs)
    4. pipeline
    5. resource efficiency

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Cited By

    • (2024) Solving Multi-connected BVPs with Uncertainly Defined Complex Shapes. Computational Science – ICCS 2024, 147--158. DOI: 10.1007/978-3-031-63751-3_10
    • (2022) Ultra-Fast FPGA Implementation of Graph Cut Algorithm With Ripple Push and Early Termination. IEEE Transactions on Circuits and Systems I: Regular Papers 69, 4, 1532--1545. DOI: 10.1109/TCSI.2021.3137590
    • (2021) Hardware Accelerator Integration Tradeoffs for High-Performance Computing: A Case Study of GEMM Acceleration in N-Body Methods. IEEE Transactions on Parallel and Distributed Systems 32, 8, 2035--2048. DOI: 10.1109/TPDS.2021.3056045
    • (2021) Decrease Iteration Time Deterministic Cyclic Scheduling for Real-time Periodic Tasks. 2021 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom), 1248--1254. DOI: 10.1109/ISPA-BDCloud-SocialCom-SustainCom52081.2021.00172
    • (2021) Accuracy vs. Efficiency: Achieving both Through Hardware-Aware Quantization and Reconfigurable Architecture with Mixed Precision. ISPA/BDCloud/SocialCom/SustainCom 2021, 151--158. DOI: 10.1109/ISPA-BDCloud-SocialCom-SustainCom52081.2021.00033
    • (2020) The fast parametric integral equations system in an acceleration of solving polygonal potential boundary value problems. Advances in Engineering Software 141, 102770. DOI: 10.1016/j.advengsoft.2020.102770
