DOI: 10.1145/3447818.3460380

Topology-aware optimizations for multi-GPU ptychographic image reconstruction

Published: 04 June 2021

    Abstract

    Ptychography is an advanced high-resolution X-ray imaging technique that can generate extremely large datasets. Ptychographic reconstruction transforms reciprocal-space experimental data into high-resolution 2D real-space images. GPUs have been used extensively to meet the computational requirements of the reconstruction. Generic multi-GPU reconstruction solutions establish inter-GPU communication through common communication topologies, such as the P2P graph and ring topologies provided by the MPI and NCCL libraries. However, these common topologies assume homogeneous physical links between GPUs, resulting in sub-optimal performance on heterogeneous configurations that combine high-speed (e.g., NVLink) and low-speed (e.g., PCIe) interconnects. This mismatch between the application-level communication topology and the physical interconnect can cause data transfer congestion, inefficient memory access, and under-utilization of network resources. Here we present topology-aware designs and optimizations that address this mismatch and boost end-to-end application performance. We introduce topology-aware data splitting, propose a novel communication topology, and incorporate asynchronous data movement and computation. We evaluate our design and optimizations using real and artificial datasets and compare their performance with that of the direct P2P and NCCL-based approaches. The results show that our optimizations always outperform their counterparts and achieve up to 5.13× speedup in communication and up to 1.63× speedup in end-to-end application performance.
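    The mismatch described in the abstract can be made concrete with a small sketch. It is not the authors' algorithm, which the abstract does not detail. It assumes a hypothetical 4-GPU node in which some GPU pairs are joined by NVLink and the rest fall back to PCIe, with made-up bandwidth figures, and it compares a ring built in plain device order against a ring chosen with knowledge of the physical links. The names `link_bandwidth`, `ring_bottleneck`, and `best_ring` are illustrative only.

```python
from itertools import permutations

# Hypothetical per-link bandwidths in GB/s. These are illustrative
# assumptions, not measurements from the paper or from any real system.
NVLINK, PCIE = 50.0, 16.0


def link_bandwidth(a, b, nvlink_pairs):
    """Bandwidth of the physical link between GPUs a and b."""
    return NVLINK if frozenset((a, b)) in nvlink_pairs else PCIE


def ring_bottleneck(order, nvlink_pairs):
    """A ring moves data only as fast as its slowest hop."""
    n = len(order)
    return min(
        link_bandwidth(order[i], order[(i + 1) % n], nvlink_pairs)
        for i in range(n)
    )


def best_ring(num_gpus, nvlink_pairs):
    """Brute-force the ring ordering with the fastest slowest hop.

    GPU 0 is pinned in first position so rotations of the same ring
    are not enumerated more than once.
    """
    candidates = ((0,) + p for p in permutations(range(1, num_gpus)))
    return max(candidates, key=lambda order: ring_bottleneck(order, nvlink_pairs))


# Assumed physical layout: NVLink on 0-1, 0-2, 1-3, 2-3; PCIe on 0-3 and 1-2.
nvlink_pairs = {frozenset(p) for p in [(0, 1), (0, 2), (1, 3), (2, 3)]}

naive = (0, 1, 2, 3)              # ring in device order, unaware of the links
aware = best_ring(4, nvlink_pairs)

print("naive ring", naive, "bottleneck:", ring_bottleneck(naive, nvlink_pairs))
print("aware ring", aware, "bottleneck:", ring_bottleneck(aware, nvlink_pairs))
```

    Under these assumptions the device-order ring crosses two PCIe hops, so every transfer around it is limited to the PCIe rate, while the reordered ring stays on NVLink throughout. Exhaustive search is only workable for the handful of GPUs in a single node; it is used here for clarity rather than as a recommended approach.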

    Published In

    ICS '21: Proceedings of the 35th ACM International Conference on Supercomputing
    June 2021
    506 pages
    ISBN:9781450383356
    DOI:10.1145/3447818
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. GPU
    2. Heterogeneous inter-GPU connections
    3. NVLink
    4. image reconstruction
    5. neighborhood communication
    6. ptychography

    Qualifiers

    • Research-article

    Funding Sources

    • U.S. Department of Energy

    Conference

    ICS '21
    Acceptance Rates

    ICS '21 Paper Acceptance Rate 39 of 157 submissions, 25%;
    Overall Acceptance Rate 629 of 2,180 submissions, 29%

    Article Metrics

    • Downloads (Last 12 months): 248
    • Downloads (Last 6 weeks): 28
    Reflects downloads up to 12 Aug 2024.

    Cited By

    • (2023) Throughput Optimization with a NUMA-Aware Runtime System for Efficient Scientific Data Streaming. Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, 795-805. DOI: 10.1145/3624062.3624593. Online publication date: 12-Nov-2023.
    • (2023) Deep learning at the edge enables real-time streaming ptychographic imaging. Nature Communications 14:1. DOI: 10.1038/s41467-023-41496-z. Online publication date: 3-Nov-2023.
    • (2022) Image gradient decomposition for parallel and memory-efficient ptychographic reconstruction. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, 1-13. DOI: 10.5555/3571885.3571895. Online publication date: 13-Nov-2022.
    • (2022) Ultrafast Error-bounded Lossy Compression for Scientific Datasets. Proceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing, 159-171. DOI: 10.1145/3502181.3531473. Online publication date: 27-Jun-2022.
    • (2022) Image Gradient Decomposition for Parallel and Memory-Efficient Ptychographic Reconstruction. SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, 1-13. DOI: 10.1109/SC41404.2022.00013. Online publication date: Nov-2022.
    • (2022) Scalable and accurate multi-GPU-based image reconstruction of large-scale ptychography data. Scientific Reports 12:1. DOI: 10.1038/s41598-022-09430-3. Online publication date: 29-Mar-2022.
    • (2022) Efficient microscopy image analysis on CPU-GPU systems with cost-aware irregular data partitioning. Journal of Parallel and Distributed Computing 164:C, 40-54. DOI: 10.1016/j.jpdc.2022.02.004. Online publication date: 1-Jun-2022.
    • (2022) High-Performance Ptychographic Reconstruction with Federated Facilities. Driving Scientific and Engineering Discoveries Through the Integration of Experiment, Big Data, and Modeling and Simulation, 173-189. DOI: 10.1007/978-3-030-96498-6_10. Online publication date: 10-Mar-2022.
    • (2021) cuZ-Checker: A GPU-Based Ultra-Fast Assessment System for Lossy Compressions. 2021 IEEE International Conference on Cluster Computing (CLUSTER), 307-319. DOI: 10.1109/Cluster48925.2021.00065. Online publication date: Sep-2021.
