DOI: 10.1145/3330345.3332469
research-article

Optimizing the linear fascicle evaluation algorithm for many-core systems

Published: 26 June 2019

Abstract

Sparse matrix-vector multiplication (SpMV) is commonly used in scientific and engineering applications. Its performance often depends on exploiting regularity patterns in the matrix. Various representations and optimization techniques have been proposed to reduce the memory bandwidth bottleneck that arises from the irregular memory access pattern involved. Among recent representation techniques, tensor decomposition is a popular choice for very large, sparse matrices. After sparse-tensor decomposition, the new representation involves indirect accesses, which makes it harder to optimize for massively parallel hardware such as GPUs.
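To make the irregular access pattern concrete, here is a minimal sketch of SpMV over the standard compressed sparse row (CSR) layout. It is an illustration only, not code from the paper; the function name `spmv_csr` and the small example matrix are assumptions chosen for the sketch.

```python
def spmv_csr(row_ptr, col_idx, vals, x):
    """Compute y = A @ x for a matrix A stored in CSR form."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        # Nonzeros of this row occupy positions row_ptr[row] .. row_ptr[row + 1] - 1.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # x is read through col_idx: an indirect, data-dependent access.
            acc += vals[k] * x[col_idx[k]]
        y[row] = acc
    return y


# Example matrix (assumed for illustration):
# [[2, 0, 1],
#  [0, 3, 0],
#  [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
vals = [2.0, 1.0, 3.0, 4.0, 5.0]
x = [1.0, 2.0, 3.0]
y = spmv_csr(row_ptr, col_idx, vals, x)
```

The reads of `vals` and `col_idx` are sequential, but the reads of `x` follow whatever column order the matrix happens to have, which is the source of the bandwidth problem described above.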
Computational neuroscience algorithms often perform long-running computations on sparse datasets. The Linear Fascicle Evaluation (LiFE) application is a popular neuroscience algorithm for pruning brain connectivity graphs. Its datasets are represented using the Sparse Tucker Decomposition (STD), a widely used tensor decomposition method. This decomposition leads to multiple irregular array references, which makes the computation very difficult to optimize for GPUs. Recent implementations of the LiFE algorithm show that its SpMV operations are the key bottleneck for performance and scaling. In this paper, we first propose data restructuring techniques to minimize the effects of irregular accesses. We then propose optimizations that map threads effectively at the granularity of warps, thread blocks, and the grid, along with methods to partition the computation among thread blocks to obtain fine-grained parallelism and data reuse. Our optimized GPU implementation achieves a speedup of 5.2× over a reference optimized GPU code version on NVIDIA's GeForce RTX 2080 Ti GPU, and a speedup of 9.7× over a highly optimized and parallelized CPU implementation running on a 16-core Intel Xeon Silver (Skylake-based) system.
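The abstract does not spell out the computation, so the following is a hedged sketch of what an STD-style SpMV with multiple indirect references might look like. It assumes the sparse core tensor is stored as coordinate lists (`atoms`, `voxels`, `fibers`, `vals`) alongside a small dense dictionary `D` and a weight vector `w`; all of these names, the update rule, and the sort-by-voxel step are assumptions for illustration, not the authors' data layout or their restructuring technique.

```python
def std_spmv(atoms, voxels, fibers, vals, D, w, n_voxels):
    """Accumulate Y[:, v] += D[:, a] * (c * w[f]) for each nonzero (a, v, f, c)."""
    n_theta = len(D)  # D is a list of rows: n_theta x n_atoms
    Y = [[0.0] * n_voxels for _ in range(n_theta)]
    for a, v, f, c in zip(atoms, voxels, fibers, vals):
        scale = c * w[f]               # indirect read of w through fibers
        for t in range(n_theta):
            Y[t][v] += D[t][a] * scale  # indirect read of D, indirect write of Y
    return Y


def sort_by_voxel(atoms, voxels, fibers, vals):
    """Reorder the coordinate lists so entries with the same voxel are adjacent."""
    order = sorted(range(len(voxels)), key=voxels.__getitem__)
    return tuple([arr[i] for i in order] for arr in (atoms, voxels, fibers, vals))


# Tiny assumed example: 2 directions, 2 dictionary atoms, 2 voxels, 2 fibers.
D = [[1.0, 2.0],
     [3.0, 4.0]]
w = [2.0, 3.0]
atoms, voxels, fibers, vals = [0, 1, 0], [1, 0, 1], [0, 1, 1], [0.5, 2.0, 1.0]

Y = std_spmv(atoms, voxels, fibers, vals, D, w, n_voxels=2)
Y_sorted = std_spmv(*sort_by_voxel(atoms, voxels, fibers, vals), D, w, n_voxels=2)
```

Each nonzero triggers three index-driven accesses (into `w`, `D`, and `Y`). Sorting by one index, as in `sort_by_voxel`, is one simple way to group updates to the same output column so they become contiguous; it leaves the result unchanged and only alters the traversal order.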

Cited By

  • (2020) Optimizing the Linear Fascicle Evaluation Algorithm for Multi-core and Many-core Systems. ACM Transactions on Parallel Computing 7:4 (1-45). DOI: 10.1145/3418075. Online publication date: 25-Nov-2020
  • (2020) Model-Based Warp Overlapped Tiling for Image Processing Programs on GPUs. Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques (317-328). DOI: 10.1145/3410463.3414649. Online publication date: 30-Sep-2020

Information & Contributors

Information

Published In

ICS '19: Proceedings of the ACM International Conference on Supercomputing
June 2019
533 pages
ISBN: 9781450360791
DOI: 10.1145/3330345
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. GPU
  2. LiFE Algorithm
  3. SBBNNLS
  4. SpMV
  5. connectome
  6. dMRI
  7. indirect array access
  8. sparse tucker decomposition
  9. tractography

Qualifiers

  • Research-article

Funding Sources

  • Science and Engineering Research Board, India

Conference

ICS '19
Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%
