Optimizing the Linear Fascicle Evaluation Algorithm for Multi-core and Many-core Systems

Published: 25 November 2020

Abstract

Sparse matrix-vector multiplication (SpMV) operations are commonly used in various scientific and engineering applications. The performance of the SpMV operation often depends on exploiting regularity patterns in the matrix. Various representations and optimization techniques have been proposed to minimize the memory bandwidth bottleneck arising from the irregular memory access pattern involved. Among recent representation techniques, tensor decomposition is a popular one for very large but sparse matrices. After sparse tensor decomposition, the new representation involves indirect accesses, making it challenging to optimize on multi-core systems and even more demanding on massively parallel architectures such as GPUs.
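To make the irregularity concrete, here is a minimal CSR-format SpMV sketch (illustrative only, not a kernel from the paper): the gather through colidx is the indirect, data-dependent access that such representations and optimizations try to tame.

// Minimal CSR SpMV, y = A * x (illustrative sketch, not the paper's kernel).
// rowptr has nrows+1 entries; colidx/val hold the nonzeros row by row.
void spmv_csr(int nrows, const int *rowptr, const int *colidx,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < nrows; ++i) {
        double sum = 0.0;
        for (int j = rowptr[i]; j < rowptr[i + 1]; ++j)
            sum += val[j] * x[colidx[j]];   // indirect access: irregular, cache-unfriendly
        y[i] = sum;
    }
}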
Computational neuroscience algorithms often involve sparse datasets while still performing long-running computations on them. The Linear Fascicle Evaluation (LiFE) application is a popular neuroscience algorithm used for pruning brain connectivity graphs. The datasets employed herein involve the Sparse Tucker Decomposition (STD)—a widely used tensor decomposition method. Using this decomposition leads to multiple indirect array references, making it very difficult to optimize on both multi-core and many-core systems. Recent implementations of the LiFE algorithm show that its SpMV operations are the key bottleneck for performance and scaling.
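As a rough illustration of why the STD-encoded computation is hard to optimize, the following sketch shows the shape of the resulting SpMV; the names (atom, voxel, fiber, coeff, dict) are hypothetical, chosen only to mirror the decomposition's index structure, and do not reproduce the authors' code. Each nonzero update gathers through three index arrays and scatters into an indirectly addressed output block.

// Hypothetical sketch of an STD-style SpMV with multiple indirect accesses.
// y (ngrad x nvoxels, assumed zero-initialized) accumulates the contribution
// of each nonzero coefficient, which carries (atom, voxel, fiber) indices.
void std_spmv_sketch(long nnz, const int *atom, const int *voxel,
                     const int *fiber, const double *coeff,
                     int ngrad,              // measurements per voxel
                     const double *dict,     // dictionary: natoms rows of ngrad values
                     const double *w,        // candidate fiber weights
                     double *y)
{
    for (long k = 0; k < nnz; ++k) {
        double s = coeff[k] * w[fiber[k]];              // indirect read of w
        const double *d = dict + (long)atom[k] * ngrad; // indirect dictionary row
        double *out = y + (long)voxel[k] * ngrad;       // indirect output block
        for (int g = 0; g < ngrad; ++g)
            out[g] += s * d[g];
    }
}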
In this work, we first propose target-independent optimizations for the SpMV operations of LiFE decomposed using the STD technique, followed by target-dependent optimizations for CPU and GPU systems. The target-independent techniques include: (1) standard compiler optimizations to prevent unnecessary and redundant computations, (2) data restructuring techniques to minimize the effects of indirect array accesses, and (3) methods to partition computations among threads to obtain coarse-grained parallelism with low synchronization overhead. We then present target-dependent optimizations for CPUs: (1) efficient synchronization-free thread mapping and (2) BLAS calls to exploit hardware-tuned library implementations. Following that, we present various GPU-specific optimizations to map threads optimally at the granularity of warps, thread blocks, and the grid. Furthermore, to automate the CPU-based optimizations developed for this algorithm, we also extend the PolyMage domain-specific language, embedded in Python. Our highly optimized and parallelized CPU implementation obtains a speedup of 6.3× over a naive parallel CPU implementation running on a 16-core Intel Xeon Silver (Skylake-based) system. In addition, our optimized GPU implementation achieves a speedup of 5.2× over a reference-optimized GPU code version on NVIDIA’s GeForce RTX 2080 Ti GPU, and a speedup of 9.7× over our highly optimized and parallelized CPU implementation.
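As one example of the warp-level mapping mentioned above, the CUDA sketch below assumes (hypothetically) that the nonzeros have already been restructured so that entries of the same voxel are contiguous and indexed by a CSR-style voxel_ptr array; one warp then handles one voxel, and each lane owns a disjoint subset of that voxel's output elements, so no atomics or inter-thread synchronization are needed. This is a sketch of the general technique, not the paper's actual kernel.

// Hypothetical warp-per-voxel SpMV kernel (assumes nonzeros grouped by voxel).
// Launch with at least 32 * nvoxels threads.
__global__ void spmv_warp_per_voxel(int nvoxels, int ngrad,
                                    const long *voxel_ptr,   // nvoxels+1 offsets
                                    const int *atom, const int *fiber,
                                    const double *coeff,
                                    const double *dict,      // natoms rows of ngrad values
                                    const double *w, double *y)
{
    int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int lane    = threadIdx.x % 32;
    if (warp_id >= nvoxels) return;

    // Each lane owns output elements g = lane, lane+32, ... of this voxel,
    // so writes are private to the lane and no synchronization is required.
    for (int g = lane; g < ngrad; g += 32) {
        double acc = 0.0;
        for (long k = voxel_ptr[warp_id]; k < voxel_ptr[warp_id + 1]; ++k)
            acc += coeff[k] * w[fiber[k]] * dict[(long)atom[k] * ngrad + g];
        y[(long)warp_id * ngrad + g] = acc;
    }
}

Grouping nonzeros by voxel is what makes this mapping synchronization-free: without that restructuring step, consecutive nonzeros could target arbitrary voxels and the kernel would need atomic updates to y.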


Cited By

  • (2023) Optimization Techniques for GPU Programming. ACM Computing Surveys 55, 11 (2023), 1–81. https://doi.org/10.1145/3570638. Online publication date: 16 March 2023.

      Published In

      ACM Transactions on Parallel Computing  Volume 7, Issue 4
      Special Issue on Innovations in Systems for Irregular Applications, Part 2
      December 2020
      179 pages
ISSN: 2329-4949
EISSN: 2329-4957
DOI: 10.1145/3426879

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 25 November 2020
      Accepted: 01 June 2020
      Revised: 01 June 2020
      Received: 01 July 2019
      Published in TOPC Volume 7, Issue 4


      Author Tags

      1. GPU
      2. LiFE algorithm
      3. SpMV
      4. connectome
      5. indirect array accesses
      6. multi-core
7. sparse Tucker decomposition
      8. tensor decomposition
      9. tractography

      Qualifiers

      • Research-article
      • Research
      • Refereed
