research-article

CUDA optimization strategies for compute- and memory-bound neuroimaging algorithms

Published: 01 June 2012

Abstract

As neuroimaging algorithms and technology continue to grow in complexity and image resolution faster than CPU performance improves, data-parallel computing methods will become increasingly important. The high-performance, data-parallel architecture of modern graphics processing units (GPUs) can reduce computation times by orders of magnitude. However, this massively threaded architecture introduces challenges when GPU resources are exceeded. This paper presents optimization strategies for compute- and memory-bound algorithms on the CUDA architecture. For compute-bound algorithms, register usage is reduced through variable reuse via shared memory, and data throughput is increased through heavier thread workloads and by maximizing the thread configuration for a single thread block per multiprocessor. For memory-bound algorithms, the data are fit into the fast but limited GPU resources by reorganizing them into self-contained structures and employing a multi-pass approach. Memory latencies are reduced by selecting memory resources whose cache performance is optimized for the algorithm's access patterns. We demonstrate the strategies on two computationally expensive algorithms and achieve optimized GPU implementations that perform up to 6× faster than unoptimized ones. Compared to CPU implementations, we achieve peak GPU speedups of 129× for the 3D unbiased nonlinear image registration technique and 93× for the non-local means surface denoising algorithm.
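
The multi-pass idea for memory-bound algorithms can be illustrated with a small host-side sketch. This is not the authors' implementation, which the abstract does not detail. It only assumes that a 1D dataset is split into chunks that each fit a fixed memory budget, and that each chunk carries a few neighbouring items (a "halo") so it can be processed without reading other chunks. The function name `plan_passes` and all of its parameters are hypothetical.

```python
def plan_passes(n_items, bytes_per_item, budget_bytes, halo=0):
    """Plan a multi-pass traversal of n_items under a memory budget.

    Hypothetical helper: each pass loads one self-contained chunk made of
    a "core" range (the items whose results are written in that pass) plus
    up to `halo` neighbouring items on each side, clamped to the data bounds.
    Returns (load_start, load_end, core_start, core_end) tuples, half-open.
    """
    max_items = budget_bytes // bytes_per_item   # items that fit in one pass
    core = max_items - 2 * halo                  # reserve room for both halos
    if core <= 0:
        raise ValueError("budget too small for one item plus its halo")

    passes = []
    for core_start in range(0, n_items, core):
        core_end = min(core_start + core, n_items)
        load_start = max(core_start - halo, 0)
        load_end = min(core_end + halo, n_items)
        passes.append((load_start, load_end, core_start, core_end))
    return passes


if __name__ == "__main__":
    # 10 four-byte items, a 24-byte budget (6 items), one halo item per side.
    for p in plan_passes(10, 4, 24, halo=1):
        print(p)
```

With these example numbers the plan has three passes, none of which loads more than six items, and the core ranges together cover every item exactly once. On a real GPU the budget would be the capacity of whichever memory resource the kernel targets, which depends on the device.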

Publisher

Elsevier North-Holland, Inc.

United States

Author Tags

  1. CUDA
  2. Compute-bound
  3. Graphics Processing Unit (GPU)
  4. Memory-bound
  5. Neuroimaging
  6. Performance Optimization

Qualifiers

  • Research-article

Cited By

  • (2023) Optimization Techniques for GPU Programming. ACM Computing Surveys 55:11 (1-81). DOI: 10.1145/3570638. Online publication date: 16-Mar-2023
  • (2020) Petascale XCT. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (1-13). DOI: 10.5555/3433701.3433750. Online publication date: 9-Nov-2020
  • (2019) MemXCT. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (1-56). DOI: 10.1145/3295500.3356220. Online publication date: 17-Nov-2019
  • (2018) Exploring parallel multi-GPU local search strategies in a metaheuristic framework. Journal of Parallel and Distributed Computing 111:C (39-55). DOI: 10.1016/j.jpdc.2017.06.011. Online publication date: 1-Jan-2018
  • (2018) Parallelized implementation of an explicit finite element method in many integrated core (MIC) architecture. Advances in Engineering Software 116:C (50-59). DOI: 10.1016/j.advengsoft.2017.12.001. Online publication date: 1-Feb-2018
  • (2018) Embedded GPU implementation of sensor correction for on-board real-time stream computing of high-resolution optical satellite imagery. Journal of Real-Time Image Processing 15:3 (565-581). DOI: 10.1007/s11554-017-0741-0. Online publication date: 1-Oct-2018
  • (2016) An Evaluation of Emerging Many-Core Parallel Programming Models. Proceedings of the 7th International Workshop on Programming Models and Applications for Multicores and Manycores (1-10). DOI: 10.1145/2883404.2883420. Online publication date: 12-Mar-2016
  • (2016) Hybrid CPU-GPU constraint checking. Information and Software Technology 74:C (230-242). DOI: 10.1016/j.infsof.2015.10.003. Online publication date: 1-Jun-2016
  • (2016) GPU-accelerated iterative reconstruction from Compton scattered data using a matched pair of conic projector and backprojector. Computer Methods and Programs in Biomedicine 131:C (27-36). DOI: 10.1016/j.cmpb.2016.04.012. Online publication date: 1-Jul-2016
  • (2015) Scaling up MapReduce-based Big Data Processing on Multi-GPU systems. Cluster Computing 18:1 (369-383). DOI: 10.1007/s10586-014-0400-1. Online publication date: 1-Mar-2015
