research-article

Open access

Accelerating Divergent Applications on SIMD Architectures Using Neural Networks

Authors:

Beayna Grigorian,

Glenn Reinman

ACM Transactions on Architecture and Code Optimization (TACO), Volume 12, Issue 1

Article No.: 2, Pages 1 - 23

https://doi.org/10.1145/2717311

Published: 09 March 2015

Abstract

The purpose of this research is to find a neural-network-based solution to the well-known problem of branch divergence in Single Instruction Multiple Data (SIMD) architectures. Our approach differs from existing techniques that handle branch (or control-flow) divergence, which use costly hardware modifications, low-utilization masking techniques, or static prediction methods. As we examine divergent applications, we characterize the degree of data-dependent control flow seen in each and isolate the code regions (or “kernels”) that cause the most performance degradation due to branch divergence. We then train neural networks (NNs) offline to approximate these kernels and inject the NN computations directly into the applications as substitutes for the kernels they approximate. This essentially translates control flow into nondivergent computation, trading off precision for performance. As our methodology manipulates application source code directly, it is inherently platform agnostic and can be adopted as a general means for accelerating divergent applications on data-parallel architectures. In this article, we present the Neuralizer, an automated software flow for kernel identification, NN training, and NN integration, as well as supplementary user-controlled optimization techniques. Evaluating our approach on a variety of divergent applications run on a Graphics Processing Unit (GPU), we achieve average performance gains of 13.6× and energy savings of 14.8× with 96% accuracy.
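The idea of replacing a branchy kernel with a branch-free NN evaluation can be illustrated with a minimal Python sketch. This is not the Neuralizer flow or any code from the article: the kernel `divergent_kernel`, the 1-2-1 network in `nn_kernel`, and its weights are all hypothetical, and the weights are hand-picked here as a stand-in for what an offline training step would produce.

```python
import math

def divergent_kernel(x: float) -> float:
    """Hypothetical kernel with data-dependent branches.

    On a SIMD machine, lanes holding different x values would take
    different paths here, which is the source of branch divergence.
    """
    if x < 0.3:
        return 0.0
    elif x < 0.7:
        return 0.5
    return 1.0

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

# Assumed 1-2-1 network: (weight, bias) for each of two hidden neurons,
# then two output weights and an output bias. Values are hand-picked for
# illustration, not the result of an actual training run.
HIDDEN = [(50.0, -15.0), (50.0, -35.0)]
OUTPUT = (0.5, 0.5, 0.0)

def nn_kernel(x: float) -> float:
    """Branch-free substitute: every input runs the same multiply-adds."""
    h0 = sigmoid(HIDDEN[0][0] * x + HIDDEN[0][1])
    h1 = sigmoid(HIDDEN[1][0] * x + HIDDEN[1][1])
    return OUTPUT[0] * h0 + OUTPUT[1] * h1 + OUTPUT[2]

def mean_abs_error(n: int = 101) -> float:
    """Average |kernel - NN| over n evenly spaced inputs in [0, 1]."""
    xs = [i / (n - 1) for i in range(n)]
    return sum(abs(divergent_kernel(x) - nn_kernel(x)) for x in xs) / n

if __name__ == "__main__":
    print("mean absolute error:", mean_abs_error())
```

The substitute gives up exactness near the branch thresholds (0.3 and 0.7 in this toy kernel) in exchange for a single instruction stream, which mirrors the precision-for-performance trade-off the abstract describes.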



Published In

ACM Transactions on Architecture and Code Optimization Volume 12, Issue 1

April 2015

201 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/2744295

Editor:
Koen De Bosschere
Ghent University

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 09 March 2015

Accepted: 01 January 2015

Revised: 01 December 2014

Received: 01 September 2014

Published in TACO Volume 12, Issue 1

Qualifiers

Research-article
Research
Refereed

Funding Sources

NSF Graduate Research Fellowship Grant # DGE-0707424
NSF Expedition in Computing Award # CCF-0926127
C-FAR (one of six centers of STARnet, an SRC program sponsored by MARCO and DARPA)
