Open access

Accelerating Divergent Applications on SIMD Architectures Using Neural Networks

Published: 09 March 2015

Abstract

This research seeks a neural-network-based solution to the well-known problem of branch divergence in Single Instruction Multiple Data (SIMD) architectures. Our approach differs from existing techniques for handling branch (or control-flow) divergence, which rely on costly hardware modifications, low-utilization masking techniques, or static prediction methods. We examine divergent applications, characterize the degree of data-dependent control flow in each, and isolate the code regions (or “kernels”) that cause the most performance degradation due to branch divergence. We then train neural networks (NNs) offline to approximate these kernels and inject the NN computations directly into the applications as substitutes for the kernels they approximate. This essentially translates control flow into nondivergent computation, trading off precision for performance. Because our methodology manipulates application source code directly, it is inherently platform agnostic and can be adopted as a general means for accelerating divergent applications on data-parallel architectures. In this article, we present the Neuralizer, an automated software flow for kernel identification, NN training, and NN integration, as well as supplementary user-controlled optimization techniques. Evaluating our approach on a variety of divergent applications run on a Graphics Processing Unit (GPU), we achieve average performance gains of 13.6× and energy savings of 14.8× with 96% accuracy.
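The substitution described in the abstract can be illustrated with a minimal sketch. It is not the Neuralizer flow itself: the kernel `divergent_kernel`, the network topology, and the weights are all hypothetical, and the weights are hand-picked here to stand in for values that offline training would normally produce. The point is only that the NN version runs the same sequence of multiply-adds and sigmoid evaluations for every input, whereas the original kernel takes data-dependent branches.

```python
import math

def divergent_kernel(x):
    """Hypothetical kernel with data-dependent control flow.

    On SIMD hardware, lanes whose inputs fall into different branches
    would diverge and be serialized.
    """
    if x < -1.0:
        return 0.0
    elif x < 1.0:
        return 0.5
    return 1.0

# Assumption: these weights are hand-picked for illustration, standing in
# for weights that would be learned offline. Topology: 1 input, 2 sigmoid
# hidden neurons, 1 linear output.
HIDDEN = [(40.0, 40.0), (40.0, -40.0)]  # (weight, bias) per hidden neuron
OUTPUT_WEIGHTS = [0.5, 0.5]
OUTPUT_BIAS = 0.0

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nn_kernel(x):
    """Branch-free approximation: identical operations for every input."""
    hidden = [sigmoid(w * x + b) for w, b in HIDDEN]
    return sum(v * h for v, h in zip(OUTPUT_WEIGHTS, hidden)) + OUTPUT_BIAS

if __name__ == "__main__":
    # Compare the two versions on a few sample inputs.
    for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
        exact = divergent_kernel(x)
        approx = nn_kernel(x)
        print(f"x={x:+.1f} exact={exact:.3f} approx={approx:.3f} "
              f"abs_err={abs(exact - approx):.3f}")
```

In this sketch the approximation is close away from the branch thresholds and loses precision at the thresholds themselves (x = -1 and x = 1), which is a small-scale instance of the precision-for-performance trade-off the abstract describes.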



Published In

ACM Transactions on Architecture and Code Optimization, Volume 12, Issue 1
April 2015
201 pages
ISSN: 1544-3566
EISSN: 1544-3973
DOI: 10.1145/2744295
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 09 March 2015
Accepted: 01 January 2015
Revised: 01 December 2014
Received: 01 September 2014
Published in TACO Volume 12, Issue 1


Author Tags

  1. Branch divergence
  2. approximate computing
  3. hardware acceleration
  4. neural networks

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • NSF Graduate Research Fellowship Grant # DGE-0707424
  • NSF Expedition in Computing Award # CCF-0926127
  • C-FAR (one of six centers of STARnet, an SRC program sponsored by MARCO and DARPA)
