Article

Neural Acceleration for General-Purpose Approximate Programs

Authors: Hadi Esmaeilzadeh, Adrian Sampson, Doug Burger

MICRO-45: Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture

Pages 449 - 460

https://doi.org/10.1109/MICRO.2012.48

Published: 01 December 2012

Abstract

This paper describes a learning-based approach to the acceleration of approximate programs. We describe the Parrot transformation, a program transformation that selects and trains a neural network to mimic a region of imperative code. After the learning phase, the compiler replaces the original code with an invocation of a low-power accelerator called a neural processing unit (NPU). The NPU is tightly coupled to the processor pipeline to accelerate small code regions. Since neural networks produce inherently approximate results, we define a programming model that allows programmers to identify approximable code regions: code that can produce imprecise but acceptable results. Offloading approximable code regions to NPUs is faster and more energy efficient than executing the original code. For a set of diverse applications, NPU acceleration provides whole-application speedup of 2.3x and energy savings of 3.0x on average with quality loss of at most 9.6%.
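The idea behind the transformation can be illustrated with a small software-only sketch. It is not the paper's compiler or NPU hardware: it only shows the general pattern of observing a pure code region, training a small multilayer perceptron on its input/output pairs, and then calling the trained network in place of the original code. The function names (`precise_region`, `approx_region`), the fixed 2-4-1 topology, and the training hyperparameters are all assumptions chosen for illustration; in particular, the sketch fixes the topology rather than selecting one.

```python
import math
import random


def precise_region(x, y):
    """Hypothetical approximable code region: a pure function of its inputs."""
    return math.sqrt(x * x + y * y)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyMLP:
    """n_in -> n_hidden -> 1 perceptron: sigmoid hidden layer, linear output."""

    def __init__(self, n_in, n_hidden, rng):
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def forward(self, inputs):
        hidden = [sigmoid(sum(w * v for w, v in zip(row, inputs)) + b)
                  for row, b in zip(self.w1, self.b1)]
        out = sum(w * h for w, h in zip(self.w2, hidden)) + self.b2
        return hidden, out

    def predict(self, inputs):
        return self.forward(inputs)[1]

    def train_step(self, inputs, target, lr):
        """One stochastic gradient step on the loss 0.5 * (out - target)^2."""
        hidden, out = self.forward(inputs)
        err = out - target
        for j, h in enumerate(hidden):
            # Backpropagate through the sigmoid before updating w2[j].
            grad_h = err * self.w2[j] * h * (1.0 - h)
            self.w2[j] -= lr * err * h
            for i, v in enumerate(inputs):
                self.w1[j][i] -= lr * grad_h * v
            self.b1[j] -= lr * grad_h
        self.b2 -= lr * err


def mean_abs_error(net, samples):
    return sum(abs(net.predict(s) - precise_region(*s))
               for s in samples) / len(samples)


# "Observation": sample inputs and record what the original code returns.
rng = random.Random(0)
train_inputs = [(rng.random(), rng.random()) for _ in range(500)]
test_inputs = [(rng.random(), rng.random()) for _ in range(200)]

# "Training": fit the network to the recorded input/output pairs.
net = TinyMLP(n_in=2, n_hidden=4, rng=rng)
error_before = mean_abs_error(net, test_inputs)
for _ in range(200):  # epochs (assumed value)
    for sample in train_inputs:
        net.train_step(sample, precise_region(*sample), lr=0.05)
error_after = mean_abs_error(net, test_inputs)


def approx_region(x, y):
    """Stand-in for the replaced call site: invoke the network, not the code."""
    return net.predict((x, y))


print("mean abs error before training:", error_before)
print("mean abs error after training: ", error_after)
print("precise:", precise_region(0.6, 0.8), "approx:", approx_region(0.6, 0.8))
```

In this sketch the caller switches from `precise_region` to `approx_region` and accepts a small output error in exchange. The speed and energy benefits reported in the abstract come from running the network on dedicated hardware, which a Python model does not reproduce.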

Cited By

Fischer M, Ritschel T (2024). ZeroGrads: Learning Local Surrogates for Non-Differentiable Graphics. ACM Transactions on Graphics 43:4 (1-15). DOI: 10.1145/3658173. Online publication date: 19-Jul-2024.
https://dl.acm.org/doi/10.1145/3658173
Ghodrati S, Kinzer S, Xu H, Mahapatra R, Kim Y, Ahn B, Wang D, Karthikeyan L, Yazdanbakhsh A, Park J, Kim N, Esmaeilzadeh H, Tsafrir D, Musuvathi M, Gupta R, Abu-Ghazaleh N (2024). Tandem Processor: Grappling with Emerging Operators in Neural Networks. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (1165-1182). DOI: 10.1145/3620665.3640365. Online publication date: 27-Apr-2024.
https://dl.acm.org/doi/10.1145/3620665.3640365
Hsu K, Tseng H (2024). Simultaneous and Heterogenous Multithreading: Exploiting Simultaneous and Heterogeneous Parallelism in Accelerator-Rich Architectures. IEEE Micro 44:4 (11-19). DOI: 10.1109/MM.2024.3414941. Online publication date: 8-Jul-2024.
https://dl.acm.org/doi/10.1109/MM.2024.3414941


Information & Contributors

Information

Published In

MICRO-45: Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture

December 2012

487 pages

ISBN: 9780769549248

Sponsors

SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing
IEEE-CS: Computer Society

Publisher

IEEE Computer Society

United States

Qualifiers

Article

Acceptance Rates

Overall Acceptance Rate 484 of 2,242 submissions, 22%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 250
Total Downloads: 1,757
Downloads (Last 12 months): 7
Downloads (Last 6 weeks): 1

Reflects downloads up to 25 Dec 2024

Citations

Cited By

Liu Y, Tziantzioulis G, Wentzlaff D (2023). Building Efficient Neural Prefetcher. Proceedings of the International Symposium on Memory Systems (1-12). DOI: 10.1145/3631882.3631903. Online publication date: 2-Oct-2023.
https://dl.acm.org/doi/10.1145/3631882.3631903
Dong W, Ren J, Costan A, Nicolae B, Sato K (2023). AutoConstruct: Automated Neural Surrogate Model Building and Deployment for HPC Applications. Proceedings of the 13th Workshop on AI and Scientific Computing at Scale using Flexible Computing (33-40). DOI: 10.1145/3589013.3596677. Online publication date: 10-Aug-2023.
https://dl.acm.org/doi/10.1145/3589013.3596677
Dong W, Kestor G, Li D, Butt A, Mi N, Chard K (2023). Auto-HPCnet: An Automatic Framework to Build Neural Network-based Surrogate for High-Performance Computing Applications. Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing (31-44). DOI: 10.1145/3588195.3592985. Online publication date: 7-Aug-2023.
https://dl.acm.org/doi/10.1145/3588195.3592985
Fink Z, Parasyris K, Georgakoudis G, Menon H, Mohror K, Arnold D, Badia R (2023). HPAC-Offload: Accelerating HPC Applications with Portable Approximate Computing on the GPU. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (1-14). DOI: 10.1145/3581784.3607095. Online publication date: 12-Nov-2023.
https://dl.acm.org/doi/10.1145/3581784.3607095
Zhang Z, Glova A, Sherwood T, Balkind J, Aamodt T, Jerger N, Swift M (2023). A Prediction System Service. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (48-60). DOI: 10.1145/3575693.3575714. Online publication date: 27-Jan-2023.
https://dl.acm.org/doi/10.1145/3575693.3575714
Damsgaard H, Ometov A, Nurmi J (2023). Approximation Opportunities in Edge Computing Hardware: A Systematic Literature Review. ACM Computing Surveys 55:12 (1-49). DOI: 10.1145/3572772. Online publication date: 3-Mar-2023.
https://dl.acm.org/doi/10.1145/3572772
Liu H, Wang Y, Fan W, Liu X, Li Y, Jain S, Liu Y, Jain A, Tang J (2022). Trustworthy AI: A Computational Perspective. ACM Transactions on Intelligent Systems and Technology 14:1 (1-59). DOI: 10.1145/3546872. Online publication date: 9-Nov-2022.
https://dl.acm.org/doi/10.1145/3546872
