DOI: 10.1145/3330345.3330384
research-article | Public Access

Deep reuse: streamline CNN inference on the fly via coarse-grained computation reuse

Published: 26 June 2019

Abstract

This paper presents deep reuse, a method for speeding up CNN inferences by detecting and exploiting deep reusable computations on the fly. It empirically reveals the massive similarities among neuron vectors in activation maps, both within CNN inferences on an input and across inputs. It gives an in-depth study on how to effectively turn the similarities into beneficial computation reuse to speed up CNN inferences. The investigation covers various factors, ranging from the clustering methods for similarity detection, to clustering scopes, similarity metrics, and neuron vector granularities. The insights help create deep reuse. As an on-line method, deep reuse is easy to apply, and adapts to each CNN (compressed or not) and its input. Using no special hardware support or CNN model changes, this method speeds up inferences by 1.77--2X (up to 4.3X layer-wise) on the fly with virtually no (< 0.0005) loss in accuracy.
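To make the reuse idea concrete, below is a minimal Python/NumPy sketch of coarse-grained computation reuse for a single layer. It is not the authors' implementation: it assumes the layer has already been lowered to a matrix product X @ W (im2col), it substitutes a simple random-projection hash for the clustering methods the paper actually studies, and it treats whole rows of X as the neuron vectors, ignoring the granularity choices the paper explores. The name reuse_matmul and the parameter n_bits are illustrative.

import numpy as np

def reuse_matmul(X, W, n_bits=8, seed=0):
    """Approximate X @ W by grouping similar rows of X (neuron vectors),
    multiplying one centroid per group, and broadcasting each group's result."""
    rng = np.random.default_rng(seed)
    # 1. LSH-style signature: hash every row of X against random hyperplanes.
    planes = rng.standard_normal((X.shape[1], n_bits))
    bits = (X @ planes > 0).astype(np.int64)
    signatures = bits @ (1 << np.arange(n_bits))
    # 2. Rows that share a signature are treated as similar and form one group.
    ids, inverse = np.unique(signatures, return_inverse=True)
    # 3. Average each group into a centroid and multiply only the centroids.
    centroids = np.zeros((len(ids), X.shape[1]))
    np.add.at(centroids, inverse, X)
    centroids /= np.bincount(inverse, minlength=len(ids))[:, None]
    partial = centroids @ W  # len(ids) row-products instead of len(X)
    # 4. Reuse: every row in a group receives its group's result.
    return partial[inverse]

# Toy usage: the fewer distinct groups, the more computation is reused.
X = np.random.rand(1024, 64)   # unfolded activations (rows = neuron vectors)
W = np.random.rand(64, 128)    # layer weights
print(np.abs(reuse_matmul(X, W) - X @ W).mean())

The actual method's accuracy and speedup hinge on the choices the abstract lists (clustering method, clustering scope, similarity metric, and neuron-vector granularity); this sketch only illustrates why grouping similar neuron vectors lets one matrix product serve many of them.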




    Published In

    ICS '19: Proceedings of the ACM International Conference on Supercomputing
    June 2019
    533 pages
    ISBN: 9781450360791
    DOI: 10.1145/3330345

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. GPU
    2. deep neural networks
    3. program optimizations

    Qualifiers

    • Research-article

    Conference

    ICS '19

    Acceptance Rates

    Overall Acceptance Rate 629 of 2,180 submissions, 29%

    Cited By

    • (2024) Enabling Efficient Deep Learning on MCU With Transient Redundancy Elimination. IEEE Transactions on Computers 73(12), 2649-2663, Dec-2024. DOI: 10.1109/TC.2024.3449102
    • (2024) CamPU: A Multi-Camera Processing Unit for Deep Learning-based 3D Spatial Computing Systems. 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), 50-63, 2-Nov-2024. DOI: 10.1109/MICRO61859.2024.00014
    • (2024) Exploiting beam search confidence for energy-efficient speech recognition. The Journal of Supercomputing 80(17), 24908-24937, 15-Jul-2024. DOI: 10.1007/s11227-024-06351-y
    • (2023) Kairos: Building Cost-Efficient Machine Learning Inference Systems with Heterogeneous Cloud Resources. Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing, 3-16, 7-Aug-2023. DOI: 10.1145/3588195.3592997
    • (2023) Space-Efficient TREC for Enabling Deep Learning on Microcontrollers. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 644-659, 25-Mar-2023. DOI: 10.1145/3582016.3582062
    • (2023) ConvReLU++: Reference-based Lossless Acceleration of Conv-ReLU Operations on Mobile CPU. Proceedings of the 21st Annual International Conference on Mobile Systems, Applications and Services, 503-515, 18-Jun-2023. DOI: 10.1145/3581791.3596831
    • (2023) TGOpt. Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 354-368, 25-Feb-2023. DOI: 10.1145/3572848.3577490
    • (2023) Expanding the Edge: Enabling Efficient Winograd CNN Inference With Deep Reuse on Edge Device. IEEE Transactions on Knowledge and Data Engineering 35(10), 10181-10196, 1-Oct-2023. DOI: 10.1109/TKDE.2023.3269017
    • (2023) MPress: Democratizing Billion-Scale Model Training on Multi-GPU Servers via Memory-Saving Inter-Operator Parallelism. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 556-569, Feb-2023. DOI: 10.1109/HPCA56546.2023.10071077
    • (2023) MERCURY: Accelerating DNN Training By Exploiting Input Similarity. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 638-650, Feb-2023. DOI: 10.1109/HPCA56546.2023.10071051
