DOI: 10.1145/3330345.3330384
research-article | Public Access

Deep reuse: streamline CNN inference on the fly via coarse-grained computation reuse

Published: 26 June 2019

Abstract

This paper presents deep reuse, a method for speeding up CNN inferences by detecting and exploiting deep reusable computations on the fly. It empirically reveals the massive similarities among neuron vectors in activation maps, both within CNN inferences on an input and across inputs. It gives an in-depth study on how to effectively turn the similarities into beneficial computation reuse to speed up CNN inferences. The investigation covers various factors, ranging from the clustering methods for similarity detection, to clustering scopes, similarity metrics, and neuron vector granularities. The insights help create deep reuse. As an on-line method, deep reuse is easy to apply, and adapts to each CNN (compressed or not) and its input. Using no special hardware support or CNN model changes, this method speeds up inferences by 1.77--2X (up to 4.3X layer-wise) on the fly with virtually no (< 0.0005) loss in accuracy.
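To make the reuse idea concrete, below is a minimal Python/NumPy sketch of coarse-grained computation reuse for a single layer. It is not the authors' implementation: it assumes the layer has already been lowered to a matrix product X @ W (im2col), it substitutes a simple random-projection hash for the clustering methods the paper actually studies, and it treats whole rows of X as the neuron vectors, ignoring the granularity choices the paper explores. The name reuse_matmul and the parameter n_bits are illustrative.

import numpy as np

def reuse_matmul(X, W, n_bits=8, seed=0):
    """Approximate X @ W by grouping similar rows of X (neuron vectors),
    multiplying one centroid per group, and broadcasting each group's result."""
    rng = np.random.default_rng(seed)
    # 1. LSH-style signature: hash every row of X against random hyperplanes.
    planes = rng.standard_normal((X.shape[1], n_bits))
    bits = (X @ planes > 0).astype(np.int64)
    signatures = bits @ (1 << np.arange(n_bits))
    # 2. Rows that share a signature are treated as similar and form one group.
    ids, inverse = np.unique(signatures, return_inverse=True)
    # 3. Average each group into a centroid and multiply only the centroids.
    centroids = np.zeros((len(ids), X.shape[1]))
    np.add.at(centroids, inverse, X)
    centroids /= np.bincount(inverse, minlength=len(ids))[:, None]
    partial = centroids @ W  # len(ids) row-products instead of len(X)
    # 4. Reuse: every row in a group receives its group's result.
    return partial[inverse]

# Toy usage: the fewer distinct groups, the more computation is reused.
X = np.random.rand(1024, 64)   # unfolded activations (rows = neuron vectors)
W = np.random.rand(64, 128)    # layer weights
print(np.abs(reuse_matmul(X, W) - X @ W).mean())

The actual method's accuracy and speedup hinge on the choices the abstract lists (clustering method, clustering scope, similarity metric, and neuron-vector granularity); this sketch only illustrates why grouping similar neuron vectors lets one matrix product serve many of them.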




    Published In

    ICS '19: Proceedings of the ACM International Conference on Supercomputing
    June 2019
    533 pages
    ISBN: 9781450360791
    DOI: 10.1145/3330345

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. GPU
    2. deep neural networks
    3. program optimizations

    Qualifiers

    • Research-article

    Conference

    ICS '19

    Acceptance Rates

    Overall Acceptance Rate 629 of 2,180 submissions, 29%

    Cited By

    • (2024) Enabling Efficient Deep Learning on MCU With Transient Redundancy Elimination. IEEE Transactions on Computers 73(12), 2649-2663, Dec-2024. DOI: 10.1109/TC.2024.3449102
    • (2024) CamPU: A Multi-Camera Processing Unit for Deep Learning-based 3D Spatial Computing Systems. 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), 50-63, 2-Nov-2024. DOI: 10.1109/MICRO61859.2024.00014
    • (2024) Exploiting beam search confidence for energy-efficient speech recognition. The Journal of Supercomputing 80(17), 24908-24937, 15-Jul-2024. DOI: 10.1007/s11227-024-06351-y
    • (2023) Kairos: Building Cost-Efficient Machine Learning Inference Systems with Heterogeneous Cloud Resources. Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing, 3-16, 7-Aug-2023. DOI: 10.1145/3588195.3592997
    • (2023) Space-Efficient TREC for Enabling Deep Learning on Microcontrollers. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 644-659, 25-Mar-2023. DOI: 10.1145/3582016.3582062
    • (2023) ConvReLU++: Reference-based Lossless Acceleration of Conv-ReLU Operations on Mobile CPU. Proceedings of the 21st Annual International Conference on Mobile Systems, Applications and Services, 503-515, 18-Jun-2023. DOI: 10.1145/3581791.3596831
    • (2023) TGOpt. Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 354-368, 25-Feb-2023. DOI: 10.1145/3572848.3577490
    • (2023) Expanding the Edge: Enabling Efficient Winograd CNN Inference With Deep Reuse on Edge Device. IEEE Transactions on Knowledge and Data Engineering 35(10), 10181-10196, 1-Oct-2023. DOI: 10.1109/TKDE.2023.3269017
    • (2023) MPress: Democratizing Billion-Scale Model Training on Multi-GPU Servers via Memory-Saving Inter-Operator Parallelism. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 556-569, Feb-2023. DOI: 10.1109/HPCA56546.2023.10071077
    • (2023) MERCURY: Accelerating DNN Training By Exploiting Input Similarity. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 638-650, Feb-2023. DOI: 10.1109/HPCA56546.2023.10071051
