Data and Computation Reuse in CNNs Using Memristor TCAMs

Published: 22 December 2022

Abstract

Exploiting computation and data reuse in CNNs is crucial for the successful design of resource-constrained platforms. In image recognition applications, the high levels of input locality and redundancy present in CNNs make them prime candidates for skipping costly arithmetic operations. One promising technique is to store the function responses of selected input patterns in offline lookup tables and replace online computation with search operations, which are highly efficient when implemented with emerging non-volatile memory technologies. In this work, we rethink both the algorithm and the architecture for exploiting locality and reuse, replacing entire convolutions with searches on Content-addressable Memories. By precomputing convolution results and building compact lookup tables with our novel clustering algorithm, activations can be evaluated in constant time, requiring only a single read of the current input tensor. We then devise a reconfigurable array of processing elements based on memristive Ternary Content-addressable Memories (TCAMs) to implement this algorithmic solution efficiently and to meet the flexibility requirements of diverse CNN architectures. Results show that our design reduces the number of multiplications and memory accesses in proportion to the number of convolutional-layer channels. Average performance reaches 1,172 FPS on AlexNet and 82 FPS on VGG-16, outperforming state-of-the-art designs by 13×.
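The lookup-based convolution flow described in the abstract can be illustrated with a short software sketch. The snippet below is a minimal emulation under stated assumptions, not the authors' implementation: k-means stands in for the paper's clustering algorithm, the TCAM search is emulated by a nearest-key scan, and the names build_lut and lut_convolve are hypothetical.

```python
# Minimal sketch (NOT the paper's implementation) of replacing a
# convolution with a lookup table built by clustering input patches.
# In the proposed hardware, the "nearest pattern" match would be a
# single constant-time search in a memristive TCAM; here it is
# emulated in software with a linear scan.
import numpy as np
from sklearn.cluster import KMeans  # stand-in for the paper's clustering

def build_lut(patches, weights, n_entries=256):
    """Offline step: cluster sample patches and precompute the
    convolution response (dot product) of each cluster centroid."""
    km = KMeans(n_clusters=n_entries, n_init=10).fit(patches)
    keys = km.cluster_centers_          # patterns stored in the TCAM
    values = keys @ weights             # precomputed responses
    return keys, values

def lut_convolve(patch, keys, values):
    """Online step: one 'search' replaces all multiply-accumulates.
    Hardware would return the matching entry in a single TCAM lookup."""
    idx = np.argmin(np.linalg.norm(keys - patch, axis=1))
    return values[idx]

# Toy usage: a 3x3 single-channel kernel flattened to length 9.
rng = np.random.default_rng(0)
weights = rng.standard_normal(9)
train_patches = rng.standard_normal((10_000, 9))
keys, values = build_lut(train_patches, weights)
print(lut_convolve(rng.standard_normal(9), keys, values))
```

Because the online step is a single associative search rather than a sequence of multiply-accumulates, its cost is independent of the kernel size, which is the source of the constant-time evaluation claimed above.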


Cited By

  • (2022) TCAmM CogniGron: Energy Efficient Memristor-Based TCAM for Match-Action Processing. In Proceedings of the 2022 IEEE International Conference on Rebooting Computing (ICRC), 89–99. DOI: 10.1109/ICRC57508.2022.00013. Online publication date: Dec. 2022.



Published In

ACM Transactions on Reconfigurable Technology and Systems, Volume 16, Issue 1
March 2023
403 pages
ISSN:1936-7406
EISSN:1936-7414
DOI: 10.1145/3573311
  • Editor: Deming Chen

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 22 December 2022
Online AM: 20 July 2022
Accepted: 19 June 2022
Revised: 01 May 2022
Received: 07 February 2022
Published in TRETS Volume 16, Issue 1


Author Tags

  1. Neural network accelerators
  2. Ternary Content-addressable Memories (TCAM)
  3. in-memory processing
  4. reconfigurable computing

Qualifiers

  • Research-article
  • Refereed

Funding Sources

  • Coordenação de Aperfeiçoamento de Pessoal de Nível Superior-Brasil (CAPES)-Finance Code 001
  • National Council for Scientific and Technological Development (CNPq)

