Data and Computation Reuse in CNNs Using Memristor TCAMs

Published: 22 December 2022

Abstract

Exploiting computation and data reuse in CNNs is crucial for the successful design of resource-constrained platforms. In image recognition applications, the high levels of input locality and redundancy present in CNNs make them prime candidates for skipping costly arithmetic operations. One promising technique is to store the function responses of selected input patterns in offline lookup tables and replace online computation with search operations, which are highly efficient when implemented with emerging non-volatile memory technologies. In this work, we rethink both the algorithm and the architecture for exploiting locality and reuse, replacing entire convolutions with searches on Content-addressable Memories. By precomputing convolution results and building compact lookup tables with our novel clustering algorithm, activations can be evaluated in constant time, requiring only a single read of the current input tensor. We then devise a reconfigurable array of processing elements based on memristive Ternary Content-addressable Memories (TCAMs) to implement this algorithmic solution efficiently and to meet the flexibility requirements of diverse CNN architectures. Results show that our design reduces the number of multiplications and memory accesses in proportion to the number of convolutional-layer channels. Average performance reaches 1,172 FPS on AlexNet and 82 FPS on VGG-16, outperforming state-of-the-art designs by 13×.
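The lookup-based convolution flow described in the abstract can be illustrated with a short software sketch. The snippet below is a minimal emulation under stated assumptions, not the authors' implementation: k-means stands in for the paper's clustering algorithm, the TCAM search is emulated by a nearest-key scan, and the names build_lut and lut_convolve are hypothetical.

```python
# Minimal sketch (NOT the paper's implementation) of replacing a
# convolution with a lookup table built by clustering input patches.
# In the proposed hardware, the "nearest pattern" match would be a
# single constant-time search in a memristive TCAM; here it is
# emulated in software with a linear scan.
import numpy as np
from sklearn.cluster import KMeans  # stand-in for the paper's clustering

def build_lut(patches, weights, n_entries=256):
    """Offline step: cluster sample patches and precompute the
    convolution response (dot product) of each cluster centroid."""
    km = KMeans(n_clusters=n_entries, n_init=10).fit(patches)
    keys = km.cluster_centers_          # patterns stored in the TCAM
    values = keys @ weights             # precomputed responses
    return keys, values

def lut_convolve(patch, keys, values):
    """Online step: one 'search' replaces all multiply-accumulates.
    Hardware would return the matching entry in a single TCAM lookup."""
    idx = np.argmin(np.linalg.norm(keys - patch, axis=1))
    return values[idx]

# Toy usage: a 3x3 single-channel kernel flattened to length 9.
rng = np.random.default_rng(0)
weights = rng.standard_normal(9)
train_patches = rng.standard_normal((10_000, 9))
keys, values = build_lut(train_patches, weights)
print(lut_convolve(rng.standard_normal(9), keys, values))
```

Because the online step is a single associative search rather than a sequence of multiply-accumulates, its cost is independent of the kernel size, which is the source of the constant-time evaluation claimed above.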


Cited By

  • (2022) TCAmM CogniGron: Energy Efficient Memristor-Based TCAM for Match-Action Processing. In Proceedings of the 2022 IEEE International Conference on Rebooting Computing (ICRC), 89–99. DOI: 10.1109/ICRC57508.2022.00013. Online publication date: Dec. 2022.



Published In

ACM Transactions on Reconfigurable Technology and Systems, Volume 16, Issue 1
March 2023
403 pages
ISSN:1936-7406
EISSN:1936-7414
DOI: 10.1145/3573311
  • Editor: Deming Chen

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 22 December 2022
Online AM: 20 July 2022
Accepted: 19 June 2022
Revised: 01 May 2022
Received: 07 February 2022
Published in TRETS Volume 16, Issue 1


Author Tags

  1. Neural network accelerators
  2. Ternary Content-addressable Memories (TCAM)
  3. in-memory processing
  4. reconfigurable computing

Qualifiers

  • Research-article
  • Refereed

Funding Sources

  • Coordenação de Aperfeiçoamento de Pessoal de Nível Superior-Brasil (CAPES)-Finance Code 001
  • National Council for Scientific and Technological Development (CNPq)

