Research article • Open access

Configurable Multi-directional Systolic Array Architecture for Convolutional Neural Networks

Published: 17 July 2021

Abstract

The systolic array is one of the most popular architectures for convolutional neural network (CNN) hardware accelerators. Its biggest advantage is its simple and efficient design principle: without complicated control or dataflow, a systolic-array accelerator can compute traditional convolution very efficiently. However, this advantage also brings new challenges. When computing special types of convolution, such as small-scale convolution or depthwise convolution, the processing element (PE) utilization of the array drops sharply, mainly because the simple architecture limits the flexibility of the systolic array.
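To make the utilization problem concrete, the sketch below estimates spatial PE occupancy under a simple weight-stationary GEMM mapping: the convolution's im2col weight matrix of shape (R·S·C) × K is tiled onto the array, and utilization is the fraction of PE weight slots doing useful work. This is an illustrative model with assumed layer shapes and an assumed 128×128 array, not the paper's exact measurement methodology.

import math

def spatial_utilization(k_dim, n, rows=128, cols=128):
    # Fraction of PE weight slots holding useful weights when a
    # k_dim x n weight matrix is tiled onto a rows x cols
    # weight-stationary array (simplified occupancy model,
    # hypothetical and for illustration only).
    row_tiles = math.ceil(k_dim / rows)
    col_tiles = math.ceil(n / cols)
    useful = k_dim * n                            # useful weight slots
    total = row_tiles * col_tiles * rows * cols   # allocated PE slots
    return useful / total

# Standard 3x3 convolution, 256 input and 256 output channels:
print(f"standard conv:  {spatial_utilization(3 * 3 * 256, 256):.1%}")  # 100.0%

# Depthwise 3x3 convolution: each channel is an independent 9 x 1
# problem, so nearly the whole array sits idle:
print(f"depthwise conv: {spatial_utilization(3 * 3, 1):.3%}")          # 0.055%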
In this article, we design a configurable multi-directional systolic array (CMSA) to address these issues. First, we add a data path to the systolic array that allows users to split the array, through configuration, to speed up small-scale convolution. Second, we redesign the PE unit so that the array supports multiple data-transmission modes and dataflow strategies, allowing users to switch the dataflow of the PE array to speed up depthwise convolution. In addition, unlike other works, we make only a few changes to the existing systolic array architecture, which avoids additional hardware overhead and allows easy deployment in application scenarios that require small systolic arrays, such as mobile terminals. Based on our evaluation, CMSA increases the PE utilization rate by up to 1.6 times compared to a typical systolic array when running the last layers of ResNet-18, and by up to 14.8 times when running depthwise convolution in MobileNet. At the same time, CMSA is similar to a traditional systolic array in area and energy consumption.
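As a rough illustration of why splitting the array helps small-scale convolution, the first-order cycle model below compares one monolithic 128×128 array with four independent 64×64 sub-arrays on a late ResNet-18 layer, where pipeline fill and drain dominate because only 7×7 = 49 outputs stream through per weight tile. The array sizes, layer shape, and cost model are our own assumptions for illustration; the result only echoes the direction of the paper's reported gains.

import math

def gemm_cycles(m, k_dim, n, rows, cols, copies=1):
    # Rough cycle count for a weight-stationary array: each weight tile
    # streams m activation rows and pays rows + cols - 2 extra cycles
    # to fill and drain the pipeline. `copies` equal sub-arrays process
    # independent tiles in parallel. First-order model only.
    tiles = math.ceil(k_dim / rows) * math.ceil(n / cols)
    per_tile = m + rows + cols - 2
    return math.ceil(tiles / copies) * per_tile

# Late ResNet-18 layer: 7x7 outputs (m = 49), 3x3x512 reduction, 512 filters.
m, k_dim, n = 7 * 7, 3 * 3 * 512, 512
full  = gemm_cycles(m, k_dim, n, 128, 128)           # one 128x128 array
split = gemm_cycles(m, k_dim, n, 64, 64, copies=4)   # four 64x64 sub-arrays
print(f"monolithic: {full} cycles, split: {split} cycles, "
      f"gain: {full / split:.2f}x")                  # about 1.73x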





    Published In

ACM Transactions on Architecture and Code Optimization, Volume 18, Issue 4
December 2021, 497 pages
ISSN: 1544-3566
EISSN: 1544-3973
DOI: 10.1145/3476575
This work is licensed under a Creative Commons Attribution 4.0 International License.

    Publisher

Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 17 July 2021
    Accepted: 01 April 2021
    Revised: 01 February 2021
    Received: 01 November 2020
    Published in TACO Volume 18, Issue 4


    Author Tags

    1. Systolic array
    2. convolutional neural network
    3. hardware accelerator

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • National Natural Science Foundation of China (NSFC)
    • National Key Research and Development Project
    • Science and Technology Innovation project of Hunan

Article Metrics

• Downloads (last 12 months): 2,149
• Downloads (last 6 weeks): 235
Reflects downloads up to 10 Nov 2024

Cited By
• (2024) SAL: Optimizing the Dataflow of Spin-based Architectures for Lightweight Neural Networks. ACM Transactions on Architecture and Code Optimization 21, 3 (2024), 1–27. https://doi.org/10.1145/3673654
• (2024) Scratchpad Memory Management for Deep Learning Accelerators. In Proceedings of the 53rd International Conference on Parallel Processing, 629–639. https://doi.org/10.1145/3673038.3673115
• (2024) ReSA: Reconfigurable Systolic Array for Multiple Tiny DNN Tensors. ACM Transactions on Architecture and Code Optimization 21, 3 (2024), 1–24. https://doi.org/10.1145/3653363
• (2024) EPHA: An Energy-efficient Parallel Hybrid Architecture for ANNs and SNNs. ACM Transactions on Design Automation of Electronic Systems 29, 3 (2024), 1–28. https://doi.org/10.1145/3643134
• (2024) SparGD: A Sparse GEMM Accelerator with Dynamic Dataflow. ACM Transactions on Design Automation of Electronic Systems 29, 2 (2024), 1–32. https://doi.org/10.1145/3634703
• (2024) Design of a Low-Latency General-Purpose CNN Hardware Accelerator Based on Pulsed Arrays on FPGAs. In 2024 20th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD), 1–8. https://doi.org/10.1109/ICNC-FSKD64080.2024.10702206
• (2024) S2TAR: Shared Secure Trusted Accelerators with Reconfiguration for Machine Learning in the Cloud. In 2024 IEEE 17th International Conference on Cloud Computing (CLOUD), 267–278. https://doi.org/10.1109/CLOUD62652.2024.00038
• (2023) End-to-End Implementation of a Convolutional Neural Network on a 3D-Integrated Image Sensor with Macropixel Array. Sensors 23, 4 (2023), 1909. https://doi.org/10.3390/s23041909
• (2023) A Survey of Design and Optimization for Systolic Array-based DNN Accelerators. ACM Computing Surveys 56, 1 (2023), 1–37. https://doi.org/10.1145/3604802
• (2023) Accelerating Attention Mechanism on FPGAs based on Efficient Reconfigurable Systolic Array. ACM Transactions on Embedded Computing Systems 22, 6 (2023), 1–22. https://doi.org/10.1145/3549937
