DOI: 10.1145/3620666.3651349

Research article · Open access

Dr. DNA: Combating Silent Data Corruptions in Deep Learning using Distribution of Neuron Activations

Published: 27 April 2024

Abstract
    Deep neural networks (DNNs) have been widely adopted in safety-critical applications such as computer vision and autonomous driving. However, as technology scales and applications diversify, coupled with the increasing heterogeneity of underlying hardware architectures, silent data corruption (SDC) has emerged as a pronounced threat to the reliability of DNNs. Recent reports from industry hyperscalers underscore the difficulty of addressing SDCs due to their "stealthy" nature and elusive manifestation. In this paper, we propose Dr. DNA, a novel approach to enhancing the reliability of DNN systems by detecting and mitigating SDCs. Specifically, we formulate and extract a set of unique SDC signatures from the Distribution of Neuron Activations (DNA), based on which we propose early-stage detection and mitigation of SDCs during DNN inference. We perform an extensive evaluation across 3 vision tasks, 5 different datasets, and 10 different models, under 4 different error models. Results show that Dr. DNA achieves a 100% SDC detection rate in most cases, a 95% detection rate on average, and a >90% detection rate across all cases, representing a 20%-70% improvement over baselines. Dr. DNA can also mitigate the impact of SDCs by effectively recovering DNN model performance with <1% memory overhead and <2.5% latency overhead.
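    The core idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a simple percentile-based activation profile per layer, a hypothetical anomaly threshold, and range-clamping as the mitigation step, all of which stand in for the paper's actual DNA signatures.

    ```python
    # Illustrative sketch (NOT the paper's method): detect silent data corruption
    # by comparing a layer's activation distribution at inference time against a
    # profile built from fault-free runs, then mitigate by clamping anomalous
    # activations. Percentiles, threshold, and function names are hypothetical.
    import numpy as np

    def profile_layer(clean_activations, lo_pct=0.5, hi_pct=99.5):
        """Record robust per-layer activation bounds from fault-free runs."""
        a = np.concatenate([x.ravel() for x in clean_activations])
        return np.percentile(a, lo_pct), np.percentile(a, hi_pct)

    def sdc_score(activations, bounds):
        """Fraction of activations falling outside the profiled clean range."""
        lo, hi = bounds
        a = activations.ravel()
        return np.mean((a < lo) | (a > hi))

    def detect_and_mitigate(activations, bounds, threshold=0.01):
        """Flag the layer if too many activations are anomalous; recover by
        clamping them back into the profiled range."""
        corrupted = sdc_score(activations, bounds) > threshold
        if corrupted:
            lo, hi = bounds
            activations = np.clip(activations, lo, hi)
        return corrupted, activations

    # Example: build a clean profile, then inject a bit-flip-like corruption.
    rng = np.random.default_rng(0)
    clean = [rng.normal(0.0, 1.0, size=(64,)) for _ in range(100)]
    bounds = profile_layer(clean)

    faulty = rng.normal(0.0, 1.0, size=(64,))
    faulty[3] = 1e30  # a single corrupted high-magnitude activation
    flagged, repaired = detect_and_mitigate(faulty, bounds)
    print(flagged, repaired.max() <= bounds[1])
    ```

    In a real deployment, such checks would run on intermediate activations during inference (e.g., via framework hooks), and the paper's signatures are richer than simple min/max bounds; this sketch only conveys the distribution-comparison intuition.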


    Published In
    ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3
    April 2024, 1106 pages
    ISBN: 9798400703867
    DOI: 10.1145/3620666
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Conference

    ASPLOS '24. Overall acceptance rate: 535 of 2,713 submissions, 20%.
