Protecting Caches from Soft Errors: A Microarchitect’s Perspective

Published: 11 May 2017

Abstract

Soft errors are among the most important design concerns in modern embedded systems under aggressive technology scaling. Among the microarchitectural components of a processor, the cache is the most susceptible to soft errors. Error detection and correction codes are common protection techniques for cache memory due to their design simplicity. To design effective protection techniques for caches, it is important to estimate their susceptibility quantitatively, both without and with protection. At the architectural level, vulnerability is the metric used to quantify the susceptibility of data in caches. However, existing tools and techniques calculate the vulnerability of data in caches through coarse-grained block-level estimation. Further, they ignore common cache protection techniques such as error detection and correction codes. In this article, we demonstrate through intensive fault injection campaigns that our word-level vulnerability estimation is more accurate than block-level estimation. Further, our extensive experiments over benchmark suites reveal several counter-intuitive and interesting results. Parity checking performed only on reads provides more reliable and power-efficient protection than parity checking performed on both reads and writes. On the other hand, checking error-correcting codes only at reads can leave the cache vulnerable even to single-bit soft errors, while checking at both reads and writes provides perfect reliability.
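
To make the block-level versus word-level distinction concrete, the following is a minimal sketch in Python, not the article's tool or method. It assumes a toy model of one unprotected cache block that is evicted clean (no write-back): a word is counted as vulnerable over any interval that ends with a read of that word, and an interval that ends with a write is not vulnerable because the value is overwritten. The block-level variant assumes every access touches all words of the block. The function names, the event format, and the trace are hypothetical, chosen only for illustration.

```python
from typing import List, Tuple

# Hypothetical event format: (cycle, "read" or "write", word index).
# Events must be sorted by cycle.
Event = Tuple[int, str, int]


def word_level_vulnerability(fill_cycle: int, events: List[Event],
                             words_per_block: int) -> int:
    """Vulnerable word-cycles when each word is tracked separately."""
    last_access = [fill_cycle] * words_per_block
    total = 0
    for cycle, op, word in events:
        if op == "read":
            # The value held since the last access of this word is consumed.
            total += cycle - last_access[word]
        # After a read or a write, a new interval starts for this word only.
        last_access[word] = cycle
    return total


def block_level_vulnerability(fill_cycle: int, events: List[Event],
                              words_per_block: int) -> int:
    """Vulnerable word-cycles when any access is charged to the whole block."""
    last_access = fill_cycle
    total = 0
    for cycle, op, _word in events:
        if op == "read":
            # Assumption: a read of one word marks every word as consumed.
            total += (cycle - last_access) * words_per_block
        last_access = cycle
    return total


# Toy trace: block filled at cycle 0, 4 words per block.
trace = [(10, "read", 0), (30, "write", 1), (50, "read", 0)]
print(word_level_vulnerability(0, trace, 4))   # 50
print(block_level_vulnerability(0, trace, 4))  # 120
```

In this toy trace only word 0 is ever read, so the word-level count charges 50 word-cycles, while the block-level count charges all four words and reports 120. The gap illustrates the kind of inaccuracy that coarse-grained estimation can introduce; the actual magnitude depends on the workload and on the vulnerability definition used.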

Published In

ACM Transactions on Embedded Computing Systems, Volume 16, Issue 4
Special Issue on Secure and Fault-Tolerant Embedded Computing and Regular Papers
November 2017
614 pages
ISSN:1539-9087
EISSN:1558-3465
DOI:10.1145/3092956
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 11 May 2017
Accepted: 01 February 2017
Revised: 01 November 2016
Received: 01 May 2016
Published in TECS Volume 16, Issue 4

Author Tags

  1. Cache
  2. error correction code
  3. parity code
  4. reliability
  5. simulation
  6. soft error
  7. transient fault
  8. vulnerability

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Ministry of Science, ICT, and Future Planning
  • MSIP under the Research Project on High Performance and Scalable Manycore Operating System
  • Basic Science Research Program through the National Research Foundation of Korea (NRF)
  • Next-Generation Information Computing Development Program through the NRF
  • National Science Foundation

Cited By

  • (2024) Designing a Deep Neural Network engine for LLC block reuse prediction to mitigate Soft Error in Multicore. Microelectronics Reliability 156 (115377). DOI: 10.1016/j.microrel.2024.115377. Online publication date: May-2024
  • (2024) ECS an endeavor towards providing similar cache reliability behavior in different programs. Microelectronics Reliability 152 (115295). DOI: 10.1016/j.microrel.2023.115295. Online publication date: Jan-2024
  • (2023) On-Chip Bus Protection against Soft Errors. Electronics 12:22 (4706). DOI: 10.3390/electronics12224706. Online publication date: 19-Nov-2023
  • (2022) Survey of Software-Implemented Soft Error Protection. Electronics 11:3 (456). DOI: 10.3390/electronics11030456. Online publication date: 3-Feb-2022
  • (2022) MLFTCache: Multilevel Fault Tolerance Scheme for Write-Back L2 Cache Under Irradiation. IEEE Transactions on Nuclear Science 69:5 (1182-1192). DOI: 10.1109/TNS.2022.3151805. Online publication date: May-2022
  • (2022) Studying error propagation on application data structure and hardware. The Journal of Supercomputing 78:17 (18691-18724). DOI: 10.1007/s11227-022-04625-x. Online publication date: 1-Nov-2022
  • (2021) Cache Tag Array Fault Tolerance Method Based on Redundancy and Similarity of Adjacent Cache Line Tag Bits. International Conference on Frontiers of Electronics, Information and Computation Technologies (1-8). DOI: 10.1145/3474198.3478212. Online publication date: 21-May-2021
  • (2020) Effect of Cache Run-Time Parameters on the Reliability of Embedded Systems. 2020 CSI/CPSSI International Symposium on Real-Time and Embedded Systems and Technologies (RTEST) (1-6). DOI: 10.1109/RTEST49666.2020.9140080. Online publication date: Jun-2020
  • (2020) Soft errors in DNN accelerators: A comprehensive review. Microelectronics Reliability 115 (113969). DOI: 10.1016/j.microrel.2020.113969. Online publication date: Dec-2020
