research-article

A Survey of Techniques for Modeling and Improving Reliability of Computing Systems

Authors:

Jeffrey S. Vetter

IEEE Transactions on Parallel and Distributed Systems, Volume 27, Issue 4

Pages 1226 - 1238

https://doi.org/10.1109/TPDS.2015.2426179

Published: 01 April 2016

Abstract

Recent trends of aggressive technology scaling have greatly increased the occurrence and impact of faults in computing systems, making reliability a first-order design constraint. Several techniques have been proposed to address this challenge. This paper surveys architectural techniques for improving the resilience of computing systems. We focus especially on techniques proposed for microarchitectural components such as processor registers, functional units, caches, and main memory. In addition, we discuss techniques proposed for non-volatile memory, GPUs, and 3D-stacked processors. To underscore the similarities and differences among the techniques, we classify them based on their key characteristics. We also review the metrics proposed to quantify the vulnerability of processor structures. We believe that this survey will help researchers, system architects, and processor designers gain insight into techniques for improving the reliability of computing systems.

Published In

IEEE Transactions on Parallel and Distributed Systems, Volume 27, Issue 4

April 2016

312 pages

ISSN: 1045-9219

Copyright © 2015.

Publisher

IEEE Press

Cited By

Gao B, Wang Z, He Z, Luo T, Wong W, Zhou Z (2024). IMI: In-memory Multi-job Inference Acceleration for Large Language Models. Proceedings of the 53rd International Conference on Parallel Processing, pp. 752-761. DOI: 10.1145/3673038.3673053. Online publication date: 12-Aug-2024.
https://dl.acm.org/doi/10.1145/3673038.3673053
Venkatesha S, Parthasarathi R (2024). Survey on Redundancy Based-Fault tolerance methods for Processors and Hardware accelerators - Trends in Quantum Computing, Heterogeneous Systems and Reliability. ACM Computing Surveys, 56:11, pp. 1-76. DOI: 10.1145/3663672. Online publication date: 6-May-2024.
https://dl.acm.org/doi/10.1145/3663672
Perez-Cerrolaza J, Abella J, Borg M, Donzella C, Cerquides J, Cazorla F, Englund C, Tauber M, Nikolakopoulos G, Flores J (2024). Artificial Intelligence for Safety-Critical Systems in Industrial and Transportation Domains: A Survey. ACM Computing Surveys, 56:7, pp. 1-40. DOI: 10.1145/3626314. Online publication date: 9-Apr-2024.
https://dl.acm.org/doi/10.1145/3626314
Somoye I, Mannos T, Dziki B, Plusquellic J (2024). Self-Assertion-Based Countermeasures Within a RISC-V Microprocessor for Coverage of Information Leakage Faults. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 43:6, pp. 1677-1690. DOI: 10.1109/TCAD.2024.3351592. Online publication date: 11-Jan-2024.
https://dl.acm.org/doi/10.1109/TCAD.2024.3351592
Wang J, Zhu J, Fu X, Zang D, Li K, Zhang W (2024). Enhancing Neural Network Reliability: Insights From Hardware/Software Collaboration With Neuron Vulnerability Quantization. IEEE Transactions on Computers, 73:8, pp. 1953-1966. DOI: 10.1109/TC.2024.3398492. Online publication date: 1-Aug-2024.
https://dl.acm.org/doi/10.1109/TC.2024.3398492
Zhao Z, Chen X, Lu Y (2023). Trade-off Mechanism Between Reliability and Performance for Data-flow Soft Error Detection. Journal of Electronic Testing: Theory and Applications, 39:5-6, pp. 583-595. DOI: 10.1007/s10836-023-06087-2. Online publication date: 1-Dec-2023.
https://dl.acm.org/doi/10.1007/s10836-023-06087-2
Perez-Cerrolaza J, Abella J, Kosmidis L, Calderon A, Cazorla F, Flores J (2022). GPU Devices for Safety-Critical Systems: A Survey. ACM Computing Surveys, 55:7, pp. 1-37. DOI: 10.1145/3549526. Online publication date: 15-Dec-2022.
https://dl.acm.org/doi/10.1145/3549526
Topçu B, Öz I (2022). Soft error vulnerability prediction of GPGPU applications. The Journal of Supercomputing, 79:6, pp. 6965-6990. DOI: 10.1007/s11227-022-04933-2. Online publication date: 19-Nov-2022.
https://dl.acm.org/doi/10.1007/s11227-022-04933-2
Kosaian J, Rashmi K, de Supinski B, Hall M, Gamblin T (2021). Arithmetic-intensity-guided fault tolerance for neural network inference on GPUs. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-15. DOI: 10.1145/3458817.3476184. Online publication date: 14-Nov-2021.
https://dl.acm.org/doi/10.1145/3458817.3476184
Öz I, Arslan S (2021). Predicting the Soft Error Vulnerability of Parallel Applications Using Machine Learning. International Journal of Parallel Programming, 49:3, pp. 410-439. DOI: 10.1007/s10766-021-00707-0. Online publication date: 1-Jun-2021.
https://dl.acm.org/doi/10.1007/s10766-021-00707-0
