research-article

A Survey of Techniques for Modeling and Improving Reliability of Computing Systems

Published: 01 April 2016

Abstract

Recent trends of aggressive technology scaling have greatly exacerbated the occurrence and impact of faults in computing systems. This has made ‘reliability’ a first-order design constraint. To address the challenges of reliability, several techniques have been proposed. This paper provides a survey of architectural techniques for improving the resilience of computing systems. We focus especially on techniques proposed for microarchitectural components, such as processor registers, functional units, caches, and main memory. In addition, we discuss techniques proposed for non-volatile memory, GPUs, and 3D-stacked processors. To underscore the similarities and differences among the techniques, we classify them based on their key characteristics. We also review the metrics proposed to quantify the vulnerability of processor structures. We believe that this survey will help researchers, system architects, and processor designers gain insights into techniques for improving the reliability of computing systems.

Published In

IEEE Transactions on Parallel and Distributed Systems, Volume 27, Issue 4, April 2016, 312 pages

Publisher

IEEE Press