research-article
Open access

An Accurate Cross-Layer Approach for Online Architectural Vulnerability Estimation

Published: 17 September 2016
    Abstract

    Processor soft-error rates are projected to increase as feature sizes scale down, necessitating the adoption of reliability-enhancing techniques, but the power and performance overheads of such techniques remain a concern. Dynamic cross-layer techniques are a promising way to improve the cost-effectiveness of resilient systems. As a foundation for building such a system, we propose a cross-layer approach for estimating the architectural vulnerability factor (AVF) of a processor core online, which works by combining information from the software, compiler, and microarchitectural layers at runtime. The hardware layer combines metadata from the software and compiler layers with microarchitectural measurements to estimate AVF online. We describe our design and evaluate it in detail on a set of SPEC CPU2006 applications. We find that our online AVF estimate is highly accurate with respect to a postmortem AVF analysis, with an average absolute error of only 0.46%. Our design incurs negligible performance impact for the SPEC CPU2006 applications and about 1.2% for a Monte Carlo application, requires approximately 1.4% area overhead, and consumes about 3.3% more power on average. We compare our technique against two prior online AVF estimation techniques, one that uses linear regression to estimate AVF and another based on the program and hardware vulnerability factors (PVF-HVF); on average, our approach is more accurate. Our case study of a Monte Carlo simulation shows that our AVF estimate can adapt to the inherent resiliency of the algorithm. Finally, we demonstrate the effectiveness of our approach with a dynamic protection scheme that, for a target normalized soft-error rate (SER) of 10%, limits vulnerability to soft errors while reducing energy consumption by an average of 4.8% compared to enabling simple parity+ECC protection at all times.
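    The quantity being estimated follows the standard AVF definition (Mukherjee et al., MICRO 2003): the fraction of a structure's bit-cycles that hold ACE (architecturally correct execution) state. The Python sketch below illustrates only that arithmetic, plus one plausible way to compute an average absolute error between windowed online estimates and a postmortem reference. It assumes the per-cycle ACE-bit counts are already known; obtaining them at runtime is exactly what the cross-layer metadata described above is for. The function names, window size, and sample numbers are hypothetical and are not the paper's hardware design.

```python
import math
from typing import List, Sequence


def avf(ace_bits_per_cycle: Sequence[int], total_bits: int) -> float:
    """AVF of one hardware structure over a span of cycles.

    AVF = (sum over cycles of resident ACE bits) / (total bits * cycles).
    """
    if total_bits <= 0 or len(ace_bits_per_cycle) == 0:
        raise ValueError("need a positive bit count and at least one cycle")
    if any(b < 0 or b > total_bits for b in ace_bits_per_cycle):
        raise ValueError("ACE-bit count out of range for this structure")
    return math.fsum(ace_bits_per_cycle) / (total_bits * len(ace_bits_per_cycle))


def windowed_avf(ace_bits_per_cycle: Sequence[int], total_bits: int,
                 window: int) -> List[float]:
    """One AVF value per fixed-size window of cycles (the last window may be
    shorter); a hypothetical stand-in for periodic online estimates."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [avf(ace_bits_per_cycle[i:i + window], total_bits)
            for i in range(0, len(ace_bits_per_cycle), window)]


def mean_absolute_error(estimates: Sequence[float],
                        reference: Sequence[float]) -> float:
    """Average |estimate - reference| over matching windows."""
    if len(estimates) != len(reference) or len(estimates) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    return math.fsum(abs(e - r) for e, r in zip(estimates, reference)) / len(estimates)


# Hypothetical 100-bit structure observed for 4 cycles, split into 2-cycle windows.
trace = [50, 50, 20, 20]
postmortem = windowed_avf(trace, total_bits=100, window=2)  # [0.5, 0.2]
online = [0.52, 0.19]                                        # made-up online estimates
print(avf(trace, total_bits=100))                            # prints 0.35
print(mean_absolute_error(online, postmortem))
```

    Windowing is what makes the online case interesting: a single whole-run AVF hides the phase-to-phase variation that a dynamic protection scheme would react to, so the comparison is made per window rather than once per run.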




    Information

    Published In

    ACM Transactions on Architecture and Code Optimization, Volume 13, Issue 3
    September 2016
    207 pages
    ISSN:1544-3566
    EISSN:1544-3973
    DOI:10.1145/2988523
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 17 September 2016
    Accepted: 01 July 2016
    Revised: 01 July 2016
    Received: 01 December 2015
    Published in TACO Volume 13, Issue 3


    Author Tags

    1. Architectural vulnerability factor
    2. Cross-layer reliability

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (last 12 months): 34
    • Downloads (last 6 weeks): 6
    Reflects downloads up to 26 Jul 2024


    Citations

    Cited By

    • (2021) Methods for Improving the Reliability of Intelligent Semiconductor. 2021 IEEE International Conference on Consumer Electronics-Asia (ICCE-Asia), 1-4. DOI: 10.1109/ICCE-Asia53811.2021.9641987. Online publication date: 1 Nov 2021.
    • (2020) Gem5Panalyzer: A Light-weight tool for Early-stage Architectural Reliability Evaluation & Prediction. 2020 IEEE 63rd International Midwest Symposium on Circuits and Systems (MWSCAS), 482-485. DOI: 10.1109/MWSCAS48704.2020.9184536. Online publication date: Aug 2020.
    • (2019) Minotaur. Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, 1087-1103. DOI: 10.1145/3297858.3304050. Online publication date: 4 Apr 2019.
    • (2018) PRISM. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, 1-14. DOI: 10.5555/3291656.3291748. Online publication date: 11 Nov 2018.
    • (2018) PRISM. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, 1-14. DOI: 10.1109/SC.2018.00072. Online publication date: 11 Nov 2018.
    • (2017) Characterizing the impact of soft errors across microarchitectural structures and implications for predictability. 2017 IEEE International Symposium on Workload Characterization (IISWC), 250-260. DOI: 10.1109/IISWC.2017.8167782. Online publication date: Oct 2017.
