DOI: 10.1145/2694344.2694354
research-article

CommGuard: Mitigating Communication Errors in Error-Prone Parallel Execution

Published: 14 March 2015
  • Abstract

    As semiconductor technology scales toward ever-smaller transistor sizes, hardware fault rates are increasing. Since important application classes (e.g., multimedia and streaming workloads) are data-error-tolerant, recent research has proposed techniques that seek to save energy or improve yield by exploiting error tolerance at the architecture/microarchitecture level. Even seemingly error-tolerant applications, however, will crash or hang due to control-flow or memory-addressing errors. In parallel computation, errors involving inter-thread communication can have equally catastrophic effects. Our work explores techniques that mitigate the impact of potentially catastrophic errors in parallel computation, while still garnering power, cost, or yield benefits from data error tolerance. Our proposed CommGuard solution uses FSM-based checkers to pad and discard data in order to maintain semantic alignment between program control flow and the data communicated between processors. CommGuard techniques are low-overhead, and they exploit application information already provided by some parallel programming languages (e.g., StreamIt). By converting potentially catastrophic communication errors into potentially tolerable data errors, CommGuard allows important streaming applications like JPEG and MP3 decoding to execute without crashing and to sustain good output quality, even for errors as frequent as every 500 μs.
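
    The pad-and-discard idea can be illustrated with a small sketch. The abstract does not describe the checker's internals, so everything below is an assumption made for illustration: the HEADER marker, the three states, the fixed items_per_frame count, and the realign function are hypothetical and are not taken from the paper. The sketch assumes the producer marks frame boundaries and that the expected item count per frame is known in advance, as it could be from a language with declared communication rates.

```python
from enum import Enum, auto

# Hypothetical frame-boundary marker (an assumption, not from the paper).
HEADER = object()


class State(Enum):
    WAIT_HEADER = auto()  # no frame boundary seen yet
    FILLING = auto()      # frame open, still accepting items
    FULL = auto()         # frame has its expected count; surplus is dropped


def realign(stream, items_per_frame, pad_value=0):
    """Yield frames of exactly items_per_frame items (items_per_frame >= 1).

    Short frames are padded with pad_value; surplus items, and items that
    arrive before the first header, are discarded.
    """
    state = State.WAIT_HEADER
    frame = []
    for token in stream:
        if token is HEADER:
            if state is not State.WAIT_HEADER:
                # Close the previous frame, padding it if it came up short.
                yield frame + [pad_value] * (items_per_frame - len(frame))
            frame = []
            state = State.FILLING
        elif state is State.FILLING:
            frame.append(token)
            if len(frame) == items_per_frame:
                state = State.FULL
        # In WAIT_HEADER or FULL the item is silently discarded.
    if state is not State.WAIT_HEADER:
        yield frame + [pad_value] * (items_per_frame - len(frame))


if __name__ == "__main__":
    # Second frame is two items short; third frame has two surplus items.
    faulty = [HEADER, 1, 2, 3, HEADER, 4, HEADER, 5, 6, 7, 8, 9]
    for f in realign(faulty, items_per_frame=3):
        print(f)
```

    In this sketch a miscounted frame costs a few wrong values inside that frame, but every later frame still starts where the consumer expects it, which is the sense in which a communication error is turned into a data error.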

      Published In

      ASPLOS '15: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems
      March 2015
      720 pages
      ISBN: 9781450328357
      DOI: 10.1145/2694344
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. application-level error tolerance
      2. high-level programming languages
      3. parallel computing

      Qualifiers

      • Research-article

      Conference

      ASPLOS '15

      Acceptance Rates

      ASPLOS '15 Paper Acceptance Rate 48 of 287 submissions, 17%;
      Overall Acceptance Rate 535 of 2,713 submissions, 20%

      Article Metrics

      • Downloads (last 12 months): 6
      • Downloads (last 6 weeks): 0
      Reflects downloads up to 09 Aug 2024.

      Cited By

      • (2020) Exploiting Errors for Efficiency. ACM Computing Surveys 53:3 (1-39). DOI: 10.1145/3394898. Online publication date: 12-Jun-2020
      • (2018) Approximate Communication. ACM Computing Surveys 51:1 (1-32). DOI: 10.1145/3145812. Online publication date: 10-Jan-2018
      • (2017) PPU. ACM Journal on Emerging Technologies in Computing Systems 13:3 (1-29). DOI: 10.1145/2990502. Online publication date: 14-Apr-2017
      • (2016) Asymmetrically reliable caches for multicore architectures under performance and energy constraints. Cluster Computing 19:4 (1819-1833). DOI: 10.1007/s10586-016-0641-2. Online publication date: 1-Dec-2016
      • (2016) Protecting Code Regions on Asymmetrically Reliable Caches. Proceedings of the 29th International Conference on Architecture of Computing Systems -- ARCS 2016 - Volume 9637 (375-387). DOI: 10.1007/978-3-319-30695-7_28. Online publication date: 4-Apr-2016
      • (2015) Error-Tolerant Processors. Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (286-293). DOI: 10.5555/2840819.2840860. Online publication date: 2-Nov-2015
      • (2015) Error-tolerant processors: Formal specification and verification. 2015 IEEE/ACM International Conference on Computer-Aided Design (ICCAD) (286-293). DOI: 10.1109/ICCAD.2015.7372582. Online publication date: Nov-2015
