DOI:10.1145/3307650.3322215
research-article
Open access

AxMemo: hardware-compiler co-design for approximate code memoization

Published: 22 June 2019

Abstract

Historically, continuous improvements in general-purpose processors have fueled the economic success and growth of the IT industry. However, the diminishing benefits from transistor scaling and conventional optimization techniques necessitate moving beyond common practices. Approximate computing is one such unconventional technique that has shown promise in pushing the boundaries of general-purpose processing. This paper sets out to employ approximation for processors that are commonly used in cyber-physical domains and may become building blocks of the Internet of Things. To this end, we propose AxMemo to exploit the computation redundancy that stems from data similarity in the inputs of code blocks. Such input behavior is prevalent in cyber-physical systems, as they deal with real-world data that naturally harbors redundancy. Therefore, in contrast to existing memoization techniques that replace costly floating-point arithmetic operations with a limited number of inputs, AxMemo focuses on memoizing blocks of code with potentially many inputs. As such, AxMemo aims to replace long sequences of instructions with a few hash and lookup operations. By reducing the number of dynamic instructions, AxMemo alleviates the von Neumann and execution overheads of passing instructions through the processor pipeline altogether. The challenge AxMemo faces is to provide low-cost hashing mechanisms that can generate a nearly unique signature for each multi-input combination. To address this challenge, we develop a novel use of Cyclic Redundancy Checking (CRC) to hash the inputs. To increase the lookup table hit rate, AxMemo employs a two-level memoization lookup, which utilizes a small dedicated SRAM and spare storage in the last-level cache. These solutions enable AxMemo to efficiently memoize relatively large code regions with variable input sizes and types using the same underlying hardware.
Our experiments show that AxMemo offers a 2.64× speedup and a 2.58× energy reduction with a mere 0.2% quality loss, averaged across ten benchmarks. These benefits come with an area overhead of just 2.1%.
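
The hash-and-lookup idea described in the abstract can be illustrated in software. What follows is a minimal sketch under stated assumptions, not the paper's hardware design: it assumes CRC-32 from Python's zlib module, float32 inputs, and approximation by zeroing low-order mantissa bits before hashing (the abstract does not say how input similarity is exploited). The names signature, TwoLevelMemo, and drop_bits are hypothetical, and two OrderedDict tables merely stand in for the small dedicated SRAM and the spare last-level cache storage.

```python
import struct
import zlib
from collections import OrderedDict


def signature(inputs, drop_bits=12):
    """Hash a sequence of floats into one 32-bit CRC signature.

    Assumption (not from the paper): nearby inputs are made identical by
    zeroing the low `drop_bits` bits of each float32 bit pattern.
    """
    mask = (0xFFFFFFFF << drop_bits) & 0xFFFFFFFF
    crc = 0
    for x in inputs:
        bits = struct.unpack("<I", struct.pack("<f", x))[0] & mask
        # Feed the running CRC back in so all inputs fold into one signature.
        crc = zlib.crc32(struct.pack("<I", bits), crc)
    return crc


class TwoLevelMemo:
    """Small first-level table backed by a larger second-level table (LRU)."""

    def __init__(self, l1_entries=64, l2_entries=4096):
        self.l1, self.l2 = OrderedDict(), OrderedDict()
        self.l1_entries, self.l2_entries = l1_entries, l2_entries
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _insert(table, capacity, key, value):
        """Insert as most recent; return the evicted (key, value) or None."""
        table[key] = value
        table.move_to_end(key)
        if len(table) > capacity:
            return table.popitem(last=False)
        return None

    def _fill(self, key, value):
        # A first-level victim is demoted to the second level.
        victim = self._insert(self.l1, self.l1_entries, key, value)
        if victim is not None:
            self._insert(self.l2, self.l2_entries, *victim)

    def _lookup(self, key):
        if key in self.l1:
            self.l1.move_to_end(key)
            return True, self.l1[key]
        if key in self.l2:
            value = self.l2.pop(key)
            self._fill(key, value)  # promote on a second-level hit
            return True, value
        return False, None

    def call(self, fn, *inputs):
        """Return a memoized result for similar inputs, else run fn."""
        key = signature(inputs)
        hit, value = self._lookup(key)
        if hit:
            self.hits += 1
            return value
        self.misses += 1
        value = fn(*inputs)
        self._fill(key, value)
        return value


if __name__ == "__main__":
    memo = TwoLevelMemo(l1_entries=2, l2_entries=4)
    print(memo.call(lambda a, b: a * b, 1.0, 2.0))
    print(memo.call(lambda a, b: a * b, 1.0000001, 2.0))  # served from the table
    print(memo.hits, memo.misses)
```

In this sketch a hit replaces the whole call with one hash and one table probe, which mirrors the trade the abstract describes; the price is that the second call returns the result computed for the first, slightly different, input. Larger drop_bits values raise the hit rate at the cost of output quality.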

    Published In

    ISCA '19: Proceedings of the 46th International Symposium on Computer Architecture
    June 2019
    849 pages
    ISBN:9781450366694
    DOI:10.1145/3307650
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Sponsors

    In-Cooperation

    • IEEE-CS\DATC: IEEE Computer Society

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. approximate computing
    2. hardware-software co-design
    3. memoization

    Qualifiers

    • Research-article

    Conference

    ISCA '19

    Acceptance Rates

    ISCA '19 Paper Acceptance Rate 62 of 365 submissions, 17%;
    Overall Acceptance Rate 543 of 3,203 submissions, 17%

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months): 126
    • Downloads (Last 6 weeks): 18
    Reflects downloads up to 25 Dec 2024

    Citations

    Cited By

    • (2024) Iterative construction of energy and quality-efficient approximate multipliers utilizing lower bit-length counterparts. The Journal of Supercomputing 80:13 (19210-19247). DOI:10.1007/s11227-024-06212-8. Online publication date: 26-May-2024
    • (2022) Data and Computation Reuse in CNNs Using Memristor TCAMs. ACM Transactions on Reconfigurable Technology and Systems 16:1 (1-24). DOI:10.1145/3549536. Online publication date: 22-Dec-2022
    • (2022) Tensorox: Accelerating GPU Applications via Neural Approximation on Unused Tensor Cores. IEEE Transactions on Parallel and Distributed Systems 33:2 (429-443). DOI:10.1109/TPDS.2021.3093239. Online publication date: 1-Feb-2022
    • (2022) Approximate function memoization. Concurrency and Computation: Practice and Experience 34:23. DOI:10.1002/cpe.7204. Online publication date: 20-Jul-2022
    • (2021) MD-HM. Proceedings of the 35th ACM International Conference on Supercomputing (215-226). DOI:10.1145/3447818.3460365. Online publication date: 3-Jun-2021
    • (2021) MERCI: efficient embedding reduction on commodity hardware via sub-query memoization. Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (302-313). DOI:10.1145/3445814.3446717. Online publication date: 19-Apr-2021
    • (2021) Stochastic Iterative Approximation: Software/hardware techniques for adjusting aggressiveness of approximation. 2021 IEEE 39th International Conference on Computer Design (ICCD) (74-82). DOI:10.1109/ICCD53106.2021.00023. Online publication date: Oct-2021
