AIM: Energy-Efficient Aggregation Inside the Memory Hierarchy

Published: 25 October 2016

    Abstract

    In this article, we propose Aggregation-in-Memory (AIM), a new processing-in-memory system designed for energy efficiency and near-term adoption. To perform aggregation efficiently, we implement simple aggregation operations in main memory and develop a locality-adaptive host architecture for in-memory aggregation, called cache-conscious aggregation. With these two components, AIM executes each aggregation at the most energy-efficient location among all levels of the memory hierarchy. Moreover, AIM requires only minimal changes to existing sequential programming models and provides a fully automated compiler toolchain, thereby allowing unmodified legacy software to use AIM. Evaluations show that AIM greatly improves both the energy efficiency of main memory and system performance.
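
The abstract does not spell out what an aggregation operation looks like in code. As a minimal sketch, assuming that "aggregation" means a read-modify-write update of the form `out[i] = op(out[i], v)` with a commutative and associative `op` (an assumption, not something stated above), a sequential loop of this kind could look as follows. The function name `aggregate` and its parameters are hypothetical and chosen only for illustration; they are not part of AIM.

```python
from typing import Callable, List, Sequence


def aggregate(
    targets: Sequence[int],
    values: Sequence[int],
    size: int,
    op: Callable[[int, int], int] = lambda acc, v: acc + v,
    identity: int = 0,
) -> List[int]:
    """Apply out[targets[i]] = op(out[targets[i]], values[i]) for every i.

    Illustrative only: `op` is assumed to be commutative and associative,
    so the final result does not depend on the order of the updates.
    """
    out = [identity] * size
    for idx, val in zip(targets, values):
        # Read-modify-write on a single element; several updates may hit
        # the same index, which is what makes this an aggregation.
        out[idx] = op(out[idx], val)
    return out


if __name__ == "__main__":
    targets = [0, 2, 0, 1, 2]
    values = [5, 1, 3, 2, 4]

    in_order = aggregate(targets, values, size=3)
    reversed_order = aggregate(targets[::-1], values[::-1], size=3)

    print(in_order)
    # Same result regardless of update order (hypothetical data).
    assert in_order == reversed_order
```

Under this assumption, order independence is the property that would allow individual updates to be applied at different places in the memory hierarchy without changing the final result. How AIM selects the location for each update is not modeled in this sketch.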

      Published In

      ACM Transactions on Architecture and Code Optimization, Volume 13, Issue 4
      December 2016
      648 pages
      ISSN:1544-3566
      EISSN:1544-3973
      DOI:10.1145/3012405
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 25 October 2016
      Accepted: 01 August 2016
      Revised: 01 July 2016
      Received: 01 May 2016
      Published in TACO Volume 13, Issue 4

      Author Tags

      1. processing-in-memory
      2. aggregation
      3. locality-adaptive execution
      4. near-data processing

      Qualifiers

      • Research-article
      • Research
      • Refereed

      Funding Sources

      • Research Resettlement Fund for the new faculty of Seoul National University and the IT R&D program of MKE/KEIT
      • Embedded System Software for New Memory based Smart Devices

      Article Metrics

      • Downloads (last 12 months): 68
      • Downloads (last 6 weeks): 7
      Reflects downloads up to 26 Jul 2024

      Cited By

      • (2022) Only Buffer When You Need To: Reducing On-chip GPU Traffic with Reconfigurable Local Atomic Buffers. 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 676-691. DOI: 10.1109/HPCA53966.2022.00056. Online publication date: May 2022.
      • (2021) Computing En-Route for Near-Data Processing. IEEE Transactions on Computers 70:6, 906-921. DOI: 10.1109/TC.2021.3063378. Online publication date: 1 June 2021.
      • (2019) MAPIM: Mat Parallelism for High Performance Processing in Non-volatile Memory Architecture. 20th International Symposium on Quality Electronic Design (ISQED), 145-150. DOI: 10.1109/ISQED.2019.8697441. Online publication date: March 2019.
      • (2019) Active-Routing: Compute on the Way for Near-Data Processing. 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), 674-686. DOI: 10.1109/HPCA.2019.00018. Online publication date: March 2019.
      • (2018) SCOPE. Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture, 696-709. DOI: 10.1109/MICRO.2018.00062. Online publication date: 20 October 2018.
      • (2017) DRISA. Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, 288-301. DOI: 10.1145/3123939.3123977. Online publication date: 14 October 2017.
