AIM: Energy-Efficient Aggregation Inside the Memory Hierarchy

Authors:

Kiyoung Choi

ACM Transactions on Architecture and Code Optimization (TACO), Volume 13, Issue 4

Article No.: 34, Pages 1 - 24

https://doi.org/10.1145/2994149

Published: 25 October 2016

Abstract

In this article, we propose Aggregation-in-Memory (AIM), a new processing-in-memory system designed for energy efficiency and near-term adoption. To perform aggregation efficiently, we implement simple aggregation operations in main memory and develop a locality-adaptive host architecture for in-memory aggregation, called cache-conscious aggregation. With this combination, AIM executes each aggregation at the most energy-efficient location among all levels of the memory hierarchy. Moreover, AIM requires only minimal changes to existing sequential programming models and provides a fully automated compiler toolchain, thereby allowing unmodified legacy software to use AIM. Evaluations show that AIM greatly improves both the energy efficiency of main memory and system performance.
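
The abstract does not enumerate the aggregation operations it has in mind. As a rough illustration only, the sketch below assumes that "aggregation" refers to commutative, associative read-modify-write updates such as a scatter-add, written as an ordinary sequential loop. The function name `scatter_add` and its parameters are hypothetical and are not taken from the article; the code shows the software-visible access pattern, not how AIM executes it in hardware.

```python
def scatter_add(values, indices, num_bins):
    """Sequential aggregation sketch: out[indices[i]] += values[i].

    Hypothetical helper for illustration; it is not part of AIM.
    """
    out = [0] * num_bins
    for idx, val in zip(indices, values):
        # Read-modify-write on a data-dependent location. Because addition
        # is commutative and associative, the updates could in principle be
        # applied in any order, and at any level of the memory hierarchy.
        out[idx] += val
    return out


result = scatter_add(values=[1, 2, 3], indices=[0, 2, 0], num_bins=3)
print(result)
```

Each update touches one element of `out` selected by a data-dependent index, so the locality of the loop depends on the input. That is the kind of pattern where choosing between updating in cache and updating in memory may matter.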

Index Terms

AIM: Energy-Efficient Aggregation Inside the Memory Hierarchy
1. Hardware
  1. Integrated circuits
    1. Semiconductor memory
      1. Dynamic memory

Published In

ACM Transactions on Architecture and Code Optimization Volume 13, Issue 4

December 2016

648 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/3012405

Editor:
Koen De Bosschere
Ghent University

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 25 October 2016

Accepted: 01 August 2016

Revised: 01 July 2016

Received: 01 May 2016

Published in TACO Volume 13, Issue 4

Qualifiers

Research-article
Research
Refereed

Funding Sources

Research Resettlement Fund for the new faculty of Seoul National University and the IT R&D program of MKE/KEIT
Embedded System Software for New Memory based Smart Devices
