research-article

Understanding object-level memory access patterns across the spectrum

Authors:

Nosayba El-Sayed,

Sudharshan S. Vazhkudai,

Daniel Sanchez

SC '17: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

Article No.: 25, Pages 1 - 12

https://doi.org/10.1145/3126908.3126917

Published: 12 November 2017

Abstract

Memory accesses limit the performance and scalability of countless applications. Many design and optimization efforts will benefit from an in-depth understanding of memory access behavior, which is not offered by extant access tracing and profiling methods.

In this paper, we adopt a holistic memory access profiling approach to enable a better understanding of program-system memory interactions. We have developed a two-pass tool that combines fast online profiling with slow offline profiling, and used it to profile, at the variable/object level, a collection of 38 representative applications spanning major domains (HPC, personal computing, data analytics, AI, graph processing, and datacenter workloads) at varying problem sizes. We have performed detailed result analysis and code examination. Our findings provide new insights into application memory behavior, including per-object access patterns, adoption of data structures, and memory-access changes at different problem sizes. We find that scientific computation applications exhibit distinct behaviors compared to datacenter workloads, motivating separate memory system designs and optimizations.
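
To make "variable/object level" profiling concrete, the following is a minimal sketch of attributing raw memory addresses to the objects that contain them. It is an illustration only, not the authors' tool: the function name `attribute_accesses`, the `(name, base, size)` allocation records, and the flat list of accessed addresses are all assumptions made for this example, and live allocations are assumed not to overlap.

```python
from bisect import bisect_right
from collections import Counter


def attribute_accesses(allocations, accesses):
    """Count accesses per object.

    allocations: iterable of (name, base_address, size_in_bytes) tuples
                 (hypothetical record format; ranges assumed non-overlapping).
    accesses:    iterable of accessed addresses (integers).
    """
    # Sort objects by base address so each lookup is a binary search.
    ranges = sorted((base, base + size, name) for name, base, size in allocations)
    starts = [start for start, _end, _name in ranges]

    counts = Counter()
    for addr in accesses:
        # Index of the last object whose base address is <= addr.
        i = bisect_right(starts, addr) - 1
        if i >= 0 and addr < ranges[i][1]:
            counts[ranges[i][2]] += 1
        else:
            # Address falls outside every known object (e.g. stack or untracked data).
            counts["<unattributed>"] += 1
    return counts


if __name__ == "__main__":
    allocs = [("grid", 0x1000, 0x100), ("index", 0x2000, 0x40)]
    trace = [0x1000, 0x10FF, 0x1100, 0x2010, 0x0FFF]
    print(attribute_accesses(allocs, trace))
```

In this sketch, per-object counts are the simplest possible summary; a real profiler would also have to track allocation lifetimes and access types, which are omitted here.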

Published In

SC '17: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

November 2017

801 pages

ISBN:9781450351140

DOI:10.1145/3126908

General Chair:
Bernd Mohr, Jülich Supercomputing Center, Jülich, Germany

Program Chair:
Padma Raghavan, Vanderbilt University, Nashville, TN

Copyright © 2017 ACM.

Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Sponsors

SIGHPC: ACM Special Interest Group on High Performance Computing

In-Cooperation

IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Qatar Foundation
National Key R&D Program of China
National Natural Science Foundation of China
National Research Foundation of Korea (NRF) by the Korea Government (MISP)

Conference

SC '17

Sponsor:

SIGHPC

SC '17: The International Conference for High Performance Computing, Networking, Storage and Analysis

November 12 - 17, 2017

Denver, Colorado

Acceptance Rates

SC '17 Paper Acceptance Rate 61 of 327 submissions, 19%;

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
