DOI: 10.1145/3126908.3126917
research-article

Understanding object-level memory access patterns across the spectrum

Published: 12 November 2017

Abstract

Memory accesses limit the performance and scalability of countless applications. Many design and optimization efforts will benefit from an in-depth understanding of memory access behavior, which is not offered by extant access tracing and profiling methods.
In this paper, we adopt a holistic memory access profiling approach to enable a better understanding of program-system memory interactions. We have developed a two-pass tool adopting fast online and slow offline profiling, with which we have profiled, at the variable/object level, a collection of 38 representative applications spanning major domains (HPC, personal computing, data analytics, AI, graph processing, and datacenter workloads), at varying problem sizes. We have performed detailed result analysis and code examination. Our findings provide new insights into application memory behavior, including insights on per-object access patterns, adoption of data structures, and memory-access changes at different problem sizes. We find that scientific computation applications exhibit distinct behaviors compared to datacenter workloads, motivating separate memory system design/optimizations.
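As a rough illustration of what profiling "at the variable/object level" involves, the sketch below attributes raw access addresses to the objects that contain them. It is not the authors' tool. The allocation records, the trace, and the names `ALLOCATIONS`, `TRACE`, `attribute`, and `per_object_counts` are all hypothetical, and the sketch assumes allocations do not overlap.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical allocation records: (start_address, size_in_bytes, object_name).
# Assumption: the address ranges do not overlap.
ALLOCATIONS = [
    (0x1000, 256, "grid"),
    (0x2000, 64, "index"),
    (0x3000, 1024, "buffer"),
]


def build_index(allocations):
    """Sort allocations by start address so lookups can use binary search."""
    ordered = sorted(allocations)
    starts = [start for start, _, _ in ordered]
    return starts, ordered


def attribute(address, starts, ordered):
    """Return the name of the object containing `address`, or None."""
    pos = bisect_right(starts, address) - 1
    if pos < 0:
        return None  # address lies below every recorded allocation
    start, size, name = ordered[pos]
    return name if address < start + size else None


def per_object_counts(trace, allocations):
    """Count accesses per object; unmatched addresses share one bucket."""
    starts, ordered = build_index(allocations)
    counts = Counter()
    for address in trace:
        counts[attribute(address, starts, ordered) or "<unattributed>"] += 1
    return counts


# Hypothetical access trace (addresses only).
TRACE = [0x1000, 0x1008, 0x2004, 0x10FF, 0x1100, 0x3000, 0x0FFF]
COUNTS = per_object_counts(TRACE, ALLOCATIONS)
```

Sorting the allocations once and using binary search keeps each lookup logarithmic in the number of live objects, which matters when a trace holds many more accesses than allocations.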

Information & Contributors

Information

Published In

SC '17: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2017
801 pages
ISBN:9781450351140
DOI:10.1145/3126908
  • General Chair: Bernd Mohr
  • Program Chair: Padma Raghavan
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Sponsors

In-Cooperation

  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data types and structures
  2. memory profiling
  3. object access patterns
  4. tracing
  5. workload characterization

Qualifiers

  • Research-article

Funding Sources

  • Qatar Foundation
  • National Key R&D Program of China
  • National Natural Science Foundation of China
  • National Research Foundation of Korea (NRF) by the Korea Government (MISP)

Conference

SC '17

Acceptance Rates

SC '17 Paper Acceptance Rate 61 of 327 submissions, 19%;
Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (last 12 months): 35
  • Downloads (last 6 weeks): 5
Reflects downloads up to 30 Aug 2024.

Citations

Cited By

  • (2023) FlexPointer: Fast Address Translation Based on Range TLB and Tagged Pointers. ACM Transactions on Architecture and Code Optimization 20:2 (1-24). DOI: 10.1145/3579854. Online publication date: 1-Mar-2023
  • (2023) Object Fingerprint Cache for Heterogeneous Memory System. IEEE Transactions on Computers 72:9 (2496-2507). DOI: 10.1109/TC.2023.3251852. Online publication date: 1-Sep-2023
  • (2023) COMPAD: A heterogeneous cache-scratchpad CPU architecture with data layout compaction for embedded loop-dominated applications. Journal of Systems Architecture 145 (103022). DOI: 10.1016/j.sysarc.2023.103022. Online publication date: Dec-2023
  • (2023) Analyzing Resource Utilization in an HPC System: A Case Study of NERSC’s Perlmutter. High Performance Computing (297-316). DOI: 10.1007/978-3-031-32041-5_16. Online publication date: 10-May-2023
  • (2022) Software-defined address mapping: a case on 3D memory. Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (70-83). DOI: 10.1145/3503222.3507774. Online publication date: 28-Feb-2022
  • (2022) Software Hint-Driven Data Management for Hybrid Memory in Mobile Systems. ACM Transactions on Embedded Computing Systems 21:1 (1-18). DOI: 10.1145/3494536. Online publication date: 14-Jan-2022
  • (2022) Memory Scaling of Cloud-Based Big Data Systems: A Hybrid Approach. IEEE Transactions on Big Data 8:5 (1259-1272). DOI: 10.1109/TBDATA.2020.3035522. Online publication date: 1-Oct-2022
  • (2021) A Holistic View of Memory Utilization on HPC Systems: Current and Future Trends. Proceedings of the International Symposium on Memory Systems (1-11). DOI: 10.1145/3488423.3519336. Online publication date: 27-Sep-2021
  • (2021) OpenMem: Hardware/Software Cooperative Management for Mobile Memory System. 2021 58th ACM/IEEE Design Automation Conference (DAC) (109-114). DOI: 10.1109/DAC18074.2021.9586186. Online publication date: 5-Dec-2021
  • (2020) ScaleML: Machine Learning based Heap Memory Object Scaling Prediction. 2020 9th Non-Volatile Memory Systems and Applications Symposium (NVMSA) (1-6). DOI: 10.1109/NVMSA51238.2020.9188162. Online publication date: Aug-2020
