MemPerf: Profiling Allocator-Induced Performance Slowdowns

Authors:

Steven (Jiaxun) Tang,

Guangming Zeng,

Tongping Liu

Proceedings of the ACM on Programming Languages, Volume 7, Issue OOPSLA2

Article No.: 272, Pages 1418 - 1441

https://doi.org/10.1145/3622848

Published: 16 October 2023

Abstract

The memory allocator plays a key role in application performance, but no existing profiler can pinpoint performance slowdowns caused by a memory allocator. Consequently, programmers may spend time optimizing application code incorrectly or unnecessarily, achieving little or no performance improvement. This paper presents MemPerf, the first profiler that identifies allocator-induced performance slowdowns without comparing against another allocator. Based on the key observation that an allocator may affect the whole life cycle of heap objects, including the accesses (or uses) of these objects, MemPerf proposes a life-cycle-based detection that separately identifies slowdowns caused by slow memory management operations and by slow memory accesses. For the former, MemPerf proposes thread-aware and type-aware performance modeling to identify slow management operations. For the latter, MemPerf uses a top-down approach to identify all possible reasons for slow memory accesses introduced by the allocator, mainly cache and TLB misses, and further proposes a unified method to identify them correctly and efficiently. In our extensive evaluation, MemPerf correctly reports 98% of medium and large allocator-induced slowdowns (larger than 5%) without reporting any false positives. MemPerf also pinpoints multiple known and unknown design issues in widely used allocators.

Index Terms

1. Software and its engineering
  1. Software organization and properties
    1. Extra-functional properties
      1. Software performance


Published In

Proceedings of the ACM on Programming Languages Volume 7, Issue OOPSLA2

October 2023

2250 pages

EISSN:2475-1421

DOI:10.1145/3554312

Editor:
Michael Hicks
Amazon, USA

Copyright © 2023 Owner/Author.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 October 2023

Published in PACMPL Volume 7, Issue OOPSLA2

Qualifiers

Research-article

Funding Sources

National Science Foundation
