research-article
Open access

MemPerf: Profiling Allocator-Induced Performance Slowdowns

Published: 16 October 2023

Abstract

The memory allocator plays a key role in application performance, but no existing profiler can pinpoint performance slowdowns caused by a memory allocator. Consequently, programmers may spend time optimizing application code incorrectly or unnecessarily, achieving little or no performance improvement. This paper presents MemPerf, the first profiler that identifies allocator-induced performance slowdowns without comparing against another allocator. Based on the key observation that an allocator may affect the whole life cycle of heap objects, including accesses to (or uses of) these objects, MemPerf proposes a life-cycle-based detection that separately identifies slowdowns caused by slow memory management operations and by slow memory accesses. For the former, MemPerf proposes thread-aware and type-aware performance modeling to identify slow management operations. For the latter, MemPerf uses a top-down approach to identify all possible reasons for slow memory accesses introduced by the allocator, mainly cache and TLB misses, and further proposes a unified method to identify them correctly and efficiently. In our extensive evaluation, MemPerf correctly reports 98% of medium and large allocator-induced slowdowns (larger than 5%) without reporting any false positives. MemPerf also pinpoints multiple known and unknown design issues in widely used allocators.
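To make the idea of thread-aware and type-aware modeling of management operations more concrete, the following is a minimal sketch and not MemPerf's actual implementation. It assumes that "type" can be approximated by operation kind and object size class, and that a per-group baseline cost is available. The names `Sample`, `SMALL_LIMIT`, `size_class`, and `slow_operations`, the 512-byte boundary, and the 1.5x tolerance are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Sample:
    thread_id: int  # thread that issued the call
    op: str         # operation kind, e.g. "malloc" or "free"
    size: int       # object size in bytes
    cycles: int     # measured cost of this one call


SMALL_LIMIT = 512  # hypothetical boundary between "small" and "large" objects


def size_class(size):
    """Map an object size to a coarse size class (an assumed notion of 'type')."""
    return "small" if size <= SMALL_LIMIT else "large"


def slow_operations(samples, baseline, tolerance=1.5):
    """Group samples by (threading mode, op, size class) and return the groups
    whose mean cost exceeds tolerance * the baseline cost for that group."""
    # Thread-aware: single- and multi-threaded runs get separate baselines.
    threading = "multi" if len({s.thread_id for s in samples}) > 1 else "single"

    groups = defaultdict(list)
    for s in samples:
        groups[(threading, s.op, size_class(s.size))].append(s.cycles)

    flagged = {}
    for key, costs in groups.items():
        expected = baseline.get(key)
        avg = mean(costs)
        # Groups without a baseline are skipped rather than guessed at.
        if expected is not None and avg > tolerance * expected:
            flagged[key] = avg
    return flagged
```

Under these assumptions, a caller would collect one `Sample` per intercepted allocator call and pass the list, together with a baseline dictionary keyed the same way, to `slow_operations`. Keeping separate baselines for single-threaded and multi-threaded runs reflects that lock contention can make the same operation legitimately more expensive under concurrency.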

Published In

Proceedings of the ACM on Programming Languages, Volume 7, Issue OOPSLA2
October 2023
2250 pages
EISSN:2475-1421
DOI:10.1145/3554312
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Memory Allocator
  2. Performance Slowdowns

Qualifiers

  • Research-article
