research-article

Public Access

NUMAlloc: A Faster NUMA Memory Allocator

Authors:

Tongping Liu

ISMM 2023: Proceedings of the 2023 ACM SIGPLAN International Symposium on Memory Management

June 2023

Pages 97 - 110

https://doi.org/10.1145/3591195.3595276

Published: 06 June 2023

Abstract

The NUMA architecture accommodates the hardware trend of an increasing number of CPU cores. It requires the cooperation of memory allocators to achieve good performance for multithreaded applications. Unfortunately, existing allocators do not support the NUMA architecture well. This paper presents NUMAlloc, a novel memory allocator designed for the NUMA architecture. NUMAlloc is centered on binding-based memory management. On top of it, NUMAlloc proposes “origin-aware memory management” to ensure the locality of memory allocations and deallocations, as well as a method called “incremental sharing” to balance the performance benefits and memory overhead of using transparent huge pages. According to our extensive evaluation, NUMAlloc has the best performance among all evaluated allocators, running 15.7% faster than the second-best allocator (mimalloc) and 20.9% faster than the default Linux allocator, with reasonable memory overhead. NUMAlloc also scales to 128 threads and is ready for deployment.
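The idea of returning a freed object to the node it came from can be pictured with a small model. The sketch below is illustrative only and is not taken from the paper: it assumes that each node is bound to one contiguous, equally sized slice of the heap's address range, so that an object's origin node follows from its address by integer division. The names `origin_node`, `NodeHeap`, `Allocator`, and all constants are hypothetical, and the model uses a single fixed object size to stay short.

```python
from dataclasses import dataclass, field

# All constants are hypothetical values chosen for illustration.
NUM_NODES = 4                  # number of NUMA nodes in the model
REGION_SIZE = 1 << 30          # bytes of heap address space per node
HEAP_BASE = 0x100000000000     # start address of the modelled heap
OBJECT_SIZE = 64               # single size class, to keep the sketch short


def origin_node(addr: int) -> int:
    """Return the node whose region contains addr, by arithmetic alone."""
    offset = addr - HEAP_BASE
    if not 0 <= offset < NUM_NODES * REGION_SIZE:
        raise ValueError("address outside the modelled heap")
    return offset // REGION_SIZE


@dataclass
class NodeHeap:
    node: int
    bump: int = 0                              # next unused offset in the region
    free_list: list = field(default_factory=list)

    def allocate(self) -> int:
        # Reuse a freed object from this node first, so reuse stays local.
        if self.free_list:
            return self.free_list.pop()
        if self.bump + OBJECT_SIZE > REGION_SIZE:
            raise MemoryError("node region exhausted")
        addr = HEAP_BASE + self.node * REGION_SIZE + self.bump
        self.bump += OBJECT_SIZE
        return addr


class Allocator:
    def __init__(self) -> None:
        self.heaps = [NodeHeap(n) for n in range(NUM_NODES)]

    def malloc(self, thread_node: int) -> int:
        # Allocate from the heap of the node the calling thread runs on.
        return self.heaps[thread_node].allocate()

    def free(self, addr: int) -> int:
        # The object goes back to its origin node, whichever thread frees it.
        node = origin_node(addr)
        self.heaps[node].free_list.append(addr)
        return node
```

In this model, an object allocated by a thread on node 1 and later freed by a thread on any other node still lands on node 1's free list, so the next allocation on node 1 reuses memory that is local to it. A real allocator would additionally need per-thread caches, multiple size classes, and actual binding of pages to physical nodes, none of which are modelled here.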

Supplementary Material

Auxiliary Archive (pldiws23ismmmain-p64-p-archive.zip)

This is an appendix for the paper titled "NUMAlloc: A Faster NUMA Memory Allocator" submitted to ISMM 2023. The appendix provides a report of the standard deviation of the performance data presented in the paper.


Cited By

Xu Y, Qian R, Wang Y, Huo Q. (2023). iNUMAlloc: Towards Intelligent Memory Allocation for AI Accelerators with NUMA. 2023 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom), pp. 929-936. Online publication date: 21-Dec-2023.
https://doi.org/10.1109/ISPA-BDCloud-SocialCom-SustainCom59178.2023.00155

Index Terms

NUMAlloc: A Faster NUMA Memory Allocator
1. Software and its engineering
  1. Software organization and properties
    1. Contextual software domains
      1. Operating systems
        Memory management

Published In

ISMM 2023: Proceedings of the 2023 ACM SIGPLAN International Symposium on Memory Management

June 2023

175 pages

ISBN:9798400701795

DOI:10.1145/3591195

General Chair: Stephen M. Blackburn (Google, Australia / Australian National University, Australia)

Program Chair: Erez Petrank (Technion, Israel)

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGPLAN: ACM Special Interest Group on Programming Languages

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

NSF (National Science Foundation)

Conference

ISMM '23

Sponsor:

SIGPLAN

ISMM '23: 2023 ACM SIGPLAN International Symposium on Memory Management

June 18, 2023

Orlando, FL, USA

Acceptance Rates

Overall Acceptance Rate 72 of 156 submissions, 46%
