DOI: 10.1145/3591195.3595276

NUMAlloc: A Faster NUMA Memory Allocator

Published: 06 June 2023

    Abstract

    The NUMA architecture accommodates the hardware trend of an increasing number of CPU cores. It requires the cooperation of memory allocators to achieve good performance for multithreaded applications. Unfortunately, existing allocators do not support the NUMA architecture well. This paper presents NUMAlloc, a novel memory allocator designed for the NUMA architecture. NUMAlloc is centered on binding-based memory management. On top of it, NUMAlloc proposes “origin-aware memory management” to ensure the locality of memory allocations and deallocations, as well as a method called “incremental sharing” to balance the performance benefits and memory overhead of using transparent huge pages. According to our extensive evaluation, NUMAlloc has the best performance among all evaluated allocators, running 15.7% faster than the second-best allocator (mimalloc) and 20.9% faster than the default Linux allocator, with reasonable memory overhead. NUMAlloc is also scalable to 128 threads and is ready for deployment.
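
    The abstract names origin-aware memory management without describing its mechanism. The toy model below is only a sketch of one way the idea could work, not NUMAlloc's implementation. It assumes the heap is one contiguous address range split into equal per-node regions, so that an object's origin node follows from its address, and it assumes a single object size. Every name and constant in it (HEAP_BASE, REGION_SIZE, OBJECT_SIZE, origin_node, NodeHeap, ToyAllocator) is hypothetical.

```python
from dataclasses import dataclass, field

# All constants are illustrative assumptions, not values from the paper.
HEAP_BASE = 0x10000000   # hypothetical start address of the heap
REGION_SIZE = 1 << 20    # hypothetical address space bound to each node
OBJECT_SIZE = 64         # toy model: a single size class


def origin_node(addr: int) -> int:
    """Return the node whose region contains addr, by address arithmetic."""
    return (addr - HEAP_BASE) // REGION_SIZE


@dataclass
class NodeHeap:
    node: int
    next_offset: int = 0                          # bump pointer inside the region
    freelist: list = field(default_factory=list)  # freed objects of this node

    def allocate(self) -> int:
        # Prefer a previously freed object that belongs to this node.
        if self.freelist:
            return self.freelist.pop()
        addr = HEAP_BASE + self.node * REGION_SIZE + self.next_offset
        self.next_offset += OBJECT_SIZE
        return addr


class ToyAllocator:
    def __init__(self, num_nodes: int) -> None:
        self.heaps = [NodeHeap(n) for n in range(num_nodes)]

    def malloc(self, thread_node: int) -> int:
        # A thread allocates from the heap of the node it runs on.
        return self.heaps[thread_node].allocate()

    def free(self, addr: int) -> None:
        # Origin-aware step: the object goes back to the node it came from,
        # regardless of which thread (on which node) calls free.
        self.heaps[origin_node(addr)].freelist.append(addr)


if __name__ == "__main__":
    alloc = ToyAllocator(num_nodes=2)
    remote = alloc.malloc(thread_node=1)
    alloc.free(remote)  # imagine this call coming from a thread on node 0
    print(origin_node(remote), alloc.malloc(thread_node=1) == remote)
```

    In this sketch, an object freed by a thread on another node is never handed to a local allocation on that node, so allocations stay within the caller's own region. The model leaves out the actual binding of regions to physical nodes, size classes, and thread safety.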

    Supplementary Material

    Auxiliary Archive (pldiws23ismmmain-p64-p-archive.zip)
    This is an appendix for the paper titled "NUMAlloc: A Faster NUMA Memory Allocator" submitted to ISMM 2023. The appendix provides a report of the standard deviation of the performance data presented in the paper.

    Cited By

    • (2023) iNUMAlloc: Towards Intelligent Memory Allocation for AI Accelerators with NUMA. 2023 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom), 929-936. DOI: 10.1109/ISPA-BDCloud-SocialCom-SustainCom59178.2023.00155. Online publication date: 21-Dec-2023

    Published In

    ISMM 2023: Proceedings of the 2023 ACM SIGPLAN International Symposium on Memory Management
    June 2023
    175 pages
    ISBN:9798400701795
    DOI:10.1145/3591195
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Memory Allocation
    2. NUMA Architecture

    Qualifiers

    • Research-article

    Conference

    ISMM '23
    Acceptance Rates

    Overall Acceptance Rate 72 of 156 submissions, 46%

    Article Metrics

    • Downloads (Last 12 months): 324
    • Downloads (Last 6 weeks): 35
