"MAMA!": a memory allocator for multithreaded architectures

Authors:

Simon Kahan,

Petr Konecny

PPoPP '06: Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming

Pages 178 - 186

https://doi.org/10.1145/1122971.1122999

Published: 29 March 2006

Abstract

While the high-performance computing world is dominated by distributed memory computer systems, applications that require random access into large shared data structures continue to motivate development of ever larger shared-memory parallel computers such as Cray's MTA and SGI's Altix systems.

To support scalable application performance on such architectures, the memory allocator must be able to satisfy requests at a rate proportional to system size. For example, a 40-processor Cray MTA-2 can experience over 5000 concurrent requests, one from each of its 128 streams per processor. Cray's Eldorado, to be built upon the same network as Sandia's 10,000-processor Red Storm system, will sport thousands of multithreaded processors, leading to hundreds of thousands of concurrent requests.

In this paper, we present MAMA, a scalable shared-memory allocator designed to service any rate of concurrent requests. MAMA is distinguished from prior work on shared-memory allocators in that it employs software combining to aggregate requests serviced by a single heap structure: Hoard and MTA malloc necessitate repetition of the underlying heap data structures in proportion to processor count. Unlike Hoard, MAMA does not exploit processor-local data structures, limiting its applicability today to systems that sustain high utilization in the presence of global references, such as Cray's MTA systems. We believe MAMA's relevance to other shared-memory systems will grow as they become increasingly multithreaded and, consequently, more tolerant of references to non-local memory.

We show not only that MAMA scales on Cray MTA systems, but also that it delivers absolute performance competitive with allocators employing heap repetition. In addition, we demonstrate that performance of repetition-based allocators does not scale under heavy loads. We also argue more generally that methods using repetition alone to support concurrency are subject to an impractical tradeoff of scalability against space consumption: when scaled up to meet increasing concurrency demands, repetition-based allocators necessarily house unused space quadratic (p²) in the number of processors p. Hierarchical structure may reduce this to p log p, but in building large-scale shared-memory parallel computers, unused memory more than linear in p is unacceptable. MAMA, in contrast, scales to arbitrarily large systems while consuming memory that increases only linearly with system and request size.

MAMA is of both theoretical interest for its use of novel algorithmic techniques and practical importance as the concurrency upon which shared-memory performance depends continues to grow and multithreaded architectures emerge that are increasingly latency tolerant. While our work is a very recent contribution to memory allocation technology, MAMA already has been incorporated into production as the cornerstone for global memory allocation in Cray's multithreaded systems.
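
The abstract's central idea, combining many concurrent requests so that a single heap structure is touched once per batch rather than once per request, can be illustrated with a small sketch. This is not MAMA's algorithm: the abstract does not describe its heap layout or combining protocol. The sketch assumes a simple bump-pointer heap and simulates the combining step sequentially; the names `BumpHeap`, `fetch_and_add`, and `combined_alloc` are hypothetical and chosen only for illustration.

```python
from itertools import accumulate


class BumpHeap:
    """Hypothetical shared heap: a single 'top' pointer (an assumption,
    not the heap structure used by MAMA)."""

    def __init__(self, base=0):
        self.top = base
        self.updates = 0  # counts how often the shared state is modified

    def fetch_and_add(self, nbytes):
        """Advance the heap top by nbytes and return the previous top."""
        old = self.top
        self.top += nbytes
        self.updates += 1
        return old


def combined_alloc(heap, sizes):
    """Serve a batch of allocation requests with one heap update.

    Combine:    add up the sizes of all pending requests.
    Service:    perform a single fetch-and-add on the shared heap.
    Distribute: give each request the batch base address plus the
                total size of the requests ahead of it in the batch.
    """
    if not sizes:
        return []
    base = heap.fetch_and_add(sum(sizes))
    # Exclusive prefix sums: the offset of request i is sum(sizes[:i]).
    offsets = [0] + list(accumulate(sizes))[:-1]
    return [base + off for off in offsets]
```

For example, with a heap starting at address 1000, `combined_alloc(heap, [16, 32, 8])` hands out three non-overlapping blocks while incrementing `heap.updates` only once. In a real multithreaded setting the combining would be done cooperatively by the requesting threads themselves, which this sequential simulation does not attempt to model.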

Published In

PPoPP '06: Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming

March 2006

258 pages

ISBN: 1595931899

DOI: 10.1145/1122971

General Chair: Josep Torrellas (University of Illinois)

Program Chair: Siddhartha Chatterjee (IBM Research)

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

PPoPP06: ACM SIGPLAN 2006 Symposium on Principles and Practice of Parallel Programming 2006

March 29 - 31, 2006

New York, New York, USA

Acceptance Rates

Overall Acceptance Rate 230 of 1,014 submissions, 23%
