research-article

memif: Towards Programming Heterogeneous Memory Asynchronously

Authors:

Felix Xiaozhu Lin,

Xu Liu

ASPLOS '16: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems

Pages 369 - 383

https://doi.org/10.1145/2872362.2872401

Published: 25 March 2016

Abstract

To harness a heterogeneous memory hierarchy, it is advantageous to let application knowledge guide frequent memory moves, i.e., replicating or migrating virtual memory regions. To this end, we present memif, a protected OS service for asynchronous, hardware-accelerated memory move. Compared to the state of the art (page migration in Linux), memif incurs low overhead and low latency. To achieve this, it not only redefines the semantics of the kernel interface but also overhauls the underlying mechanisms, including request/completion management, race handling, and DMA engine configuration. We implement memif in Linux for a server-class system-on-chip that features heterogeneous memories. Compared to current Linux page migration, memif reduces CPU usage by up to 15% for small pages and by up to 38x for large pages; when continuously serving requests, memif needs no request batching and reduces latency by up to 63%. With a small runtime built atop memif, we improve the throughput of a set of streaming workloads by up to 33%. Overall, memif opens the door to software management of heterogeneous memory.
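
The asynchronous request/completion pattern named in the abstract can be illustrated with a small user-space sketch. This is not memif's interface: the `AsyncMover` class, its `submit` and `wait_completion` methods, and the use of a worker thread in place of a DMA engine are all assumptions made for illustration. It only shows the general idea that a caller submits a move, keeps computing, and collects a completion later.

```python
import queue
import threading


class AsyncMover:
    """Toy stand-in for an asynchronous memory-move service (hypothetical API)."""

    def __init__(self):
        self._requests = queue.Queue()     # submitted, not yet served
        self._completions = queue.Queue()  # served, not yet collected
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, req_id, src, dst):
        """Enqueue a replicate request and return immediately."""
        self._requests.put((req_id, src, dst))

    def wait_completion(self, timeout=5.0):
        """Block until one request has completed and return its id."""
        return self._completions.get(timeout=timeout)

    def close(self):
        """Stop the worker after it drains the requests queued so far."""
        self._requests.put(None)
        self._worker.join()

    def _run(self):
        # Plays the role of the copy engine: serves requests in order.
        while True:
            item = self._requests.get()
            if item is None:
                break
            req_id, src, dst = item
            dst[:] = src  # the "move": replicate src into dst
            self._completions.put(req_id)


def demo():
    mover = AsyncMover()
    slow = bytearray(b"streaming-chunk")  # stands for data in slow memory
    fast = bytearray(len(slow))           # stands for a buffer in fast memory
    mover.submit(1, slow, fast)           # returns without waiting for the copy
    overlapped = sum(range(1000))         # computation overlapped with the move
    done = mover.wait_completion()
    mover.close()
    return done, bytes(fast), overlapped
```

In this sketch the caller never blocks in `submit`, which is the property that lets computation overlap with data movement; a real service would additionally have to handle protection and races with concurrent accesses, which the toy version ignores.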


Index Terms

memif: Towards Programming Heterogeneous Memory Asynchronously
1. Computer systems organization
  1. Architectures
    1. Parallel architectures

Published In

ASPLOS '16: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems

March 2016

824 pages

ISBN:9781450340915

DOI:10.1145/2872362

General Chair: Tom Conte, Georgia Tech, USA
Program Chair: Yuanyuan Zhou, University of California, San Diego, USA

ACM SIGPLAN Notices, Volume 51, Issue 4
ASPLOS '16
April 2016
774 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/2954679
Editor: Andy Gill, University of Kansas, Lawrence, KS

ACM SIGARCH Computer Architecture News, Volume 44, Issue 2
ASPLOS '16
May 2016
774 pages
ISSN:0163-5964
DOI:10.1145/2980024
Editor: Doug DeGroot

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

In-Cooperation

SIGBED: ACM Special Interest Group on Embedded Systems

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

ASPLOS '16

ASPLOS '16: Architectural Support for Programming Languages and Operating Systems

April 2 - 6, 2016

Atlanta, Georgia, USA

Acceptance Rates

ASPLOS '16 Paper Acceptance Rate 53 of 232 submissions, 23%;

Overall Acceptance Rate 535 of 2,713 submissions, 20%

Bibliometrics & Citations

Total Citations: 47
Total Downloads: 825
Downloads (Last 12 months): 27
Downloads (Last 6 weeks): 1

Reflects downloads up to 08 Feb 2025

Cited By

Kim H (2024). Compiler-assisted data placement for heterogeneous memory systems. IEICE Electronics Express 21:19, pp. 20240460-20240460. Online publication date: 10-Oct-2024
https://doi.org/10.1587/elex.21.20240460
Bailleu M, Stavrakakis D, Rocha R, Chakraborty S, Garg D, Bhatotia P (2024). Toast: A Heterogeneous Memory Management System. Proceedings of the 2024 International Conference on Parallel Architectures and Compilation Techniques, pp. 53-65. Online publication date: 14-Oct-2024
https://dl.acm.org/doi/10.1145/3656019.3676944
Michailidis T, Swanson S, Zhao J, Serafini M, Xu H (2022). PMShifter. Proceedings of the 13th ACM SIGOPS Asia-Pacific Workshop on Systems, pp. 1-8. Online publication date: 23-Aug-2022
https://dl.acm.org/doi/10.1145/3546591.3547523
Singh G, Nadig R, Park J, Bera R, Hajinazar N, Novo D, Gómez-Luna J, Stuijk S, Corporaal H, Mutlu O, Salapura V, Zahran M, Chong F, Tang L (2022). Sibyl. Proceedings of the 49th Annual International Symposium on Computer Architecture, pp. 320-336. Online publication date: 18-Jun-2022
https://dl.acm.org/doi/10.1145/3470496.3527442
Kannan S, Ren Y, Bhattacharjee A, Sherwood T, Berger E, Kozyrakis C (2021). KLOCs: kernel-level object contexts for heterogeneous memory systems. Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 65-78. Online publication date: 19-Apr-2021
https://dl.acm.org/doi/10.1145/3445814.3446745
Choi J, Blagodurov S, Tseng H (2021). Dancing in the Dark: Profiling for Tiered Memory. 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 13-22. Online publication date: May-2021
https://doi.org/10.1109/IPDPS49936.2021.00011
Liu H, Liu R, Liao X, Jin H, He B, Zhang Y (2020). Object-Level Memory Allocation and Migration in Hybrid Memory Systems. IEEE Transactions on Computers 69:9, pp. 1401-1413. Online publication date: 1-Sep-2020
https://doi.org/10.1109/TC.2020.2973134
Kim T, Jamil S, Park J, Kim Y (2020). Optimizing Heap Memory Object Placement in the Hybrid Memory System With Energy Constraints. IEEE Access 8, pp. 130323-130339. Online publication date: 2020
https://doi.org/10.1109/ACCESS.2020.3009432
Liu M, Peter S, Krishnamurthy A, Phothilimthana P, Dan T, Dahlia M (2019). E3. Proceedings of the 2019 USENIX Conference on Usenix Annual Technical Conference, pp. 363-378. Online publication date: 10-Jul-2019
https://dl.acm.org/doi/10.5555/3358807.3358839
Doudali T, Blagodurov S, Vishnu A, Gurumurthi S, Gavrilovska A, Weissman J, Butt A, Smirni E (2019). Kleio. Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing, pp. 37-48. Online publication date: 17-Jun-2019
https://dl.acm.org/doi/10.1145/3307681.3325398