DOI: 10.1145/3373376.3378465
Research article | Open access

AutoTM: Automatic Tensor Movement in Heterogeneous Memory Systems using Integer Linear Programming

Published: 13 March 2020

Abstract

Memory capacity is a key bottleneck for training large-scale neural networks. Intel® Optane™ DC PMMs (persistent memory modules), available as NVDIMMs, are a disruptive technology that promises significantly higher read bandwidth than traditional SSDs at a lower cost per bit than traditional DRAM. In this work, we show how to take advantage of this new memory technology to minimize the amount of DRAM required without significantly compromising performance. Specifically, we exploit the static nature of the underlying computational graphs in deep neural network applications to develop AutoTM, a profile-guided optimization based on Integer Linear Programming (ILP) that optimally assigns and moves live tensors between DRAM and NVDIMMs. Our approach can replace 50% to 80% of a system's DRAM with PMM while losing only 27.7% performance (geometric mean). This is a significant improvement over first-touch NUMA, which loses 71.9% of performance. The proposed ILP-based synchronous scheduling technique also provides 2x the performance of using DRAM as a hardware-controlled cache for very large networks.
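The core idea in the abstract can be illustrated as a 0-1 assignment problem: for each live tensor, choose DRAM or PMM so as to minimize the total profiled slowdown, subject to a DRAM capacity budget. The sketch below is a toy illustration only (all tensor names, sizes, and cost numbers are hypothetical, and the paper's actual ILP additionally models tensor lifetimes and movement between pools); it brute-forces the binary assignment with the standard library rather than calling an ILP solver:

```python
from itertools import product

# Hypothetical profile: tensor -> (size in GB, extra runtime cost if the
# tensor lives in slower PMM). DRAM placement is taken as zero extra cost.
tensors = {
    "conv1_act": (4, 9.0),
    "conv2_act": (6, 5.0),
    "fc_weights": (2, 1.5),
    "grad_buf": (8, 7.0),
}
DRAM_BUDGET_GB = 10  # hypothetical DRAM capacity

def best_assignment(tensors, budget):
    """Enumerate every binary DRAM/PMM choice, keep the feasible one
    (DRAM-resident sizes fit the budget) with the smallest PMM penalty."""
    names = list(tensors)
    best = None
    for choice in product((0, 1), repeat=len(names)):  # 1 = place in DRAM
        dram_used = sum(tensors[n][0] for n, c in zip(names, choice) if c)
        if dram_used > budget:
            continue  # infeasible: exceeds DRAM capacity
        cost = sum(tensors[n][1] for n, c in zip(names, choice) if not c)
        if best is None or cost < best[0]:
            best = (cost, dict(zip(names, choice)))
    return best

cost, placement = best_assignment(tensors, DRAM_BUDGET_GB)
print(cost, placement)
```

An ILP solver reaches the same optimum without enumerating all 2^n assignments, which is what makes the approach scale to the thousands of tensors in a real computational graph.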




        Published In

        ASPLOS '20: Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems
        March 2020
        1412 pages
ISBN: 9781450371025
DOI: 10.1145/3373376

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Author Tags

        1. machine learning
        2. memory optimization
        3. persistent memory


        Acceptance Rates

        Overall Acceptance Rate 535 of 2,713 submissions, 20%

        Article Metrics

• Downloads (last 12 months): 565
• Downloads (last 6 weeks): 53
        Reflects downloads up to 10 Oct 2024

Cited By
• (2024) Activation Sequence Caching: High-Throughput and Memory-Efficient Generative Inference with a Single GPU. Proceedings of the 2024 International Conference on Parallel Architectures and Compilation Techniques, pp. 78-90. DOI: 10.1145/3656019.3676945. Online publication date: 14-Oct-2024.
• (2024) DeepTM: Efficient Tensor Management in Heterogeneous Memory for DNN Training. IEEE Transactions on Parallel and Distributed Systems, 35(11), pp. 1920-1935. DOI: 10.1109/TPDS.2024.3431910. Online publication date: Nov-2024.
• (2024) Training and Serving System of Foundation Models: A Comprehensive Survey. IEEE Open Journal of the Computer Society, 5, pp. 107-119. DOI: 10.1109/OJCS.2024.3380828. Online publication date: 2024.
• (2024) Enabling Large Dynamic Neural Network Training with Learning-based Memory Management. 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 788-802. DOI: 10.1109/HPCA57654.2024.00066. Online publication date: 2-Mar-2024.
• (2023) MODeL. Proceedings of the 40th International Conference on Machine Learning, pp. 32618-32632. DOI: 10.5555/3618408.3619760. Online publication date: 23-Jul-2023.
• (2023) ENTS: Flush-and-Fence-Free Failure Atomic Transactions. Proceedings of the International Symposium on Memory Systems, pp. 1-16. DOI: 10.1145/3631882.3631907. Online publication date: 2-Oct-2023.
• (2023) Fastensor: Optimise the Tensor I/O Path from SSD to GPU for Deep Learning Training. ACM Transactions on Architecture and Code Optimization, 20(4), pp. 1-25. DOI: 10.1145/3630108. Online publication date: 25-Oct-2023.
• (2023) G10: Enabling An Efficient Unified GPU Memory and Storage Architecture with Smart Tensor Migrations. Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 395-410. DOI: 10.1145/3613424.3614309. Online publication date: 28-Oct-2023.
• (2023) CoTrain: Efficient Scheduling for Large-Model Training upon GPU and CPU in Parallel. Proceedings of the 52nd International Conference on Parallel Processing, pp. 92-101. DOI: 10.1145/3605573.3605647. Online publication date: 7-Aug-2023.
• (2023) MEMTIS: Efficient Memory Tiering with Dynamic Page Classification and Page Size Determination. Proceedings of the 29th Symposium on Operating Systems Principles, pp. 17-34. DOI: 10.1145/3600006.3613167. Online publication date: 23-Oct-2023.
