DOI: 10.1145/3447548.3467084

Hierarchical Training: Scaling Deep Recommendation Models on Large CPU Clusters

Published: 14 August 2021

Abstract

Neural network based recommendation models are widely used to power many internet-scale applications, including product recommendation and feed ranking. As these models become more complex and require more training data, improving their training scalability becomes an urgent need. However, improving scalability without sacrificing model quality is challenging. In this paper, we conduct an in-depth analysis of the scalability bottlenecks of the existing training architecture on large-scale CPU clusters. Based on these observations, we propose a new training architecture called Hierarchical Training, which exploits both data parallelism and model parallelism for the neural network part of the model within a group. We implement hierarchical training with a two-layer design: a tagging system that decides operator placement and a net transformation system that materializes the training plans, and we integrate hierarchical training into the existing training stack. We propose several optimizations to improve the scalability of hierarchical training, including model architecture optimization, communication compression, and various system-level improvements. Extensive experiments at massive scale demonstrate that hierarchical training can speed up distributed recommendation model training by 1.9x without any drop in model quality.
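
The two-layer design described in the abstract (a tagging system that decides operator placement, and a net transformation system that materializes the training plans) can be illustrated with a minimal, hypothetical Python sketch. The names (`Operator`, `tag_operators`, `materialize_plan`) and the placement rule below are illustrative assumptions, not the paper's actual implementation:

```python
# Hypothetical sketch only: names and the placement rule are illustrative,
# not the paper's actual API.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Operator:
    name: str
    kind: str                      # e.g. "embedding_lookup", "fc", "interaction"
    placement: str = "unassigned"  # filled in by the tagging layer


def tag_operators(ops: List[Operator]) -> List[Operator]:
    """Tagging layer: decide a placement for every operator.

    Illustrative rule: memory-heavy embedding lookups stay model-parallel
    (sharded across hosts in the group), while the dense neural-network part
    is replicated data-parallel within the group.
    """
    for op in ops:
        op.placement = "model_parallel" if op.kind == "embedding_lookup" else "data_parallel"
    return ops


def materialize_plan(ops: List[Operator], hosts: List[str]) -> Dict[str, List[str]]:
    """Net-transformation layer: turn placement tags into per-host plans."""
    plan: Dict[str, List[str]] = {h: [] for h in hosts}
    for op in ops:
        if op.placement == "data_parallel":
            for h in hosts:                       # replicate on every host in the group
                plan[h].append(f"{op.name} (replica)")
        else:
            shard_host = hosts[hash(op.name) % len(hosts)]  # shard across the group
            plan[shard_host].append(f"{op.name} (shard)")
    return plan


if __name__ == "__main__":
    model = [
        Operator("user_emb", "embedding_lookup"),
        Operator("item_emb", "embedding_lookup"),
        Operator("top_mlp", "fc"),
    ]
    print(materialize_plan(tag_operators(model), ["host0", "host1"]))
```

In this toy version the tagging rule simply shards embedding lookups and replicates the dense layers within a group; per the abstract, the actual system decides placement per operator and then rewrites the training net accordingly.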



Published In

KDD '21: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining
August 2021
4259 pages
ISBN:9781450383325
DOI:10.1145/3447548
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. distributed training
  2. optimization
  3. system for machine learning

Qualifiers

  • Research-article

Conference

KDD '21

Acceptance Rates

Overall Acceptance Rate: 1,133 of 8,635 submissions (13%)

Article Metrics

  • Downloads (Last 12 months): 53
  • Downloads (Last 6 weeks): 5
Reflects downloads up to 13 Sep 2024


Cited By

  • RecTS: A Temporal-Aware Memory System Optimization for Training Deep Learning Recommendation Models. Proceedings of the 17th ACM International Systems and Storage Conference (2024), 104-117. https://doi.org/10.1145/3688351.3689155
  • Adapting Job Recommendations to User Preference Drift with Behavioral-Semantic Fusion Learning. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2024), 1004-1015. https://doi.org/10.1145/3637528.3671759
  • AtRec: Accelerating Recommendation Model Training on CPUs. IEEE Transactions on Parallel and Distributed Systems 35, 6 (2024), 905-918. https://doi.org/10.1109/TPDS.2024.3381186
  • MP-Rec: Hardware-Software Co-design to Enable Multi-path Recommendation. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3 (2023), 449-465. https://doi.org/10.1145/3582016.3582068
