Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
10.1145/3357384.3358045acmconferencesArticle/Chapter ViewAbstractPublication PagescikmConference Proceedingsconference-collections
research-article
Open access

AIBox: CTR Prediction Model Training on a Single Node

Published: 03 November 2019 Publication History
  • Get Citation Alerts
  • Abstract

    As one of the major search engines in the world, Baidu's Sponsored Search has long adopted the use of deep neural network (DNN) models for Ads click-through rate (CTR) predictions, as early as in 2013. The input futures used by Baidu's online advertising system (a.k.a. "Phoenix Nest'') are extremely high-dimensional (e.g., hundreds or even thousands of billions of features) and also extremely sparse. The size of the CTR models used by Baidu's production system can well exceed 10TB. This imposes tremendous challenges for training, updating, and using such models in production. For Baidu's Ads system, it is obviously important to keep the model training process highly efficient so that engineers (and researchers) are able to quickly refine and test their new models or new features. Moreover, as billions of user ads click history entries are arriving every day, the models have to be re-trained rapidly because CTR prediction is an extremely time-sensitive task. Baidu's current CTR models are trained on MPI (Message Passing Interface) clusters, which require high fault tolerance and synchronization that incur expensive communication and computation costs. And, of course, the maintenance costs for clusters are also substantial. This paper presents AIBox, a centralized system to train CTR models with tens-of-terabytes-scale parameters by employing solid-state drives (SSDs) and GPUs. Due to the memory limitation on GPUs, we carefully partition the CTR model into two parts: one is suitable for CPUs and another for GPUs. We further introduce a bi-level cache management system over SSDs to store the 10TB parameters while providing low-latency accesses. Extensive experiments on production data reveal the effectiveness of the new system. AIBox has comparable training performance with a large MPI cluster, while requiring only a small fraction of the cost for the cluster.

    References

    [1]
    Marc Abrams, Charles R Standridge, Ghaleb Abdulla, Stephen Williams, and Edward A Fox. 1996. Caching Proxies: Limitations and Potentials . World Wide Web Journal, Vol. 1, 1 (1996).
    [2]
    Nitin Agrawal, Vijayan Prabhakaran, Ted Wobber, John D Davis, Mark S Manasse, and Rina Panigrahy. 2008. Design Tradeoffs for SSD Performance. In USENIX Annual Technical Conference (USENIX ATC), Vol. 57.
    [3]
    Ashok Anand, Chitra Muthukrishnan, Steven Kappes, Aditya Akella, and Suman Nath. 2010. Cheap and Large CAMs for High Performance Data-Intensive Networked Systems. In Proceedings of the 7th USENIX Symposium on Networked Systems Design and Implementation (NSDI), Vol. 10. 29--29.
    [4]
    David G Andersen, Jason Franklin, Michael Kaminsky, Amar Phanishayee, Lawrence Tan, and Vijay Vasudevan. 2009. FAWN: A Fast Array of Wimpy Nodes. In Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles (SIGOPS). 1--14.
    [5]
    Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural Machine Translation by Jointly Learning to Align and Translate . arXiv preprint arXiv:1409.0473 (2014).
    [6]
    Jeff Bonwick et almbox. 1994. The Slab Allocator: An Object-Caching Kernel Memory Allocator. In USENIX Summer, Vol. 16.
    [7]
    Dhruba Borthakur. 2007. The Hadoop Distributed File System: Architecture and Design . Hadoop Project Website, Vol. 11, 2007 (2007), 21.
    [8]
    Y-Lan Boureau, Jean Ponce, and Yann LeCun. 2010. A Theoretical Analysis of Feature Pooling in Visual Recognition. In Proceedings of the 27th International Conference on Machine Learning (ICML) . 111--118.
    [9]
    Andrei Z. Broder and Michael Mitzenmacher. 2003. Survey: Network Applications of Bloom Filters: A Survey . Internet Mathematics, Vol. 1, 4, 485--509.
    [10]
    Pei Cao and Sandy Irani. 1997. Cost-Aware WWW Proxy Caching Algorithms. In USENIX Symposium on Internet Technologies and Systems (USITS), Vol. 12. 193--206.
    [11]
    Li-Pin Chang. 2007. On Efficient Wear Leveling for Large-Scale Flash-Memory Storage Systems. In Proceedings of the 2007 ACM Symposium on Applied Computing (SAC) . 1126--1130.
    [12]
    Wenlin Chen, James Wilson, Stephen Tyree, Kilian Weinberger, and Yixin Chen. 2015. Compressing Neural Networks with the Hashing Trick. In Proceedings of the 32nd International Conference on Machine Learning (ICML). 2285--2294.
    [13]
    Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et almbox. 2016. Wide & Deep Learning for Recommender Systems . In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems (RecSys). 7--10.
    [14]
    Hong-Tai Chou and David J DeWitt. 1986. An Evaluation of Buffer Management Strategies for Relational Database Systems . Algorithmica, Vol. 1, 1--4 (1986), 311--336.
    [15]
    Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep Neural Networks for Youtube Recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems (RecSys). 191--198.
    [16]
    Shaul Dar, Michael J Franklin, Bjorn T Jonsson, Divesh Srivastava, Michael Tan, et almbox. 1996. Semantic Data Caching and Replacement. In Proceedings of 22th International Conference on Very Large Data Bases (VLDB), Vol. 96. 330--341.
    [17]
    Biplob Debnath, Sudipta Sengupta, and Jin Li. 2010. FlashStore: High Throughput Persistent Key-Value Store . Proceedings of the VLDB Endowment, Vol. 3, 1--2 (2010), 1414--1425.
    [18]
    Biplob Debnath, Sudipta Sengupta, and Jin Li. 2011. SkimpyStash: RAM Space Skimpy Key-Value Store on Flash-based Storage. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data (SIGMOD) . 25--36.
    [19]
    Benjamin Edelman, Michael Ostrovsky, and Michael Schwarz. 2007. Internet Advertising and the Generalized Second-Price Auction: Selling Billions of Dollars Worth of Keywords . American Economic Review, Vol. 97, 1 (2007), 242--259.
    [20]
    Miao Fan, Jiacheng Guo, Shuai Zhu, Shuo Miao, Mingming Sun, and Ping Li. 2019. MOBIUS: Towards the Next Generation of Query-Ad Matching in Baidu's Sponsored Search. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD) . 2509--2517.
    [21]
    Denis Foley and John Danskin. 2017. Ultra-Performance Pascal GPU and NVLink Interconnect . IEEE Micro, Vol. 37, 2 (2017), 7--17.
    [22]
    Thore Graepel, Joaquin Quinonero Candela, Thomas Borchert, and Ralf Herbrich. 2010. Web-Scale Bayesian Click-Through Rate Prediction for Sponsored Search Advertising in Microsoft's Bing Search Engine. In Proceedings of the 27th International Conference on Machine Learning (ICML). 13--20.
    [23]
    Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: a Factorization-Machine based Neural Network for CTR Prediction. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI) . 1725--1731.
    [24]
    Qirong Ho, James Cipar, Henggang Cui, Seunghak Lee, Jin Kyu Kim, Phillip B Gibbons, Garth A Gibson, Greg Ganger, and Eric P Xing. 2013. More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server. In Advances in Neural Information Processing Systems (NIPS). 1223--1231.
    [25]
    Jin Huang and Charles X Ling. 2005. Using AUC and Accuracy in Evaluating Learning Algorithms . IEEE Transactions on Knowledge and Data Engineering, Vol. 17, 3 (2005), 299--310.
    [26]
    Stratos Idreos, Niv Dayan, Wilson Qin, Mali Akmanalp, Sophie Hilgard, Andrew Ross, James Lennon, Varun Jain, Harshita Gupta, David Li, et almbox. [n. d.]. Design Continuums and the Path Toward Self-Designing Key-Value Stores that Know and Learn. In Biennial Conference on Innovative Data Systems Research (CIDR) .
    [27]
    Qing-Yuan Jiang and Wu-Jun Li. 2017. Deep Cross-Modal Hashing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 3232--3240.
    [28]
    Stephen W Keckler, William J Dally, Brucek Khailany, Michael Garland, and David Glasco. 2011. GPUs and the Future of Parallel Computing . IEEE Micro 5 (2011), 7--17.
    [29]
    Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems (NIPS). 1097--1105.
    [30]
    Jason Kuen, Xiangfei Kong, Zhe Lin, Gang Wang, Jianxiong Yin, Simon See, and Yap-Peng Tan. 2018. Stochastic Downsampling for Cost-Adjustable Inference and Improved Regularization in Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 7929--7938.
    [31]
    Donghee Lee, Jongmoo Choi, Jong-Hun Kim, Sam H Noh, Sang Lyul Min, Yookun Cho, and Chong Sang Kim. 2001. LRFU: A Spectrum of Policies that Subsumes the Least Recently Used and Least Frequently Used Policies . IEEE Trans. Comput. 12 (2001), 1352--1361.
    [32]
    Mu Li, David G Andersen, Jun Woo Park, Alexander J Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J Shekita, and Bor-Yiing Su. 2014. Scaling Distributed Machine Learning with the Parameter Server. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI), Vol. 14. 583--598.
    [33]
    Ping Li, Art B Owen, and Cun-Hui Zhang. 2012. One Permutation Hashing. In Advances in Neural Information Processing Systems (NIPS). 3122--3130.
    [34]
    Ping Li, Anshumali Shrivastava, Joshua Moore, and Arnd Christian König. 2011. Hashing Algorithms for Large-Scale Learning. In Advances in Neural Information Processing Systems (NIPS). 2672--2680.
    [35]
    Jianxun Lian, Xiaohuan Zhou, Fuzheng Zhang, Zhongxia Chen, Xing Xie, and Guangzhong Sun. 2018. xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD) . 1754--1763.
    [36]
    Hyeontaek Lim, Bin Fan, David G Andersen, and Michael Kaminsky. 2011. SILT: A Memory-Efficient, High-Performance Key-Value Store. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles (SOSP) . 1--13.
    [37]
    Lanyue Lu, Thanumalayan Sankaranarayana Pillai, Hariharan Gopalakrishnan, Andrea C Arpaci-Dusseau, and Remzi H Arpaci-Dusseau. 2017. WiscKey: Separating Keys from Values in SSD-Conscious Storage . ACM Transactions on Storage (TOS), Vol. 13, 1 (2017), 5.
    [38]
    Stefano Markidis, Steven Wei Der Chien, Erwin Laure, Ivy Bo Peng, and Jeffrey S Vetter. 2018. NVIDIA Tensor Core Programmability, Performance & Precision . arXiv preprint arXiv:1803.04014 (2018).
    [39]
    Avantika Mathur, Mingming Cao, Suparna Bhattacharya, Andreas Dilger, Alex Tomas, and Laurent Vivier. 2007. The New Ext4 Filesystem: Current Status and Future Plans. In Proceedings of the Linux Symposium, Vol. 2. 21--33.
    [40]
    H Brendan McMahan, Gary Holt, David Sculley, Michael Young, Dietmar Ebner, Julian Grady, Lan Nie, Todd Phillips, Eugene Davydov, Daniel Golovin, et almbox. 2013. Ad Click Prediction: a View from the Trenches . In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD) . 1222--1230.
    [41]
    Michael Mitzenmacher. 2002. Compressed Bloom Filters . IEEE/ACM Transactions on Networking (TON), Vol. 10, 5 (2002), 604--612.
    [42]
    Shamkant Navathe, Stefano Ceri, Gio Wiederhold, and Jinglie Dou. 1984. Vertical Partitioning Algorithms for Database Design . ACM Transactions on Database Systems (TODS), Vol. 9, 4 (1984), 680--710.
    [43]
    John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron. 2008. Scalable Parallel Programming with CUDA . ACM Queue, Vol. 6, 2, 40--53.
    [44]
    NVIDIA. 2018. NVLink Fabric . https://www.nvidia.com/en-us/data-center/nvlink . Accessed: 2019-01--29.
    [45]
    Elizabeth J O'neil, Patrick E O'neil, and Gerhard Weikum. 1993. The LRU-K Page Replacement Algorithm for Database Disk Buffering ., Vol. 22, 2 (1993), 297--306.
    [46]
    Alexander D Poularikas. 1998. Handbook of Formulas and Tables for Signal Processing. CRC press.
    [47]
    Yanru Qu, Han Cai, Kan Ren, Weinan Zhang, Yong Yu, Ying Wen, and Jun Wang. 2016. Product-based Neural Networks for User Response Prediction. In Proceedings of the IEEE 16th International Conference on Data Mining (ICDM) . 1149--1154.
    [48]
    Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. 2011. Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent. In Advances in Neural Information Processing Systems (NIPS). 693--701.
    [49]
    Bianca Schroeder and Garth Gibson. 2010. A Large-Scale Study of Failures in High-Performance Computing Systems. IEEE Transactions on Dependable and Secure Computing, Vol. 7, 4 (2010), 337--350.
    [50]
    Ying Shan, T Ryan Hoens, Jian Jiao, Haijing Wang, Dong Yu, and JC Mao. 2016. Deep crossing: Web-scale Modeling without Manually Crafted Combinatorial Features. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). 255--262.
    [51]
    Anshumali Shrivastava and Ping Li. 2014. Asymmetric LSH (ALSH) for Sublinear Time Maximum Inner Product Search (MIPS). In Advances in Neural Information Processing Systems (NIPS). 2321--2329.
    [52]
    Marc Snir, Steve Otto, Steven Huss-Lederman, Jack Dongarra, and David Walker. 1998. MPI--the Complete Reference: the MPI Core. Vol. 1. MIT press.
    [53]
    Leonid B Sokolinsky. 2004. LFU-K: An Effective Buffer Management Replacement Algorithm. In International Conference on Database Systems for Advanced Applications (DASFAA) . 670--681.
    [54]
    Adam Sweeney, Doug Doucette, Wei Hu, Curtis Anderson, Mike Nishimoto, and Geoff Peck. 1996. Scalability in the XFS File System. In USENIX Annual Technical Conference (USENIX ATC), Vol. 15.
    [55]
    Shulong Tan, Zhixin Zhou, Zhaozhuo Xu, and Ping Li. 2019 a. Fast Item Ranking under Neural Network based Measures. Technical Report. Baidu Research.
    [56]
    Shulong Tan, Zhixin Zhou, Zhaozhuo Xu, and Ping Li. 2019 b. On Efficient Retrieval of Top Similarity Vectors. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP) .
    [57]
    Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, and Josh Attenberg. 2009. Feature Hashing for Large Scale Multitask Learning. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML) . 1113--1120.
    [58]
    Roland P Wooster and Marc Abrams. 1997. Proxy Caching that Estimates Page Load Delays . Computer Networks and ISDN Systems, Vol. 29, 8 (1997), 977--986.
    [59]
    Jun Xiao, Hao Ye, Xiangnan He, Hanwang Zhang, Fei Wu, and Tat-Seng Chua. 2017. Attentional Factorization Machines: Learning the Weight of Feature Interactions via Attention Networks. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI). 3119--3125.
    [60]
    Qiumin Xu, Huzefa Siyamwala, Mrinmoy Ghosh, Tameesh Suri, Manu Awasthi, Zvika Guz, Anahita Shayesteh, and Vijay Balakrishnan. 2015. Performance Analysis of NVMe SSDs and Their Implication on Real World Databases. In Proceedings of the 8th ACM International Systems and Storage Conference (SYSTOR). 6:1--6:11.
    [61]
    Shuangfei Zhai, Keng-hao Chang, Ruofei Zhang, and Zhongfei Mark Zhang. 2016. Deepintent: Learning Attentions for Online Advertising with Recurrent Neural Networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD) . 1295--1304.
    [62]
    Shanshan Zhang, Ce Zhang, Zhao You, Rong Zheng, and Bo Xu. 2013. Asynchronous Stochastic Gradient Descent for DNN Training. In Proceedings of 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 6660--6663.
    [63]
    Weijie Zhao, Yu Cheng, and Florin Rusu. 2015. Vertical Partitioning for Query Processing over Raw Data. In Proceedings of the 27th International Conference on Scientific and Statistical Database Management (SSDBM). 15:1--15:12.
    [64]
    Weijie Zhao, Shulong Tan, and Ping Li. 2019. SONG: Approximate Nearest Neighbor Search on GPU. Technical Report. Baidu Research.
    [65]
    Lei Zheng, Vahid Noroozi, and Philip S Yu. 2017. Joint Deep Modeling of Users and Items Using Reviews for Recommendation. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining (WSDM) . 425--434.
    [66]
    Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. 2018. Deep Interest Network for Click-Through Rate Prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD) . 1059--1068.

    Cited By

    View all
    • (2024)Pb-Hash: Partitioned b-bit HashingProceedings of the 2024 ACM SIGIR International Conference on Theory of Information Retrieval10.1145/3664190.3672523(239-246)Online publication date: 2-Aug-2024
    • (2024)Breaking Barriers: Expanding GPU Memory with Sub-Two Digit Nanosecond Latency CXL ControllerProceedings of the 16th ACM Workshop on Hot Topics in Storage and File Systems10.1145/3655038.3665953(108-115)Online publication date: 8-Jul-2024
    • (2024)CAFE: Towards Compact, Adaptive, and Fast Embedding for Large-scale Recommendation ModelsProceedings of the ACM on Management of Data10.1145/36393062:1(1-28)Online publication date: 26-Mar-2024
    • Show More Cited By

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    CIKM '19: Proceedings of the 28th ACM International Conference on Information and Knowledge Management
    November 2019
    3373 pages
    ISBN:9781450369763
    DOI:10.1145/3357384
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 03 November 2019

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. gpu computing
    2. sponsored search
    3. ssd cache management

    Qualifiers

    • Research-article

    Conference

    CIKM '19
    Sponsor:

    Acceptance Rates

    CIKM '19 Paper Acceptance Rate 202 of 1,031 submissions, 20%;
    Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

    Upcoming Conference

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)892
    • Downloads (Last 6 weeks)135
    Reflects downloads up to 10 Aug 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)Pb-Hash: Partitioned b-bit HashingProceedings of the 2024 ACM SIGIR International Conference on Theory of Information Retrieval10.1145/3664190.3672523(239-246)Online publication date: 2-Aug-2024
    • (2024)Breaking Barriers: Expanding GPU Memory with Sub-Two Digit Nanosecond Latency CXL ControllerProceedings of the 16th ACM Workshop on Hot Topics in Storage and File Systems10.1145/3655038.3665953(108-115)Online publication date: 8-Jul-2024
    • (2024)CAFE: Towards Compact, Adaptive, and Fast Embedding for Large-scale Recommendation ModelsProceedings of the ACM on Management of Data10.1145/36393062:1(1-28)Online publication date: 26-Mar-2024
    • (2024)NDRec: A Near-Data Processing System for Training Large-Scale Recommendation ModelsIEEE Transactions on Computers10.1109/TC.2024.336593973:5(1248-1261)Online publication date: May-2024
    • (2024)ElasticRec: A Microservice-based Model Serving Architecture Enabling Elastic Resource Scaling for Recommendation Models2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA)10.1109/ISCA59077.2024.00038(410-423)Online publication date: 29-Jun-2024
    • (2024)A Simple yet Effective Framework for Active Learning to RankMachine Intelligence Research10.1007/s11633-023-1422-z21:1(169-183)Online publication date: 15-Jan-2024
    • (2023)Experimental Analysis of Large-Scale Learnable Vector Storage CompressionProceedings of the VLDB Endowment10.14778/3636218.363623417:4(808-822)Online publication date: 1-Dec-2023
    • (2023)Efficient Fault Tolerance for Recommendation Model Training via Erasure CodingProceedings of the VLDB Endowment10.14778/3611479.361151416:11(3137-3150)Online publication date: 24-Aug-2023
    • (2023)PetPS: Supporting Huge Embedding Models with Persistent MemoryProceedings of the VLDB Endowment10.14778/3579075.357907716:5(1013-1022)Online publication date: 6-Mar-2023
    • (2023)Gradient Matching for Categorical Data Distillation in CTR PredictionProceedings of the 17th ACM Conference on Recommender Systems10.1145/3604915.3608769(161-170)Online publication date: 14-Sep-2023
    • Show More Cited By

    View Options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Get Access

    Login options

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media