
DimBoost: Boosting Gradient Boosting Decision Tree to Higher Dimensions

Published: 27 May 2018 | DOI: 10.1145/3183713.3196892

Abstract

Gradient boosting decision tree (GBDT) is one of the most popular machine learning models, widely used in both academia and industry. Although GBDT is supported by existing systems such as XGBoost, LightGBM, and MLlib, a system bottleneck appears when the dimensionality of the data becomes high. As a result, when we tried to support our industrial partner on datasets with dimensionality up to 330K, we observed suboptimal performance in all of these systems. In this paper, we ask: "Can we build a scalable GBDT training system whose performance scales better with the dimensionality of the data?"
The first contribution of this paper is a careful investigation of existing systems, conducted by developing a performance model with respect to the dimensionality of the data. We find that the collective communication operations in many existing systems implement only the algorithm designed for small messages. By fixing this problem alone, we are able to speed up these systems by up to 2X. Our second contribution is a series of optimizations that further improve the performance of collective communication, including a task scheduler, a two-phase split-finding method, and low-precision gradient histograms. Our third contribution is a sparsity-aware algorithm for building gradient histograms, together with a novel index structure for building histograms in parallel. We implement these optimizations in DimBoost and show that it can be 2-9X faster than existing systems.
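
The abstract compresses several techniques; the central one, sparsity-aware gradient-histogram construction, is worth unpacking. The sketch below is a minimal single-machine illustration of the general idea, not DimBoost's actual implementation: it accumulates first- and second-order gradient histograms by visiting only the non-zero entries of a CSR-encoded feature matrix (cost proportional to the number of non-zeros rather than instances × dimensions), and passing dtype=np.float16 stands in for the paper's low-precision histograms, whose exact encoding is not reproduced here. All function and variable names are illustrative.

```python
import numpy as np

def build_sparse_histograms(indptr, indices, bins, grad, hess,
                            n_features, n_bins, dtype=np.float32):
    """Accumulate per-feature gradient/hessian histograms for one tree node.

    (indptr, indices) is the CSR structure of the instance-by-feature
    matrix; bins[k] is the precomputed bin index of the k-th stored value.
    Only non-zero entries are visited, so cost scales with nnz, not n*d.
    dtype=np.float16 mimics low-precision histograms (an assumption, not
    the paper's exact encoding).
    """
    grad_hist = np.zeros((n_features, n_bins), dtype=dtype)
    hess_hist = np.zeros((n_features, n_bins), dtype=dtype)
    for i in range(len(indptr) - 1):               # each training instance
        g, h = grad[i], hess[i]
        for k in range(indptr[i], indptr[i + 1]):  # its non-zero features only
            grad_hist[indices[k], bins[k]] += g
            hess_hist[indices[k], bins[k]] += h
    return grad_hist, hess_hist

def best_split(grad_hist, hess_hist, lam=1.0):
    """Pick the (feature, bin) boundary with the highest XGBoost-style gain."""
    G = grad_hist.sum(axis=1, keepdims=True)       # node totals per feature
    H = hess_hist.sum(axis=1, keepdims=True)
    GL = np.cumsum(grad_hist, axis=1)              # left-child prefix sums
    HL = np.cumsum(hess_hist, axis=1)
    gain = GL**2 / (HL + lam) + (G - GL)**2 / (H - HL + lam) - G**2 / (H + lam)
    f, b = np.unravel_index(np.argmax(gain), gain.shape)
    return f, b, gain[f, b]

# Toy usage: 3 instances, 4 features, 2 bins, CSR sparse layout.
indptr  = np.array([0, 2, 3, 5])
indices = np.array([0, 2, 1, 2, 3])
bins    = np.array([0, 1, 0, 0, 1])
grad    = np.array([0.5, -1.2, 0.3])
hess    = np.ones(3)
gh, hh = build_sparse_histograms(indptr, indices, bins, grad, hess, 4, 2)
print(best_split(gh, hh))
```

In the distributed setting the paper describes, each worker would build such histograms over its data partition, and the merged (summed) histograms are what the collective-communication and two-phase split-finding optimizations operate on. Since histogram size grows with n_features × n_bins, this merge step is exactly where high dimensionality stresses communication.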




    Published In

    SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data
    May 2018, 1874 pages
    ISBN: 9781450347037
    DOI: 10.1145/3183713

    Publisher

    Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. collective communication
    2. gradient boosting decision tree
    3. gradient histogram
    4. high-dimensional feature
    5. parameter server

    Acceptance Rates

    SIGMOD '18 Paper Acceptance Rate 90 of 461 submissions, 20%;
    Overall Acceptance Rate 785 of 4,003 submissions, 20%


