
DimBoost: Boosting Gradient Boosting Decision Tree to Higher Dimensions

Published: 27 May 2018 | DOI: 10.1145/3183713.3196892

Abstract

Gradient boosting decision tree (GBDT) is one of the most popular machine learning models, widely used in both academia and industry. Although GBDT is supported by existing systems such as XGBoost, LightGBM, and MLlib, a system bottleneck appears when the dimensionality of the data becomes high. As a result, when we tried to support our industrial partner on datasets with dimensionality up to 330K, we observed suboptimal performance in all of these systems. In this paper, we ask: "Can we build a scalable GBDT training system whose performance scales better with the dimensionality of the data?"
The first contribution of this paper is a careful investigation of existing systems, conducted by developing a performance model with respect to the dimensionality of the data. We find that the collective communication operations in many existing systems implement only the algorithm designed for small messages. By fixing this problem alone, we are able to speed up these systems by up to 2X. Our second contribution is a series of optimizations that further improve the performance of collective communication, including a task scheduler, a two-phase split-finding method, and low-precision gradient histograms. Our third contribution is a sparsity-aware algorithm for building gradient histograms, together with a novel index structure for building histograms in parallel. We implement these optimizations in DimBoost and show that it can be 2-9X faster than existing systems.
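
The abstract compresses several techniques; the central one, sparsity-aware gradient-histogram construction, is worth unpacking. The sketch below is a minimal single-machine illustration of the general idea, not DimBoost's actual implementation: it accumulates first- and second-order gradient histograms by visiting only the non-zero entries of a CSR-encoded feature matrix (cost proportional to the number of non-zeros rather than instances × dimensions), and passing dtype=np.float16 stands in for the paper's low-precision histograms, whose exact encoding is not reproduced here. All function and variable names are illustrative.

```python
import numpy as np

def build_sparse_histograms(indptr, indices, bins, grad, hess,
                            n_features, n_bins, dtype=np.float32):
    """Accumulate per-feature gradient/hessian histograms for one tree node.

    (indptr, indices) is the CSR structure of the instance-by-feature
    matrix; bins[k] is the precomputed bin index of the k-th stored value.
    Only non-zero entries are visited, so cost scales with nnz, not n*d.
    dtype=np.float16 mimics low-precision histograms (an assumption, not
    the paper's exact encoding).
    """
    grad_hist = np.zeros((n_features, n_bins), dtype=dtype)
    hess_hist = np.zeros((n_features, n_bins), dtype=dtype)
    for i in range(len(indptr) - 1):               # each training instance
        g, h = grad[i], hess[i]
        for k in range(indptr[i], indptr[i + 1]):  # its non-zero features only
            grad_hist[indices[k], bins[k]] += g
            hess_hist[indices[k], bins[k]] += h
    return grad_hist, hess_hist

def best_split(grad_hist, hess_hist, lam=1.0):
    """Pick the (feature, bin) boundary with the highest XGBoost-style gain."""
    G = grad_hist.sum(axis=1, keepdims=True)       # node totals per feature
    H = hess_hist.sum(axis=1, keepdims=True)
    GL = np.cumsum(grad_hist, axis=1)              # left-child prefix sums
    HL = np.cumsum(hess_hist, axis=1)
    gain = GL**2 / (HL + lam) + (G - GL)**2 / (H - HL + lam) - G**2 / (H + lam)
    f, b = np.unravel_index(np.argmax(gain), gain.shape)
    return f, b, gain[f, b]

# Toy usage: 3 instances, 4 features, 2 bins, CSR sparse layout.
indptr  = np.array([0, 2, 3, 5])
indices = np.array([0, 2, 1, 2, 3])
bins    = np.array([0, 1, 0, 0, 1])
grad    = np.array([0.5, -1.2, 0.3])
hess    = np.ones(3)
gh, hh = build_sparse_histograms(indptr, indices, bins, grad, hess, 4, 2)
print(best_split(gh, hh))
```

In the distributed setting the paper describes, each worker would build such histograms over its data partition, and the merged (summed) histograms are what the collective-communication and two-phase split-finding optimizations operate on. Since histogram size grows with n_features × n_bins, this merge step is exactly where high dimensionality stresses communication.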




    Published In

    SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data
    May 2018, 1874 pages
    ISBN: 9781450347037
    DOI: 10.1145/3183713

    Publisher

    Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. collective communication
    2. gradient boosting decision tree
    3. gradient histogram
    4. high-dimensional feature
    5. parameter server

    Acceptance Rates

    SIGMOD '18 Paper Acceptance Rate 90 of 461 submissions, 20%;
    Overall Acceptance Rate 785 of 4,003 submissions, 20%


