A Comparison of End-to-End Decision Forest Inference Pipelines

Authors: Mahidhar Dwarampudi, Venkatesh Gunda, Jia Zou

SoCC '23: Proceedings of the 2023 ACM Symposium on Cloud Computing

Pages 200 - 215

https://doi.org/10.1145/3620678.3624656

Published: 31 October 2023

Abstract

Decision forests, including RandomForest, XGBoost, and LightGBM, dominate machine learning tasks over tabular data. Recently, several frameworks have been developed for decision forest inference, such as ONNX, TreeLite from Amazon, TensorFlow Decision Forests from Google, HummingBird from Microsoft, Nvidia FIL, and lleaves. While these frameworks are fully optimized for inference computations, they are all decoupled from databases and general data management frameworks, which leads to cross-system performance overheads. We first provided a DICT model to understand the performance gaps between decoupled and in-database inference. We further identified that, for in-database inference, in addition to the popular UDF-centric representation that encapsulates the ML model in a single User Defined Function (UDF), there also exists a relation-centric representation that breaks decision forest inference down into several fine-grained SQL operations. The relation-centric representation can achieve significantly better performance for large models. We optimized both implementations and conducted a comprehensive benchmark comparing them with the aforementioned decoupled inference pipelines and with existing in-database inference pipelines such as Spark-SQL and PostgresML. The evaluation results validated the DICT model and demonstrated the superior performance of our in-database inference design compared to the baselines.
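
To make the distinction between the two in-database representations concrete, the following is a minimal sketch, not the authors' implementation. It assumes SQLite (via Python's standard sqlite3 module) as a stand-in database, and the toy forest, the table names (nodes, samples, features), and the function name forest_predict are all hypothetical. The UDF-centric variant evaluates the whole forest inside one opaque function, while the relation-centric variant stores tree nodes as rows and walks them with joins in a recursive query.

```python
import sqlite3

# Hypothetical toy forest, one row per node:
# (tree_id, node_id, feature, threshold, left_id, right_id, leaf_value).
# Internal nodes carry a feature and threshold; leaves carry only leaf_value.
NODES = [
    (0, 0, "f0", 0.5, 1, 2, None),
    (0, 1, None, None, None, None, 1.0),
    (0, 2, None, None, None, None, 2.0),
    (1, 0, "f1", 1.5, 1, 2, None),
    (1, 1, None, None, None, None, 10.0),
    (1, 2, "f0", 2.0, 3, 4, None),
    (1, 3, None, None, None, None, 20.0),
    (1, 4, None, None, None, None, 30.0),
]
# Hypothetical samples: (sample_id, f0, f1).
SAMPLES = [(1, 0.2, 1.0), (2, 0.9, 3.0), (3, 3.0, 2.0)]

INDEX = {(t, n): (feat, thr, left, right, val)
         for t, n, feat, thr, left, right, val in NODES}
TREE_IDS = sorted({node[0] for node in NODES})


def forest_predict(f0, f1):
    """UDF-centric: the whole forest is evaluated inside one function."""
    row = {"f0": f0, "f1": f1}
    total = 0.0
    for tree_id in TREE_IDS:
        node_id = 0  # every tree's root has node_id 0 in this toy encoding
        while True:
            feat, thr, left, right, val = INDEX[(tree_id, node_id)]
            if feat is None:  # reached a leaf
                total += val
                break
            node_id = left if row[feat] <= thr else right
    return total


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE nodes (tree_id INTEGER, node_id INTEGER, feature TEXT,
                            threshold REAL, left_id INTEGER, right_id INTEGER,
                            leaf_value REAL);
        CREATE TABLE samples (sample_id INTEGER PRIMARY KEY, f0 REAL, f1 REAL);
        -- Long-format view so a node's feature name can be joined to its value.
        CREATE VIEW features AS
            SELECT sample_id, 'f0' AS feature, f0 AS value FROM samples
            UNION ALL
            SELECT sample_id, 'f1', f1 FROM samples;
    """)
    conn.executemany("INSERT INTO nodes VALUES (?, ?, ?, ?, ?, ?, ?)", NODES)
    conn.executemany("INSERT INTO samples VALUES (?, ?, ?)", SAMPLES)
    conn.create_function("forest_predict", 2, forest_predict)
    return conn


def run_udf_centric(conn):
    sql = "SELECT sample_id, forest_predict(f0, f1) FROM samples ORDER BY sample_id"
    return conn.execute(sql).fetchall()


def run_relation_centric(conn):
    """Relation-centric: traversal expressed as joins over the nodes table."""
    sql = """
        WITH RECURSIVE walk(sample_id, tree_id, node_id) AS (
            -- Start every (sample, tree) pair at the root.
            SELECT s.sample_id, n.tree_id, n.node_id
            FROM samples s CROSS JOIN nodes n
            WHERE n.node_id = 0
            UNION ALL
            -- One step down; leaves have a NULL feature, so they match no
            -- feature row and the recursion stops there.
            SELECT w.sample_id, w.tree_id,
                   CASE WHEN f.value <= n.threshold THEN n.left_id ELSE n.right_id END
            FROM walk w
            JOIN nodes n ON n.tree_id = w.tree_id AND n.node_id = w.node_id
            JOIN features f ON f.sample_id = w.sample_id AND f.feature = n.feature
        )
        SELECT w.sample_id, SUM(n.leaf_value)
        FROM walk w
        JOIN nodes n ON n.tree_id = w.tree_id AND n.node_id = w.node_id
        WHERE n.feature IS NULL
        GROUP BY w.sample_id
        ORDER BY w.sample_id
    """
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    db = make_db()
    print(run_udf_centric(db))
    print(run_relation_centric(db))
```

Both functions return one summed score per sample. In the relation-centric form the model is ordinary relational data that the query engine can scan, join, and aggregate, which is one plausible reading of the "fine-grained SQL operations" mentioned in the abstract. The recursive query is only one of several ways such a decomposition could be written.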

Cited By

Tang X, Du X, Wang Q, Wu J. (2024). Intelligent recognition of high-quality academic papers: based on knowledge-based metasemantic networks. Scientometrics 129:11 (6779-6812). Online publication date: 22-Sep-2024. https://doi.org/10.1007/s11192-024-05157-2

Penha J, Silva A, Barros O, Moreira I, Nacif J, Ferreira R. (2023). Avaliação de Estilos de Código para Árvores de Decisão em GPU com Microbenchmarks. Anais do XXIV Simpósio em Sistemas Computacionais de Alto Desempenho (SSCAD 2023) (277-288). Online publication date: 17-Oct-2023. https://doi.org/10.5753/wscad.2023.235903

Index Terms

A Comparison of End-to-End Decision Forest Inference Pipelines
1. Computer systems organization
2. Computing methodologies
  1. Machine learning

Published In

SoCC '23: Proceedings of the 2023 ACM Symposium on Cloud Computing

October 2023

624 pages

ISBN:9798400703874

DOI:10.1145/3620678

Copyright © 2023 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

SoCC '23

SoCC '23: ACM Symposium on Cloud Computing

October 30 - November 1, 2023

Santa Cruz, CA, USA

Acceptance Rates

Overall Acceptance Rate 169 of 722 submissions, 23%

Bibliometrics

Total Citations: 2
Total Downloads: 705
Downloads (Last 12 months): 550
Downloads (Last 6 weeks): 67

Reflects downloads up to 18 Feb 2025
