DOI: 10.1145/3620678.3624656
research-article
Open access

A Comparison of End-to-End Decision Forest Inference Pipelines

Published: 31 October 2023 Publication History

Abstract

Decision forests, including RandomForest, XGBoost, and LightGBM, dominate machine learning tasks over tabular data. Recently, several frameworks have been developed for decision forest inference, such as ONNX, TreeLite from Amazon, TensorFlow Decision Forests from Google, HummingBird from Microsoft, Nvidia FIL, and lleaves. While these frameworks are fully optimized for inference computations, they are all decoupled from databases and general data management frameworks, which leads to cross-system performance overheads. We first provided a DICT model to understand the performance gaps between decoupled and in-database inference. We further identified that, for in-database inference, in addition to the popular UDF-centric representation that encapsulates the ML model into one User Defined Function (UDF), there also exists a relation-centric representation that breaks decision forest inference down into several fine-grained SQL operations. The relation-centric representation can achieve significantly better performance for large models. We optimized both implementations and conducted a comprehensive benchmark comparing them with the aforementioned decoupled inference pipelines and with existing in-database inference pipelines such as Spark-SQL and PostgresML. The evaluation results validated the DICT model and demonstrated the superior performance of our in-database inference design compared to the baselines.
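
The contrast between the two in-database representations can be illustrated with a small sketch. This is not the authors' implementation: the toy forest, the sample rows, and the function names (`eval_tree`, `predict_udf_centric`, `predict_relation_centric`) are all hypothetical, and the sketch assumes leaf values are summed per sample, as a gradient-boosted ensemble would do before any link function. The UDF-centric variant scores each row inside one opaque function; the relation-centric variant treats samples and trees as two relations, scores each (sample, tree) pair, and aggregates by sample id.

```python
from collections import defaultdict
from itertools import product

# A tree is a nested tuple: ("leaf", value) or
# ("split", feature_index, threshold, left_subtree, right_subtree).
# The forest and samples below are toy, hypothetical data.
FOREST = [
    ("split", 0, 0.5, ("leaf", 1.0), ("leaf", -1.0)),
    ("split", 1, 2.0, ("leaf", 0.5),
     ("split", 0, 0.2, ("leaf", 0.0), ("leaf", 2.0))),
]
SAMPLES = [(0.1, 1.0), (0.9, 3.0)]


def eval_tree(tree, row):
    """Walk one tree from root to leaf for a single feature row."""
    while tree[0] == "split":
        _, feature, threshold, left, right = tree
        tree = left if row[feature] <= threshold else right
    return tree[1]


def predict_udf_centric(forest, rows):
    """UDF-centric: one opaque function scores a row against the whole forest."""
    return [sum(eval_tree(tree, row) for tree in forest) for row in rows]


def predict_relation_centric(forest, rows):
    """Relation-centric: cross-join samples with trees, then aggregate."""
    sample_rel = list(enumerate(rows))    # relation of (sample_id, row)
    tree_rel = list(enumerate(forest))    # relation of (tree_id, tree)
    # Cross product: one partial score per (sample, tree) pair.
    partial = [
        (sample_id, eval_tree(tree, row))
        for (sample_id, row), (_tree_id, tree) in product(sample_rel, tree_rel)
    ]
    # Equivalent of GROUP BY sample_id with SUM(score).
    totals = defaultdict(float)
    for sample_id, score in partial:
        totals[sample_id] += score
    return [totals[sample_id] for sample_id, _ in sample_rel]


if __name__ == "__main__":
    print(predict_udf_centric(FOREST, SAMPLES))
    print(predict_relation_centric(FOREST, SAMPLES))
```

Both functions return the same predictions. In the relational form, the per-pair scoring and the aggregation are separate steps, which is what would let a query engine partition the tree relation and schedule the work independently of the data.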


Cited By

  • (2024) Intelligent recognition of high-quality academic papers: based on knowledge-based metasemantic networks. Scientometrics 129:11 (6779-6812). DOI: 10.1007/s11192-024-05157-2. Online publication date: 22-Sep-2024
  • (2023) Avaliação de Estilos de Código para Árvores de Decisão em GPU com Microbenchmarks. Anais do XXIV Simpósio em Sistemas Computacionais de Alto Desempenho (SSCAD 2023) (277-288). DOI: 10.5753/wscad.2023.235903. Online publication date: 17-Oct-2023

      Published In

      SoCC '23: Proceedings of the 2023 ACM Symposium on Cloud Computing
      October 2023
      624 pages
      ISBN:9798400703874
      DOI:10.1145/3620678
      This work is licensed under a Creative Commons Attribution International 4.0 License.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. Decision Forest
      2. Machine Learning System

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Conference

      SoCC '23: ACM Symposium on Cloud Computing
      October 30 - November 1, 2023
      Santa Cruz, CA, USA

      Acceptance Rates

      Overall Acceptance Rate 169 of 722 submissions, 23%

      Article Metrics

      • Downloads (Last 12 months): 550
      • Downloads (Last 6 weeks): 67
      Reflects downloads up to 18 Feb 2025
