research-article

Towards a polyglot framework for factorized ML

Authors:

Nadia Polikarpova,

Arun Kumar

Proceedings of the VLDB Endowment, Volume 14, Issue 12

Pages 2918 - 2931

https://doi.org/10.14778/3476311.3476372

Published: 01 July 2021

Abstract

Optimizing machine learning (ML) workloads on structured data is a key concern for data platforms. One class of optimizations called "factorized ML" helps reduce ML runtimes over multi-table datasets by pushing ML computations down through joins, avoiding the need to materialize such joins. The recent Morpheus system automated factorized ML for any ML algorithm expressible in linear algebra (LA). But all such prior factorized ML/LA stacks are restricted by their chosen programming language (PL) and runtime environment, limiting their reach in emerging industrial data science environments with many PLs (R, Python, etc.) and even cross-PL analytics workflows. Re-implementing Morpheus from scratch in each PL/environment is a massive developability overhead for implementation, testing, and maintenance. We tackle this challenge by proposing a new system architecture, Trinity, to enable factorized LA logic to be written only once and easily reused across many PLs/LA tools in one go. To do this in an extensible and efficient manner without costly data copies, Trinity leverages and extends an emerging industrial polyglot compiler and runtime, Oracle's GraalVM. Trinity enables factorized LA in multiple PLs and even cross-PL workflows. Experiments with real datasets show that Trinity is significantly faster than materialized execution (more than 8x speedups in some cases), while being largely competitive with a prior single PL-focused Morpheus stack.
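
The core idea of factorized ML mentioned in the abstract, pushing an LA computation through a join instead of materializing the join, can be illustrated with a minimal sketch. This example is not taken from Morpheus or Trinity: the function names, the two-table schema (a table S with a foreign key into a table R), and the data are all hypothetical, and plain Python lists stand in for a real LA library.

```python
# Hypothetical toy example; names and data are illustrative assumptions,
# not the API of Morpheus, Trinity, or GraalVM.

def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def materialized_matvec(S, fk, R, w):
    """Join first: build each wide row [S_i, R_fk[i]], then multiply by w."""
    return [dot(s_row + R[k], w) for s_row, k in zip(S, fk)]

def factorized_matvec(S, fk, R, w):
    """Push the product through the join: T w = S w_S + K (R w_R),
    where K is the (implicit) indicator matrix encoded by fk."""
    d_s = len(S[0])
    w_s, w_r = w[:d_s], w[d_s:]
    # Each row of R is multiplied once, however many S rows reference it.
    partial_r = [dot(r_row, w_r) for r_row in R]
    return [dot(s_row, w_s) + partial_r[k] for s_row, k in zip(S, fk)]

# S has 4 rows and 2 features; R has 2 rows and 3 features.
S = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
fk = [0, 1, 0, 1]  # fk[i] is the index of the R row joined to S row i
R = [[10.0, 20.0, 30.0], [40.0, 50.0, 60.0]]
w = [1.0, -1.0, 0.5, 0.25, 2.0]

print(materialized_matvec(S, fk, R, w))  # [69.0, 151.5, 69.0, 151.5]
print(factorized_matvec(S, fk, R, w))    # [69.0, 151.5, 69.0, 151.5]
```

Both functions return the same vector, but the factorized version never builds the wide joined rows and computes the R-side partial products once per R row rather than once per joined row, which is where the savings come from when many S rows share the same R row.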

Cited By

Ben Amara O, Hadouaj S, Meneghetti N (2024). StarfishDB: A Query Execution Engine for Relational Probabilistic Programming. Proceedings of the ACM on Management of Data 2:3 (1-31). Online publication date: 30-May-2024. https://dl.acm.org/doi/10.1145/3654988

Wang J, Chai C, Tang N, Liu J, Li G (2022). Coresets over multiple tables for feature-rich and data-efficient machine learning. Proceedings of the VLDB Endowment 16:1 (64-76). Online publication date: 1-Sep-2022. https://dl.acm.org/doi/10.14778/3561261.3561267

Published In

Proceedings of the VLDB Endowment, Volume 14, Issue 12

July 2021

587 pages

ISSN: 2150-8097

Editors: Xin Luna Dong (Amazon), Felix Naumann (HPI, University of Potsdam)

Publisher

VLDB Endowment
