
Empowering big data analytics with polystore and strongly typed functional queries

Authors: Annabelle Gillet, Éric Leclercq, Marinette Savonnet, and Nadine Cullot

IDEAS '20: Proceedings of the 24th Symposium on International Database Engineering & Applications

August 2020

Article No.: 13, Pages 1 - 10

https://doi.org/10.1145/3410566.3410591

Published: 25 August 2020

Abstract

Polystores are of primary importance in tackling the diversity and volume of Big Data, as they store data in the model best suited to each use case. Nevertheless, analytics frameworks often lack a uniform interface that gives full access to, and takes advantage of, the various models offered by the polystore. It should also be possible to check the typing of algebraic expressions built with data manipulation operators, and to infer their schema, before the operators are executed (type safety).
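As a rough illustration of checking an expression and inferring its schema before any operator runs, the sketch below builds a small lazy expression tree and validates it at construction time. All names (`Source`, `Select`, `Join`, `SchemaError`) are hypothetical and are not taken from the paper. The check here happens when the expression is built, which only approximates the compile-time guarantee described for the Scala implementation.

```python
from dataclasses import dataclass


class SchemaError(TypeError):
    """Raised while an expression is being built, before any data is read."""


@dataclass(frozen=True)
class Source:
    """A named data source with a declared schema (column name -> type)."""
    name: str
    columns: dict

    def schema(self):
        return dict(self.columns)


@dataclass(frozen=True)
class Select:
    """Keep rows where `column` equals `value`; the schema is unchanged."""
    child: object
    column: str
    value: object

    def __post_init__(self):
        schema = self.child.schema()
        if self.column not in schema:
            raise SchemaError(f"unknown column {self.column!r}")
        if not isinstance(self.value, schema[self.column]):
            raise SchemaError(f"value for {self.column!r} has the wrong type")

    def schema(self):
        return self.child.schema()


@dataclass(frozen=True)
class Join:
    """Equi-join on a shared column; the schema is the union of both sides."""
    left: object
    right: object
    on: str

    def __post_init__(self):
        left, right = self.left.schema(), self.right.schema()
        if self.on not in left or self.on not in right:
            raise SchemaError(f"join column {self.on!r} missing on one side")
        if left[self.on] is not right[self.on]:
            raise SchemaError(f"join column {self.on!r} has mismatched types")

    def schema(self):
        merged = self.left.schema()
        merged.update(self.right.schema())
        return merged


users = Source("users", {"user_id": int, "name": str})
tweets = Source("tweets", {"user_id": int, "hashtag": str})

# The whole expression is validated here; no data has been touched yet.
expr = Join(Select(tweets, "hashtag", "bigdata"), users, on="user_id")
inferred = expr.schema()

# An ill-typed expression is rejected as soon as it is composed.
try:
    Select(users, "hashtag", "bigdata")
except SchemaError as err:
    rejected = str(err)
```

In a statically typed language such as Scala, the same kind of check can be moved into the type system so that the ill-typed expression fails to compile rather than failing when it is constructed.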

Tensors are good candidates for a pivot data model. They are powerful abstract mathematical objects that can embed complex relationships between entities and are used in major analytics frameworks. However, they are far removed from data models and lack high-level operators to manipulate their content, which leads to bad coding habits, reduced maintainability, and sometimes poor performance.
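To make the gap between raw tensors and data models concrete, here is a minimal sketch of a sparse tensor whose dimensions carry names, with two high-level operators addressed by dimension name instead of by position. The class `NamedTensor` and its methods are assumptions made for illustration only; they are not the operators defined by the paper.

```python
from collections import defaultdict


class NamedTensor:
    """Sparse tensor with named dimensions (hypothetical sketch)."""

    def __init__(self, dims, entries):
        self.dims = tuple(dims)
        self.entries = dict(entries)  # coordinate tuple -> numeric value
        for key in self.entries:
            if len(key) != len(self.dims):
                raise ValueError(f"coordinate {key!r} does not match {self.dims!r}")

    def select(self, dim, predicate):
        """Keep entries whose coordinate on `dim` satisfies `predicate`."""
        i = self.dims.index(dim)  # raises ValueError for an unknown dimension
        kept = {k: v for k, v in self.entries.items() if predicate(k[i])}
        return NamedTensor(self.dims, kept)

    def sum_over(self, dim):
        """Remove `dim` by summing the values that share the other coordinates."""
        i = self.dims.index(dim)
        out = defaultdict(float)
        for k, v in self.entries.items():
            out[k[:i] + k[i + 1:]] += v
        return NamedTensor(self.dims[:i] + self.dims[i + 1:], out)


t = NamedTensor(
    ("user", "hashtag", "week"),
    {
        ("alice", "#spark", 1): 3.0,
        ("alice", "#scala", 1): 1.0,
        ("bob", "#spark", 2): 2.0,
    },
)

# Operators refer to dimensions by name, not by axis number.
week_one = t.select("week", lambda w: w == 1).sum_over("week")
```

Naming dimensions removes the need to remember which axis number holds which entity, one of the habits the abstract attributes to working with bare tensors.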

With TDM (Tensor Data Model), we propose to combine the best of both worlds: we take advantage of the modeling capabilities of tensors by adding a schema and data manipulation operators to them. We developed an implementation in Scala using Spark that provides users with type-safety and schema inference mechanisms, guaranteeing the technical and functional correctness of composed expressions on tensors at compile time. We show that this extension induces no overhead and that, using a bind join, it can outperform the Spark query optimizer.
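The general idea of a bind join can be sketched as follows: instead of scanning the whole remote relation and joining locally, the join keys found on the small side are pushed into the query sent to the other store. This is a generic illustration with in-memory lists standing in for two stores. The function names and data are hypothetical, and the sketch does not reproduce the paper's implementation or its measurements.

```python
# Two in-memory lists stand in for relations held in different stores.
USERS = [{"user_id": i, "name": f"user{i}"} for i in range(1000)]
TWEETS = [
    {"user_id": 3, "hashtag": "#spark"},
    {"user_id": 7, "hashtag": "#scala"},
    {"user_id": 3, "hashtag": "#tdm"},
]


def scan_users(keys=None):
    """Simulated remote scan; `keys` models a pushed-down IN (...) filter."""
    if keys is None:
        return list(USERS)
    return [row for row in USERS if row["user_id"] in keys]


def hash_join(left, right, on):
    """Plain in-memory hash join on a single column."""
    index = {}
    for row in right:
        index.setdefault(row[on], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[on], [])]


def naive_join(tweets):
    """Fetch every user, then join locally."""
    users = scan_users()
    return hash_join(tweets, users, "user_id"), len(users)


def bind_join(tweets):
    """Send only the distinct join keys, so the remote side returns matches only."""
    keys = {t["user_id"] for t in tweets}
    users = scan_users(keys)
    return hash_join(tweets, users, "user_id"), len(users)


naive_rows, naive_transferred = naive_join(TWEETS)
bind_rows, bind_transferred = bind_join(TWEETS)
```

Both strategies return the same joined rows; the difference lies in how many rows cross the boundary between stores, which is where a bind join can pay off when one side is small.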


Cited By

A. Gillet, É. Leclercq, and L. Sautot (2023). A Guide to the Tucker Tensor Decomposition for Data Mining: Exploratory Analysis, Clustering and Classification. Transactions on Large-Scale Data- and Knowledge-Centered Systems LIV, pages 56-88. Online publication date: 22-Sep-2023. https://doi.org/10.1007/978-3-662-68014-8_3

A. Guyot, É. Leclercq, A. Gillet, and N. Cullot (2023). Preventing Technical Errors in Data Lake Analyses with Type Theory. Big Data Analytics and Knowledge Discovery, pages 18-24. Online publication date: 10-Aug-2023. https://doi.org/10.1007/978-3-031-39831-5_2

Index Terms

Empowering big data analytics with polystore and strongly typed functional queries
1. Information systems
  1. Data management systems
    1. Data structures
    2. Query languages


Published In

IDEAS '20: Proceedings of the 24th Symposium on International Database Engineering & Applications

August 2020

252 pages

ISBN: 9781450375030

DOI: 10.1145/3410566

General Chairs: Bipin C. Desai (Concordia University, Canada), Wan-Sup Cho (Chungbuk National University, Korea)

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States


Conference

IDEAS 2020

IDEAS 2020: 24th International Database Engineering & Applications Symposium

August 12 - 14, 2020

Seoul, Republic of Korea

Acceptance Rates

IDEAS '20 Paper Acceptance Rate 27 of 57 submissions, 47%;

Overall Acceptance Rate 74 of 210 submissions, 35%
