DOI: 10.1145/3410566.3410591
research-article

Empowering big data analytics with polystore and strongly typed functional queries

Published: 25 August 2020

    Abstract

    Polystores are of primary importance in tackling the diversity and volume of Big Data, as they store data according to specific use cases. Nevertheless, analytics frameworks often lack a uniform interface that gives full access to, and takes advantage of, the various models offered by the polystore. It should also be ensured that algebraic expressions built with data manipulation operators can be type-checked, and that their schema can be inferred, before the operators start executing (type safety).
    Tensors are good candidates for supporting a pivot data model. They are powerful abstract mathematical objects that can embed complex relationships between entities and that are used in major analytics frameworks. However, they are far removed from data models and lack high-level operators to manipulate their content, which leads to bad coding habits, reduced maintainability, and sometimes poor performance.
    With TDM (Tensor Data Model), we propose to combine the best of both worlds: we take advantage of the modeling capabilities of tensors by adding schema and data manipulation operators to them. We developed an implementation in Scala using Spark, providing users with a type-safety and schema inference mechanism that guarantees the technical and functional correctness of composed expressions on tensors at compile time. We show that this extension induces no overhead and can outperform the Spark query optimizer by using a bind join.
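
To make the idea of compile-time schema checking concrete, here is a minimal Scala sketch. It is not the TDM API: every name in it (`Dim`, `Tensor1`, `Tensor2`, `selectSecond`, `projectFirst`, the `User` and `Year` tags) is hypothetical, and it assumes a tiny in-memory sparse tensor rather than Spark. It only illustrates how dimension names and key types can be carried in the type, so that the compiler infers the schema of a composed expression and rejects an ill-typed one.

```scala
// Illustrative sketch only: all names here are hypothetical, not the TDM API.
object TypedTensorSketch {
  // Phantom tags name dimensions at the type level; they have no values.
  sealed trait User
  sealed trait Year

  // A dimension pairs a type-level name (Tag) with the type of its keys (K).
  final case class Dim[Tag, K](label: String)

  // Order-1 tensor: one typed dimension, sparse values.
  final case class Tensor1[T, K, V](d: Dim[T, K], cells: Map[K, V])

  // Order-2 tensor: the schema (tags and key types) is part of the type.
  final case class Tensor2[T1, K1, T2, K2, V](
      d1: Dim[T1, K1],
      d2: Dim[T2, K2],
      cells: Map[(K1, K2), V]) {

    // Selection on the second dimension; the predicate must accept K2 keys.
    def selectSecond(p: K2 => Boolean): Tensor2[T1, K1, T2, K2, V] =
      Tensor2(d1, d2, cells.filter { case ((_, k2), _) => p(k2) })

    // Projection onto the first dimension, merging values with `combine`.
    // The result type shows that only the first dimension remains.
    def projectFirst(combine: (V, V) => V): Tensor1[T1, K1, V] =
      Tensor1(
        d1,
        cells.groupBy { case ((k1, _), _) => k1 }.map {
          case (k1, group) => k1 -> group.values.reduce(combine)
        })
  }

  val posts: Tensor2[User, String, Year, Int, Int] =
    Tensor2(
      Dim[User, String]("user"),
      Dim[Year, Int]("year"),
      Map(("ann", 2019) -> 3, ("ann", 2020) -> 5, ("bob", 2020) -> 2))

  // The compiler infers Tensor1[User, String, Int] for the composed expression.
  val postsSince2020 = posts.selectSecond(_ >= 2020).projectFirst(_ + _)

  // Rejected at compile time, because Year keys are Int, not String:
  // posts.selectSecond(_.startsWith("20"))

  def main(args: Array[String]): Unit =
    println(postsSince2020.cells)
}
```

In this sketch, `postsSince2020` holds the cells ann -> 5 and bob -> 2. Encoding the schema in type parameters is one simple design among several; a real implementation would need to handle tensors of arbitrary order, which this fixed-arity sketch does not attempt.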

    Cited By

    • (2023) A Guide to the Tucker Tensor Decomposition for Data Mining: Exploratory Analysis, Clustering and Classification. Transactions on Large-Scale Data- and Knowledge-Centered Systems LIV, pages 56--88. DOI: 10.1007/978-3-662-68014-8_3. Online publication date: 22-Sep-2023
    • (2023) Preventing Technical Errors in Data Lake Analyses with Type Theory. Big Data Analytics and Knowledge Discovery, pages 18--24. DOI: 10.1007/978-3-031-39831-5_2. Online publication date: 10-Aug-2023

        Published In

        IDEAS '20: Proceedings of the 24th Symposium on International Database Engineering & Applications
        August 2020
        252 pages
        ISBN:9781450375030
        DOI:10.1145/3410566
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Author Tags

        1. high performance data analytics
        2. polystore
        3. query language
        4. tensor

        Qualifiers

        • Research-article

        Conference

        IDEAS 2020

        Acceptance Rates

        IDEAS '20 Paper Acceptance Rate 27 of 57 submissions, 47%;
        Overall Acceptance Rate 74 of 210 submissions, 35%
