research-article

Unified data analytics: state-of-the-art and open problems

Authors:

Jorge-Arnulfo Quiané-Ruiz

Proceedings of the VLDB Endowment, Volume 15, Issue 12

Pages 3778 - 3781

https://doi.org/10.14778/3554821.3554898

Published: 01 August 2022

Abstract

There is an urgent need to unify data analytics as application tasks grow increasingly complex: it is now common for a single pipeline to perform data preparation, analytical processing, and machine learning operations. Despite this need, building such pipelines remains a painful process in which developers must become familiar with many data processing platforms and write ad hoc scripts to integrate them. This tutorial is motivated by this need, which arises in both academia and industry. We will discuss the importance of unifying data processing as well as current efforts to achieve it. In particular, we will introduce a classification of the cases in which an application needs or benefits from unified data analytics and discuss the challenges in each case. Along with this classification, we will present existing efforts to unify data processing, such as Apache Beam and Apache Wayang, and emphasize their differences. We will conclude with open problems and their challenges.

Cited By

Beedkar K, Contreras-Rojas B, Gavriilidis H, Kaoudi Z, Markl V, Pardo-Meza R, Quiané-Ruiz J. (2023). Apache Wayang: A Unified Data Analytics Framework. ACM SIGMOD Record 52:3 (30-35). Online publication date: 30-Oct-2023. https://dl.acm.org/doi/10.1145/3631504.3631510

Ruck C, Schüle M, Boehm M, Hulsebos M, Shankar S, Varma P. (2023). Teaching Blue Elephants the Maths for Machine Learning. Proceedings of the Seventh Workshop on Data Management for End-to-End Machine Learning (1-4). Online publication date: 18-Jun-2023. https://dl.acm.org/doi/10.1145/3595360.3595852

Published In

Proceedings of the VLDB Endowment, Volume 15, Issue 12

August 2022

551 pages

ISSN: 2150-8097

Editors: Fatma Özcan (Google), Juliana Freire (New York University), Xuemin Lin (University of New South Wales)

Publisher

VLDB Endowment
