Unified data analytics: state-of-the-art and open problems

Published: 01 August 2022

Abstract

There is an urgent need to unify data analytics as application tasks grow increasingly complex: it is now common to see tasks that perform data preparation, analytical processing, and machine learning operations in a single pipeline. Despite this need, unification remains a painful process in which developers must become familiar with many data processing platforms and write ad hoc scripts to integrate them. This tutorial is motivated by this need, which arises in both academia and industry. We will discuss the importance of unifying data processing as well as the current efforts to achieve it. In particular, we will introduce a classification of the different cases where an application needs or benefits from data analytics unification and discuss the challenges in each case. Along with this classification, we will present the efforts known to date that aim at unifying data processing, such as Apache Beam and Apache Wayang, and emphasize their differences. We will conclude with open problems and their challenges.

Cited By

  • (2023) Apache Wayang: A Unified Data Analytics Framework. ACM SIGMOD Record 52:3 (30-35). DOI: 10.1145/3631504.3631510. Online publication date: 30-Oct-2023
  • (2023) Teaching Blue Elephants the Maths for Machine Learning. Proceedings of the Seventh Workshop on Data Management for End-to-End Machine Learning (1-4). DOI: 10.1145/3595360.3595852. Online publication date: 18-Jun-2023

Published In

Proceedings of the VLDB Endowment, Volume 15, Issue 12
August 2022
551 pages
ISSN: 2150-8097

Publisher

VLDB Endowment

Qualifiers

  • Research-article

Article Metrics

  • Downloads (last 12 months): 77
  • Downloads (last 6 weeks): 8
Reflects downloads up to 30 Aug 2024
