Big Data Analytics with Datalog Queries on Spark

Authors: Alexander Shkapsky, Matteo Interlandi, Carlo Zaniolo

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data

Pages 1135 - 1149

https://doi.org/10.1145/2882903.2915229

Published: 14 June 2016

Abstract

There is great interest in exploiting the opportunity provided by cloud computing platforms for large-scale analytics. Among these platforms, Apache Spark is growing in popularity for machine learning and graph analytics. Developing efficient complex analytics in Spark requires a deep understanding of both the algorithm at hand and the Spark API or subsystem APIs (e.g., Spark SQL, GraphX). Our BigDatalog system addresses this problem by providing concise declarative specifications of complex queries that are amenable to efficient evaluation. Towards this goal, we propose compilation and optimization techniques that tackle the important problem of efficiently supporting recursion in Spark. We perform an experimental comparison with other state-of-the-art large-scale Datalog systems and verify the efficacy of our techniques and the effectiveness of Spark in supporting Datalog-based analytics.
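To make the notion of a recursive Datalog query concrete, the following is a minimal single-machine sketch in plain Python. It evaluates the textbook transitive-closure program with semi-naive evaluation, a standard technique for recursive queries. It is an illustration only: the function name `transitive_closure` and the sample data are assumptions, it does not use Spark, and it does not claim to reflect how BigDatalog compiles or optimizes such queries.

```python
def transitive_closure(edges):
    """Semi-naive evaluation of the Datalog program
         tc(X, Y) <- edge(X, Y).
         tc(X, Y) <- tc(X, Z), edge(Z, Y).
    `edges` is an iterable of (source, destination) pairs.
    """
    edges = set(edges)

    # Index edge facts by source so the join on Z is a dictionary lookup.
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, set()).add(dst)

    tc = set(edges)      # all facts derived so far (exit rule)
    delta = set(edges)   # facts that were new in the previous iteration

    while delta:
        # Apply the recursive rule to the new facts only, not to all of tc.
        derived = {(x, y) for (x, z) in delta for y in successors.get(z, ())}
        delta = derived - tc   # keep only facts not seen before
        tc |= delta
    return tc


if __name__ == "__main__":
    sample_edges = {(1, 2), (2, 3), (3, 4)}  # hypothetical input graph
    print(sorted(transitive_closure(sample_edges)))
```

The loop stops at the fixpoint, when an iteration derives no new facts. Joining only the newest facts (`delta`) against `edge` avoids re-deriving facts found in earlier iterations, which is what distinguishes semi-naive from naive evaluation.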


Index Terms

Big Data Analytics with Datalog Queries on Spark
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
      2. Parallel and distributed DBMSs
    2. Query languages

Published In

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data

June 2016

2300 pages

ISBN: 9781450335317

DOI: 10.1145/2882903

General Chairs: Fatma Özcan (IBM Research, USA), Georgia Koutrika (HP Labs, USA)

Program Chair: Sam Madden (Massachusetts Institute of Technology, USA)

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

NSF
National Institute of Biomedical Imaging and Bioengineering

Conference

SIGMOD/PODS'16: International Conference on Management of Data

Sponsor: SIGMOD

June 26 - July 1, 2016

San Francisco, California, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Bibliometrics

Total Citations: 89
Total Downloads: 1,719
Downloads (Last 12 months): 237
Downloads (Last 6 weeks): 33

Reflects downloads up to 14 Oct 2024

Cited By

Fejza A, Genevès P, Layaïda N (2024). Efficient Enumeration of Recursive Plans in Transformation-Based Query Optimizers. Proceedings of the VLDB Endowment 17:11 (3095-3108). Online publication date: 1-Jul-2024.
https://dl.acm.org/doi/10.14778/3681954.3681986

Bembenek A, Greenberg M, Chong S (2024). Making Formulog Fast: An Argument for Unconventional Datalog Evaluation. Proceedings of the ACM on Programming Languages 8:OOPSLA2 (1219-1248). Online publication date: 8-Oct-2024.
https://dl.acm.org/doi/10.1145/3689754

Zhang C, Wang L, Rigger M (2024). Finding Cross-Rule Optimization Bugs in Datalog Engines. Proceedings of the ACM on Programming Languages 8:OOPSLA1 (110-136). Online publication date: 29-Apr-2024.
https://dl.acm.org/doi/10.1145/3649815

Abo Khamis M, Ngo H, Pichler R, Suciu D, Wang Y (2024). Convergence of datalog over (Pre-) Semirings. Journal of the ACM 71:2 (1-55). Online publication date: 10-Apr-2024.
https://dl.acm.org/doi/10.1145/3643027

Shaikhha A, Suciu D, Schleich M, Ngo H (2024). Optimizing Nested Recursive Queries. Proceedings of the ACM on Management of Data 2:1 (1-27). Online publication date: 26-Mar-2024.
https://dl.acm.org/doi/10.1145/3639271

Carneiro Alves De Lima B, Kramer M, Apinis K (2023). On The Suitability of Differential Dataflow For Datalog Interpretation In Highly Dynamic Settings. Proceedings of the 2023 6th Artificial Intelligence and Cloud Computing Conference (218-225). Online publication date: 16-Dec-2023.
https://dl.acm.org/doi/10.1145/3639592.3639622

Sahebolamri A, Barrett L, Moore S, Micinski K (2023). Bring Your Own Data Structures to Datalog. Proceedings of the ACM on Programming Languages 7:OOPSLA2 (1198-1223). Online publication date: 16-Oct-2023.
https://dl.acm.org/doi/10.1145/3622840

Sun Y, Kumar S, Gilray T, Micinski K (2023). Communication-Avoiding Recursive Aggregation. 2023 IEEE International Conference on Cluster Computing (CLUSTER) (197-208). Online publication date: 31-Oct-2023.
https://doi.org/10.1109/CLUSTER52292.2023.00024

Khamis M, Ngo H, Pichler R, Suciu D, Remy Wang Y (2022). Datalog in Wonderland. ACM SIGMOD Record 51:2 (6-17). Online publication date: 29-Jul-2022.
https://dl.acm.org/doi/10.1145/3552490.3552492

Zhang H, Yu J, Zhang Y, Zhao K, Ives Z, Bonifati A, El Abbadi A (2022). Parallel Query Processing: To Separate Communication from Computation. Proceedings of the 2022 International Conference on Management of Data (1447-1461). Online publication date: 10-Jun-2022.
https://dl.acm.org/doi/10.1145/3514221.3526164
