DOI: 10.1145/2882903.2915229

Big Data Analytics with Datalog Queries on Spark

Published: 14 June 2016

Abstract

There is great interest in exploiting the opportunity provided by cloud computing platforms for large-scale analytics. Among these platforms, Apache Spark is growing in popularity for machine learning and graph analytics. Developing efficient complex analytics in Spark requires a deep understanding of both the algorithm at hand and the Spark API or subsystem APIs (e.g., Spark SQL, GraphX). Our BigDatalog system addresses this problem by providing concise declarative specifications of complex queries that are amenable to efficient evaluation. Toward this goal, we propose compilation and optimization techniques that tackle the important problem of efficiently supporting recursion in Spark. We perform an experimental comparison with other state-of-the-art large-scale Datalog systems and verify the efficacy of our techniques and the effectiveness of Spark in supporting Datalog-based analytics.
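
To make the notion of a recursive Datalog query concrete, the sketch below evaluates the classic transitive-closure program on a single machine using semi-naive evaluation, a standard bottom-up technique in which each round joins only the newly derived facts against the base relation. This is an illustration only: it is not BigDatalog's implementation and uses no Spark API, and the names `transitive_closure`, `successors`, `total`, and `delta` are hypothetical.

```python
def transitive_closure(edges):
    """Semi-naive evaluation of the Datalog program:

        tc(X, Y) <- edge(X, Y).
        tc(X, Y) <- tc(X, Z), edge(Z, Y).

    Illustrative single-machine sketch; not the BigDatalog algorithm.
    """
    # Index the base relation by source node so the join on Z is a lookup.
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, set()).add(dst)

    total = set(edges)   # all tc facts derived so far (exit rule)
    delta = set(edges)   # facts that were new in the previous round

    # Iterate until a round derives nothing new (the fixpoint).
    while delta:
        # Apply the recursive rule to the new facts only.
        derived = {(x, y) for (x, z) in delta for y in successors.get(z, ())}
        delta = derived - total
        total |= delta
    return total


edges = [(1, 2), (2, 3), (3, 4)]
closure = transitive_closure(edges)
print(sorted(closure))
# prints [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
```

Restricting each round to `delta` avoids re-deriving facts already found in earlier rounds, and the loop terminates on cyclic graphs because `total` can only grow within a finite set of node pairs.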

Information & Contributors

Information

Published In

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data
June 2016
2300 pages
ISBN: 9781450335317
DOI: 10.1145/2882903

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. datalog
  2. monotonic aggregates
  3. recursive queries
  4. spark

Qualifiers

  • Research-article

Funding Sources

  • NSF
  • National Institute of Biomedical Imaging and Bioengineering

Conference

SIGMOD/PODS'16: International Conference on Management of Data
June 26 - July 1, 2016
San Francisco, California, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 237
  • Downloads (Last 6 weeks): 33
Reflects downloads up to 14 Oct 2024

Citations

Cited By

  • (2024) Efficient Enumeration of Recursive Plans in Transformation-Based Query Optimizers. Proceedings of the VLDB Endowment 17:11 (3095-3108). DOI: 10.14778/3681954.3681986. Online publication date: 1-Jul-2024
  • (2024) Making Formulog Fast: An Argument for Unconventional Datalog Evaluation. Proceedings of the ACM on Programming Languages 8:OOPSLA2 (1219-1248). DOI: 10.1145/3689754. Online publication date: 8-Oct-2024
  • (2024) Finding Cross-Rule Optimization Bugs in Datalog Engines. Proceedings of the ACM on Programming Languages 8:OOPSLA1 (110-136). DOI: 10.1145/3649815. Online publication date: 29-Apr-2024
  • (2024) Convergence of datalog over (Pre-) Semirings. Journal of the ACM 71:2 (1-55). DOI: 10.1145/3643027. Online publication date: 10-Apr-2024
  • (2024) Optimizing Nested Recursive Queries. Proceedings of the ACM on Management of Data 2:1 (1-27). DOI: 10.1145/3639271. Online publication date: 26-Mar-2024
  • (2023) On The Suitability of Differential Dataflow For Datalog Interpretation In Highly Dynamic Settings. Proceedings of the 2023 6th Artificial Intelligence and Cloud Computing Conference (218-225). DOI: 10.1145/3639592.3639622. Online publication date: 16-Dec-2023
  • (2023) Bring Your Own Data Structures to Datalog. Proceedings of the ACM on Programming Languages 7:OOPSLA2 (1198-1223). DOI: 10.1145/3622840. Online publication date: 16-Oct-2023
  • (2023) Communication-Avoiding Recursive Aggregation. 2023 IEEE International Conference on Cluster Computing (CLUSTER) (197-208). DOI: 10.1109/CLUSTER52292.2023.00024. Online publication date: 31-Oct-2023
  • (2022) Datalog in Wonderland. ACM SIGMOD Record 51:2 (6-17). DOI: 10.1145/3552490.3552492. Online publication date: 29-Jul-2022
  • (2022) Parallel Query Processing: To Separate Communication from Computation. Proceedings of the 2022 International Conference on Management of Data (1447-1461). DOI: 10.1145/3514221.3526164. Online publication date: 10-Jun-2022
