Distributed SociaLite: a Datalog-based language for large-scale graph analysis

Published: 01 September 2013

Abstract

Large-scale graph analysis is becoming important with the rise of world-wide social network services. Recently in SociaLite, we proposed extensions to Datalog to efficiently and succinctly implement graph analysis programs on sequential machines. This paper describes novel extensions and optimizations of SociaLite for parallel and distributed executions to support large-scale graph analysis.
With distributed SociaLite, programmers simply annotate how data are to be distributed; the necessary communication is then inferred automatically to generate parallel code for a cluster of multi-core machines. It optimizes the evaluation of recursive monotone aggregate functions using a delta-stepping technique. In addition, SociaLite supports approximate computation, allowing programmers to trade accuracy for less time and space.
We evaluated SociaLite with six core graph algorithms used in many social network analyses. Our experiment with 64 Amazon EC2 8-core instances shows that SociaLite programs performed within a factor of two of ideal weak scaling. Compared to optimized Giraph, an open-source alternative to Pregel, SociaLite programs are 4 to 12 times faster across the benchmark algorithms, and 22 times more succinct on average.
As a declarative query language, SociaLite, with the help of a compiler that generates efficient parallel and approximate code, can be used easily to create many social apps that operate on large-scale distributed graphs.
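To make the delta-stepping idea mentioned above more concrete, the following is a minimal sequential sketch in Python, not the SociaLite implementation. It assumes single-source shortest paths as the recursive monotone aggregate (a `min` over path lengths) and omits the light/heavy edge split of the full algorithm. The function name `delta_stepping_sssp`, the adjacency-list format, and the `delta` parameter are illustrative choices, not names from the paper.

```python
from collections import defaultdict
import math


def delta_stepping_sssp(edges, source, delta):
    """Simplified, sequential delta-stepping sketch (hypothetical helper).

    edges:  dict mapping node -> list of (neighbor, non-negative weight)
    source: start node
    delta:  bucket width, must be > 0
    """
    dist = defaultdict(lambda: math.inf)
    buckets = defaultdict(set)  # bucket index -> nodes whose distance falls in it

    def relax(node, new_dist):
        # Monotone min aggregate: a stored distance only ever decreases.
        old = dist[node]
        if new_dist < old:
            if old != math.inf:
                buckets[int(old // delta)].discard(node)
            dist[node] = new_dist
            buckets[int(new_dist // delta)].add(node)

    relax(source, 0.0)
    while any(buckets.values()):
        # Pick the lowest non-empty bucket and drain it until it stops refilling.
        i = min(b for b, nodes in buckets.items() if nodes)
        while buckets[i]:
            frontier = buckets[i]
            buckets[i] = set()
            for u in frontier:
                for v, w in edges.get(u, []):
                    relax(v, dist[u] + w)
    return dict(dist)


if __name__ == "__main__":
    graph = {"a": [("b", 1), ("c", 5)], "b": [("c", 1)], "c": []}
    print(delta_stepping_sssp(graph, "a", delta=2))
```

Nodes in the same bucket can be relaxed in any order, which is what makes the technique attractive for parallel evaluation; a small `delta` behaves more like Dijkstra's algorithm, a large one more like Bellman-Ford.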

Published In

Proceedings of the VLDB Endowment, Volume 6, Issue 14
September 2013
384 pages

Publisher

VLDB Endowment
