research-article

UDA-GIST: an in-database framework to unify data-parallel and state-parallel analytics

Editors: Chen Li, Volker Markl

Authors: Daisy Zhe Wang, Christopher Dudley

Proceedings of the VLDB Endowment, Volume 8, Issue 5

Pages 557 - 568

https://doi.org/10.14778/2735479.2735488

Published: 01 January 2015

Abstract

Enterprise applications need sophisticated in-database analytics in addition to the traditional online analytical processing that a database provides. To meet customers' pressing demands, database vendors have been pushing advanced analytical techniques into databases. Most major DBMSes offer the User-Defined Aggregate (UDA), a data-driven operator, to implement many of these analytical techniques in parallel. However, UDAs cannot be used to implement statistical algorithms such as Markov chain Monte Carlo (MCMC), where most of the work is performed by iterative transitions over a large state that cannot be naively partitioned because of data dependencies. Typically, this type of statistical algorithm requires pre-processing to set up the large state in the first place and demands post-processing after the statistical inference. This paper presents General Iterative State Transition (GIST), a new database operator for parallel iterative state transitions over large states. GIST receives a state constructed by a UDA and then performs rounds of transitions on the state until it converges. A final UDA performs post-processing and result extraction. We argue that the combination of UDA and GIST (UDA-GIST) unifies data-parallel and state-parallel processing in a single system, thus significantly extending the analytical capabilities of DBMSes. We exemplify the framework through two high-profile applications: cross-document coreference and image denoising. We show that the in-database framework allows us to tackle a problem 27 times larger than that solved by the state of the art for the first application and achieves a 43-fold speedup over the state of the art for the second application.
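The three-stage flow described in the abstract (a UDA builds the state, GIST iterates over it until convergence, a final step extracts results) can be illustrated with a minimal, single-threaded Python sketch. Everything here is an assumption made for illustration: the names `BuildGridUDA`, `gist`, `icm_round`, `extract_rows`, and `run_pipeline` are hypothetical and are not the paper's API, and a simple deterministic local-update rule (iterated conditional modes on a binary grid) stands in for the statistical inference. The sketch shows only the control flow, not the parallel execution the paper targets.

```python
from dataclasses import dataclass, field


@dataclass
class BuildGridUDA:
    """Hypothetical UDA-style aggregate: folds (row, col, value) tuples into one state."""
    cells: dict = field(default_factory=dict)

    def accumulate(self, row, col, value):
        self.cells[(row, col)] = value

    def merge(self, other):
        # Combine partial states built over disjoint data partitions.
        self.cells.update(other.cells)

    def finalize(self):
        return dict(self.cells)


def gist(state, transition, max_rounds=100):
    """Hypothetical GIST-style loop: run rounds of transitions until nothing changes."""
    for rounds in range(1, max_rounds + 1):
        if not transition(state):
            return state, rounds
    return state, max_rounds


def icm_round(observed, weight=1.0):
    """Build a transition that performs one in-place sweep over a +1/-1 grid."""
    def transition(state):
        changed = False
        for (r, c) in sorted(state):
            around = ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
            neighbours = sum(state[n] for n in around if n in state)
            # Each cell depends on its neighbours, which is why the state
            # cannot be naively partitioned like ordinary aggregate input.
            new = 1 if observed[(r, c)] + weight * neighbours >= 0 else -1
            if new != state[(r, c)]:
                state[(r, c)] = new
                changed = True
        return changed
    return transition


def extract_rows(state):
    """Post-processing step: turn the converged state back into rows."""
    n_rows = 1 + max(r for r, _ in state)
    n_cols = 1 + max(c for _, c in state)
    return [[state[(r, c)] for c in range(n_cols)] for r in range(n_rows)]


def run_pipeline(partitions):
    # Stage 1 (data-parallel in spirit): one aggregate per partition, then merge.
    partials = []
    for part in partitions:
        uda = BuildGridUDA()
        for row, col, value in part:
            uda.accumulate(row, col, value)
        partials.append(uda)
    merged = partials[0]
    for other in partials[1:]:
        merged.merge(other)
    observed = merged.finalize()
    # Stage 2: iterate over a working copy of the state until it converges.
    state, rounds = gist(dict(observed), icm_round(observed))
    # Stage 3: extract the result.
    return extract_rows(state), rounds


# A 3x3 image of +1 pixels with the centre pixel flipped to -1 as noise.
noisy = [(r, c, -1 if (r, c) == (1, 1) else 1) for r in range(3) for c in range(3)]
rows, rounds = run_pipeline([noisy[:4], noisy[4:]])
print(rows, rounds)
```

In this toy run the flipped centre pixel is restored during the first sweep, and a second sweep that changes nothing signals convergence. The split between `accumulate`/`merge`/`finalize` and the `gist` loop mirrors the abstract's distinction between data-driven aggregation and iterative transitions over a shared state.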



Information

Published in: Proceedings of the VLDB Endowment, Volume 8, Issue 5 (January 2015), 181 pages
ISSN: 2150-8097
Editors: Chen Li (University of California, Irvine), Volker Markl (TU Berlin)
Publisher: VLDB Endowment


