UDA-GIST: an in-database framework to unify data-parallel and state-parallel analytics

Published: 01 January 2015

Abstract

Enterprise applications need sophisticated in-database analytics in addition to traditional online analytical processing from a database. To meet customers' pressing demands, database vendors have been pushing advanced analytical techniques into databases. Most major DBMSes offer the User-Defined Aggregate (UDA), a data-driven operator, to implement many of these analytical techniques in parallel. However, UDAs cannot be used to implement statistical algorithms such as Markov chain Monte Carlo (MCMC), where most of the work is performed by iterative transitions over a large state that cannot be naively partitioned because of data dependencies. Typically, this type of statistical algorithm requires pre-processing to set up the large state in the first place and demands post-processing after the statistical inference. This paper presents General Iterative State Transition (GIST), a new database operator for parallel iterative state transitions over large states. GIST receives a state constructed by a UDA and then performs rounds of transitions on the state until it converges. A final UDA performs post-processing and result extraction. We argue that the combination of UDA and GIST (UDA-GIST) unifies data-parallel and state-parallel processing in a single system, thus significantly extending the analytical capabilities of DBMSes. We exemplify the framework through two high-profile applications: cross-document coreference and image denoising. We show that the in-database framework allows us to tackle a problem 27 times larger than that solved by the state of the art for the first application and achieves a 43-fold speedup over the state of the art for the second application.
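
The three stages described in the abstract (a UDA that builds the state, a GIST step that iterates until convergence, and a final post-processing step) can be pictured with a small toy. The sketch below is illustrative only and is not the paper's API: the names `BuildStateUDA`, `gist`, `transition`, `extract_runs`, and `run_pipeline` are hypothetical, the "database" is a Python list of `(position, value)` rows, and the transition is a simple deterministic majority vote over neighbouring cells rather than MCMC or belief propagation.

```python
from dataclasses import dataclass, field


@dataclass
class BuildStateUDA:
    """Data-parallel step (hypothetical): aggregate (position, value) rows into one state."""
    cells: dict = field(default_factory=dict)

    def accumulate(self, row):
        pos, value = row
        self.cells[pos] = value

    def merge(self, other):
        # Combine partial aggregates that were computed on different data chunks.
        self.cells.update(other.cells)

    def terminate(self):
        # Emit the state as a list ordered by position.
        return [self.cells[i] for i in sorted(self.cells)]


def transition(state, i):
    """One local transition: cell i becomes the strict majority of itself and its neighbours."""
    window = state[max(0, i - 1): i + 2]
    return 1 if sum(window) * 2 > len(window) else 0


def gist(state, max_rounds=50):
    """State-parallel step (toy): run rounds of transitions until the state stops changing."""
    for round_no in range(1, max_rounds + 1):
        # Every cell reads only the previous round's state, so the cells of a
        # round are independent of each other and could be updated in parallel.
        new_state = [transition(state, i) for i in range(len(state))]
        if new_state == state:
            return state, round_no
        state = new_state
    return state, max_rounds


def extract_runs(state):
    """Post-processing step: summarise the converged state as (value, run_length) pairs."""
    runs = []
    for v in state:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs


def run_pipeline(rows, n_chunks=2):
    # Simulate data-parallel aggregation: one UDA instance per chunk, then merge.
    partials = []
    for c in range(n_chunks):
        uda = BuildStateUDA()
        for row in rows[c::n_chunks]:
            uda.accumulate(row)
        partials.append(uda)
    head = partials[0]
    for p in partials[1:]:
        head.merge(p)
    state, rounds = gist(head.terminate())
    return extract_runs(state), rounds
```

Calling `run_pipeline(list(enumerate(noisy_bits)))` on a list of 0/1 values returns the smoothed runs together with the number of rounds executed. The point of the sketch is the division of labour: the first step is driven by the input rows, while the second is driven by the state and repeats until a fixed point is reached.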

Published In

Proceedings of the VLDB Endowment, Volume 8, Issue 5
January 2015
181 pages
ISSN: 2150-8097

Publisher

VLDB Endowment

Qualifiers

  • Research-article

Cited By

  • (2021) Not black-box anymore! Proceedings of the VLDB Endowment 14:12 (2959-2971). DOI: 10.14778/3476311.3476375. Online publication date: 28-Oct-2021
  • (2021) Distributed deep learning on data systems. Proceedings of the VLDB Endowment 14:10 (1769-1782). DOI: 10.14778/3467861.3467867. Online publication date: 1-Jun-2021
  • (2019) Fast Approximate Score Computation on Large-Scale Distributed Data for Learning Multinomial Bayesian Networks. ACM Transactions on Knowledge Discovery from Data 13:2 (1-40). DOI: 10.1145/3301304. Online publication date: 13-Mar-2019
  • (2017) Archimedes. ACM SIGMOD Record 46:2 (30-35). DOI: 10.1145/3137586.3137592. Online publication date: 1-Sep-2017
  • (2017) In-database batch and query-time inference over probabilistic graphical models using UDA-GIST. The VLDB Journal 26:2 (177-201). DOI: 10.1007/s00778-016-0446-1. Online publication date: 1-Apr-2017
  • (2016) ArchimedesOne. Proceedings of the VLDB Endowment 9:13 (1461-1464). DOI: 10.14778/3007263.3007284. Online publication date: 1-Sep-2016
  • (2016) Ontological Pathfinding. Proceedings of the 2016 International Conference on Management of Data (835-846). DOI: 10.1145/2882903.2882954. Online publication date: 26-Jun-2016
  • (2016) ENFrame. ACM Transactions on Database Systems 41:1 (1-44). DOI: 10.1145/2877205. Online publication date: 18-Mar-2016
  • (2016) From NoSQL Accumulo to NewSQL Graphulo: Design and utility of graph algorithms inside a BigTable database. 2016 IEEE High Performance Extreme Computing Conference (HPEC) (1-9). DOI: 10.1109/HPEC.2016.7761577. Online publication date: Sep-2016
  • (2016) ScaLeKB. The VLDB Journal 25:6 (893-918). DOI: 10.1007/s00778-016-0444-3. Online publication date: 1-Dec-2016
