A performance comparison of parallel DBMSs and MapReduce on large-scale text analytics

Authors:

Meichun Hsu

EDBT '13: Proceedings of the 16th International Conference on Extending Database Technology

March 2013

Pages 613 - 624

https://doi.org/10.1145/2452376.2452448

Published: 18 March 2013

Abstract

Text analytics has become increasingly important with the rapid growth of text data. In particular, information extraction (IE), which extracts structured data from text, has received significant attention. Unfortunately, IE is often computationally intensive. To address this issue, MapReduce has been used for large-scale IE. Recently, efforts have emerged in both academia and industry to push IE inside DBMSs. This leads to an interesting and important question: given that both MapReduce and parallel DBMSs target large-scale analytics, which platform is the better choice for large-scale IE? In this paper, we propose a benchmark to systematically study the performance of both platforms on large-scale IE tasks. The benchmark includes both statistical-learning-based and rule-based IE programs, which have been used extensively in real-world IE tasks. We show how to express these programs on both platforms and conduct experiments on real-world datasets. Our results show that parallel DBMSs are a viable alternative for large-scale IE.
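To make the idea of a rule-based IE program expressed in the MapReduce style concrete, here is a minimal single-process sketch. It is not one of the paper's benchmark programs: the extraction rule `BORN_RULE`, the helper names `map_extract`, `reduce_merge`, and `run_job`, and the toy corpus are all hypothetical, and the shuffle phase is simulated with an in-memory dictionary rather than a real cluster.

```python
import re
from collections import defaultdict

# Hypothetical extraction rule (an assumption, not taken from the paper):
# a capitalized name followed by "was born in <four-digit year>".
BORN_RULE = re.compile(r"([A-Z][a-z]+(?: [A-Z][a-z]+)*) was born in (\d{4})")


def map_extract(doc_id, text):
    """Map step: apply the rule to one document and emit (key, value) pairs."""
    for match in BORN_RULE.finditer(text):
        person, year = match.group(1), int(match.group(2))
        yield person, (year, doc_id)


def reduce_merge(person, values):
    """Reduce step: collapse all mentions of one person into a single record."""
    years = sorted({year for year, _ in values})
    docs = sorted({doc_id for _, doc_id in values})
    return {"person": person, "birth_years": years, "docs": docs}


def run_job(corpus):
    """Single-process stand-in for the shuffle phase: group map output by key."""
    groups = defaultdict(list)
    for doc_id, text in corpus.items():
        for key, value in map_extract(doc_id, text):
            groups[key].append(value)
    return [reduce_merge(key, groups[key]) for key in sorted(groups)]


# Toy corpus, invented for illustration only.
corpus = {
    "d1": "Ada Lovelace was born in 1815. She worked with Charles Babbage.",
    "d2": "Records state that Ada Lovelace was born in 1815 in London.",
    "d3": "Alan Turing was born in 1912.",
}
records = run_job(corpus)
```

In a parallel DBMS, the same rule would roughly correspond to a pattern-matching function applied per row, followed by a grouping step on the extracted key. How the paper actually expresses its programs on each platform is not stated in the abstract.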

Published In

EDBT '13: Proceedings of the 16th International Conference on Extending Database Technology

March 2013, 793 pages

ISBN: 9781450315975

DOI: 10.1145/2452376

General Chair: Giovanna Guerrini (Università di Genova, Italy)

Program Chair: Norman W. Paton (University of Manchester, UK)

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

EDBT/ICDT '13: Joint 2013 EDBT/ICDT Conferences

March 18 - 22, 2013

Genoa, Italy

Acceptance Rates

Overall Acceptance Rate 7 of 10 submissions, 70%
