DOI: 10.1145/2452376.2452448
research-article

A performance comparison of parallel DBMSs and MapReduce on large-scale text analytics

Published: 18 March 2013

    Abstract

    Text analytics has become increasingly important with the rapid growth of text data. In particular, information extraction (IE), which extracts structured data from text, has received significant attention. Unfortunately, IE is often computationally intensive. To address this issue, MapReduce has been used for large-scale IE. Recently, efforts have emerged in both academia and industry to push IE inside DBMSs. This raises an interesting and important question: given that both MapReduce and parallel DBMSs target large-scale analytics, which platform is the better choice for large-scale IE? In this paper, we propose a benchmark to systematically study the performance of both platforms on large-scale IE tasks. The benchmark includes both statistical-learning-based and rule-based IE programs, which have been used extensively in real-world IE tasks. We show how to express these programs on both platforms and conduct experiments on real-world datasets. Our results show that parallel DBMSs are a viable alternative for large-scale IE.

    Published In

    EDBT '13: Proceedings of the 16th International Conference on Extending Database Technology
    March 2013
    793 pages
    ISBN: 9781450315975
    DOI: 10.1145/2452376
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • Research-article

    Conference

    EDBT/ICDT '13

    Acceptance Rates

    Overall Acceptance Rate 7 of 10 submissions, 70%

    Cited By

    • (2016) CLUSTERING OF SUMMARIZING MULTI-DOCUMENTS (LARGE DATA) BY USING MAPREDUCE FRAMEWORK. i-manager’s Journal on Cloud Computing 3:1 (1). DOI: 10.26634/jcc.3.1.8073. Online publication date: 2016
    • (2016) Large-scale geographically weighted regression on Spark. 2016 Eighth International Conference on Knowledge and Systems Engineering (KSE), pp. 127-132. DOI: 10.1109/KSE.2016.7758041. Online publication date: Oct-2016
    • (2014) ELTA: New Approach in Designing Business Intelligence Solutions in Era of Big Data. Procedia Technology 16, pp. 667-674. DOI: 10.1016/j.protcy.2014.10.015. Online publication date: 2014
    • (2013) Live Analytics Service Platform. Enabling Real-Time Business Intelligence, pp. 109-117. DOI: 10.1007/978-3-642-39872-8_8. Online publication date: 2013
