
I4E: interactive investigation of iterative information extraction

Authors: Anish Das Sarma, Divesh Srivastava

SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data

Pages 795 - 806

https://doi.org/10.1145/1807167.1807253

Published: 06 June 2010

Abstract

Information extraction systems are increasingly being used to mine structured information from unstructured text documents. A commonly used unsupervised technique is to build iterative information extraction (IIE) systems that learn task-specific rules, called patterns, to generate the desired tuples. Often, the output of an information extraction system contains unexpected results, which may be due to an incorrect pattern, an incorrect tuple, or both. In such scenarios, users and developers of the extraction system could greatly benefit from an investigation tool that quickly helps them reason about and repair the output.
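To make the pattern and tuple loop concrete, the following is a minimal toy sketch in Python. It is not the system studied in the paper: the corpus, the seed tuple, the function name `run_iie`, and the choice of "the text between the two values" as a pattern are all assumptions made purely for illustration.

```python
import re

def run_iie(corpus, seeds, iterations=2):
    """Toy bootstrapping loop (illustrative only): tuples -> patterns -> tuples."""
    tuples = set(seeds)
    patterns = set()
    for _ in range(iterations):
        # Learn patterns: the text found between the two values of a known tuple.
        for a, b in list(tuples):
            for sentence in corpus:
                m = re.search(re.escape(a) + r"(.+?)" + re.escape(b), sentence)
                if m:
                    patterns.add(m.group(1))
        # Apply patterns: every "<word><pattern><word>" match yields a tuple.
        for p in list(patterns):
            rx = re.compile(r"(\w+)" + re.escape(p) + r"(\w+)")
            for sentence in corpus:
                for a, b in rx.findall(sentence):
                    tuples.add((a, b))
    return tuples, patterns

# Hypothetical four-sentence corpus and a single seed for a "capital of" relation.
corpus = [
    "Paris is the capital of France.",
    "Rome is the capital of Italy.",
    "Rome, located in Italy, is old.",
    "Pisa, located in Italy, is small.",
]
tuples, patterns = run_iie(corpus, {("Paris", "France")})
print(sorted(tuples))
print(sorted(patterns))
```

In this toy run, the second iteration learns the overly general pattern ", located in " from the correct tuple (Rome, Italy) and then extracts (Pisa, Italy), which is not a capital relation. This is the kind of unexpected result, caused by an incorrect pattern producing an incorrect tuple, that the abstract refers to.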

In this paper, we develop an approach for interactive post-extraction investigation of IIE systems. We formalize three important phases of this investigation: explain the IIE result, diagnose the influential and problematic components, and repair the output of the information extraction system. We show how to characterize the execution of an IIE system and build a suite of algorithms to answer questions pertaining to each of these phases. We experimentally evaluate our proposed approach on several domains over a Web corpus of about 500 million documents. We show that our approach effectively enables post-extraction investigation, while maximizing the gain from user and developer interaction.
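The three phases can be illustrated with a small sketch over a recorded derivation graph. This is only one plausible reading of the abstract, not the paper's algorithms: the edge log `EDGES`, the node naming scheme, and the functions `explain`, `influence`, and `repair` are hypothetical, and plain graph reachability is used as a stand-in for whatever measures the paper actually defines.

```python
from collections import defaultdict

# Hypothetical execution log: (source, derived) edges recorded while an IIE
# system ran. "t:" nodes are tuples, "p:" nodes are patterns.
EDGES = [
    ("t:Paris|France", "p:is the capital of"),
    ("p:is the capital of", "t:Rome|Italy"),
    ("t:Rome|Italy", "p:located in"),
    ("p:located in", "t:Pisa|Italy"),
    ("p:located in", "t:Lyon|France"),
]
SEEDS = {"t:Paris|France"}

def build(edges):
    """Index the log as parent and child adjacency maps."""
    parents, children = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        parents[dst].add(src)
        children[src].add(dst)
    return parents, children

def _reach(start, adj):
    """All nodes reachable from `start` by following `adj` (start excluded)."""
    seen, stack = set(), [start]
    while stack:
        for nxt in adj[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def explain(node, parents):
    """Explain: every pattern and tuple that contributed to `node`."""
    return _reach(node, parents)

def influence(node, children):
    """Diagnose: how many tuples depend, directly or indirectly, on `node`."""
    return sum(1 for n in _reach(node, children) if n.startswith("t:"))

def repair(bad, edges, seeds):
    """Repair: drop `bad`, keep only what is still derivable from the seeds."""
    kept = [(s, d) for s, d in edges if bad not in (s, d)]
    _, children = build(kept)
    alive = set(seeds) - {bad}
    for seed in list(alive):
        alive |= _reach(seed, children)
    return alive

parents, children = build(EDGES)
print(sorted(explain("t:Pisa|Italy", parents)))
print(influence("p:located in", children))
print(sorted(repair("p:located in", EDGES, SEEDS)))
```

Note that this repair keeps a node as long as any derivation path from a seed survives. A real system might instead require stronger support before retaining a tuple or pattern; that design choice is outside what the abstract specifies.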

Index Terms

Information systems


Published In

SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data

June 2010, 1286 pages

ISBN: 9781450300322

DOI: 10.1145/1807167

General Chair: Ahmed Elmagarmid (Purdue University, USA)

Program Chair: Divyakant Agrawal (University of California at Santa Barbara, USA)

Copyright © 2010 ACM.


Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

Research-article

Conference

SIGMOD/PODS '10: International Conference on Management of Data

June 6 - 10, 2010

Indianapolis, Indiana, USA

Sponsor: SIGMOD

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

