Scalable ad-hoc entity extraction from text collections

Published: 01 August 2008

Abstract

Entity extraction from large document collections enables a variety of data analysis tasks. In this paper, we introduce the "ad-hoc" entity extraction task, in which the entities of interest are constrained to come from a list that is specific to the task. In such scenarios, traditional entity extraction techniques, which process every document for each ad-hoc extraction task, can be very expensive. We propose an efficient approach that uses the inverted index on the documents to identify the subset of documents relevant to the task and processes only those documents. We demonstrate the efficiency of our techniques on real datasets.
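
The idea of index-based filtering can be illustrated with a minimal sketch. It is not the paper's algorithm: the tokenizer, the in-memory index, the exact-phrase verification step, and all function names (`build_inverted_index`, `candidate_docs`, `extract`) are assumptions made for illustration. The sketch looks up each entity's tokens in an inverted index, intersects the posting lists to get candidate documents, and runs the more expensive matching step only on those candidates.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each token to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def candidate_docs(index, entity):
    """Intersect posting lists: ids of documents containing every token of the entity."""
    tokens = tokenize(entity)
    if not tokens:
        return set()
    return set.intersection(*(index.get(t, set()) for t in tokens))


def contains_phrase(doc_tokens, entity_tokens):
    """Verification step: the entity tokens must appear contiguously and in order."""
    n = len(entity_tokens)
    return any(doc_tokens[i:i + n] == entity_tokens
               for i in range(len(doc_tokens) - n + 1))


def extract(docs, index, entities):
    """Return (entity -> ids of documents mentioning it, ids of documents actually read)."""
    mentions = defaultdict(set)
    processed = set()
    for entity in entities:
        entity_tokens = tokenize(entity)
        for doc_id in candidate_docs(index, entity):
            processed.add(doc_id)  # only candidate documents are ever tokenized again
            if contains_phrase(tokenize(docs[doc_id]), entity_tokens):
                mentions[entity].add(doc_id)
    return dict(mentions), processed


# Hypothetical toy collection and entity list.
docs = {
    1: "The Canon EOS 400D ships next week.",
    2: "EOS is a goddess; Canon is a rule. 400D?",
    3: "Nothing relevant here.",
    4: "Sony announced a new camera.",
}
entities = ["Canon EOS 400D", "Sony Alpha"]

index = build_inverted_index(docs)  # built once, reused across ad-hoc tasks
mentions, processed = extract(docs, index, entities)
print(mentions)
print(processed)
```

In this toy run only documents 1 and 2 are read for verification: document 2 contains all three tokens of the first entity but not as a phrase, so it is a candidate that is then rejected, while documents 3 and 4 are never processed. The index is built once and reused, which is where the saving over scanning every document per task would come from.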

Published In

Proceedings of the VLDB Endowment  Volume 1, Issue 1
August 2008
1216 pages

Publisher

VLDB Endowment

Qualifiers

  • Research-article

Cited By

  • (2022) JENNER. Proceedings of the VLDB Endowment 15:11 (2666-2678). DOI: 10.14778/3551793.3551822. Online publication date: 1-Jul-2022
  • (2020) Subjective Search Intent Predictions using Customer Reviews. Proceedings of the 2020 Conference on Human Information Interaction and Retrieval (303-307). DOI: 10.1145/3343413.3377987. Online publication date: 14-Mar-2020
  • (2017) Natural Language Data Management and Interfaces. Proceedings of the 2017 ACM International Conference on Management of Data (1765-1770). DOI: 10.1145/3035918.3054783. Online publication date: 9-May-2017
  • (2016) Local Similarity Search for Unstructured Text. Proceedings of the 2016 International Conference on Management of Data (1991-2005). DOI: 10.1145/2882903.2915211. Online publication date: 26-Jun-2016
  • (2015) A unified framework for approximate dictionary-based entity extraction. The VLDB Journal — The International Journal on Very Large Data Bases 24:1 (143-167). DOI: 10.1007/s00778-014-0367-9. Online publication date: 1-Feb-2015
  • (2014) Extending string similarity join to tolerant fuzzy token matching. ACM Transactions on Database Systems 39:1 (1-45). DOI: 10.1145/2535628. Online publication date: 6-Jan-2014
  • (2013) A partition-based method for string similarity joins with edit-distance constraints. ACM Transactions on Database Systems 38:2 (1-33). DOI: 10.1145/2487259.2487261. Online publication date: 4-Jul-2013
  • (2012) A framework for robust discovery of entity synonyms. Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining (1384-1392). DOI: 10.1145/2339530.2339743. Online publication date: 12-Aug-2012
  • (2012) Trie-join. The VLDB Journal — The International Journal on Very Large Data Bases 21:4 (437-461). DOI: 10.1007/s00778-011-0252-8. Online publication date: 1-Aug-2012
  • (2011) Pass-join. Proceedings of the VLDB Endowment 5:3 (253-264). DOI: 10.14778/2078331.2078340. Online publication date: 1-Nov-2011
