DOI: 10.1145/2882903.2912567
research-article

Automatic Entity Recognition and Typing in Massive Text Data

Published: 26 June 2016

Abstract

In today's computerized and information-based society, individuals are constantly presented with vast amounts of text data, ranging from news articles, scientific publications, and product reviews to a wide variety of textual information from social media. To extract value from these large, multi-domain pools of text, it is important to understand the entities they mention and the relationships among those entities. In this tutorial, we introduce data-driven methods to recognize typed entities of interest in massive, domain-specific text corpora. These methods can automatically identify token spans as entity mentions in documents and label their fine-grained types (e.g., people, product, and food) in a scalable way. Since these methods do not rely on annotated data, predefined typing schemas, or hand-crafted features, they can be quickly adapted to new domains, genres, and languages. We demonstrate on real datasets spanning various genres (e.g., news articles, discussion forum posts, and tweets), domains (general vs. biomedical), and languages (e.g., English, Chinese, Arabic, and even low-resource languages such as Hausa and Yoruba) how these typed entities aid in knowledge discovery and management.
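To make the task in the abstract concrete, the following is a minimal sketch of what "identifying token spans as entity mentions and labeling their types" produces as output. It is not the method presented in the tutorial: it swaps in plain greedy longest-match lookup against a tiny seed dictionary, and the dictionary contents, the `SEED_TYPES` name, and the `tag_mentions` function are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical seed dictionary (an assumption for this sketch): lowercase
# token tuples mapped to a type label. A real system would draw such seeds
# from a knowledge base rather than a hand-written table.
SEED_TYPES: Dict[Tuple[str, ...], str] = {
    ("barack", "obama"): "person",
    ("iphone",): "product",
    ("pad", "thai"): "food",
}

# Longest phrase in the dictionary, used to bound the span search.
MAX_SPAN = max(len(key) for key in SEED_TYPES)


def tag_mentions(tokens: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, type) triples; `end` is exclusive."""
    mentions: List[Tuple[int, int, str]] = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first so multi-token names win.
        for n in range(min(MAX_SPAN, len(tokens) - i), 0, -1):
            key = tuple(tok.lower() for tok in tokens[i:i + n])
            if key in SEED_TYPES:
                mentions.append((i, i + n, SEED_TYPES[key]))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return mentions


if __name__ == "__main__":
    sentence = "Barack Obama ordered pad thai on his iPhone".split()
    for start, end, label in tag_mentions(sentence):
        print(" ".join(sentence[start:end]), "->", label)
```

Exact dictionary lookup of this kind cannot type mentions that are missing from the dictionary and cannot resolve ambiguous names, which is the gap that data-driven approaches such as those surveyed in the tutorial aim to close.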


Published In

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data
June 2016
2300 pages
ISBN: 9781450335317
DOI: 10.1145/2882903
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. entity
  2. entity recognition
  3. entity typing
  4. phrase mining
  5. phrases
  6. text mining
  7. typing

Qualifiers

  • Research-article

Funding Sources

  • IIS-1017362
  • IIS-1354329
  • HDTRA1-10-1-0120
  • IIS-1320617
  • 1U54GM114838
  • W911NF-09-2-0053

Conference

SIGMOD/PODS'16: International Conference on Management of Data
June 26 - July 1, 2016
San Francisco, California, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%
