research-article

Automatic Entity Recognition and Typing in Massive Text Data

Authors: Ahmed El-Kishky, Jiawei Han

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data

Pages 2235 - 2239

https://doi.org/10.1145/2882903.2912567

Published: 26 June 2016

Abstract

In today's computerized and information-based society, individuals are constantly presented with vast amounts of text data, ranging from news articles, scientific publications, and product reviews to a wide range of textual information from social media. To extract value from these large, multi-domain pools of text, it is important to understand the entities they mention and the relationships among those entities. In this tutorial, we introduce data-driven methods to recognize typed entities of interest in massive, domain-specific text corpora. These methods automatically identify token spans as entity mentions in documents and label their fine-grained types (e.g., person, product, and food) in a scalable way. Because these methods do not rely on annotated data, a predefined typing schema, or hand-crafted features, they can be quickly adapted to new domains, genres, and languages. Using real datasets that span various genres (e.g., news articles, discussion forum posts, and tweets), domains (general vs. biomedical), and languages (e.g., English, Chinese, Arabic, and even low-resource languages such as Hausa and Yoruba), we demonstrate how these typed entities aid in knowledge discovery and management.
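To make the task's output concrete, the following minimal Python sketch shows what "identifying token spans as entity mentions and labeling their types" can look like as data. It is an illustration only, not the tutorial's method: the `Mention` class, the `match_mentions` function, and the toy `SEED_TYPES` dictionary are all hypothetical names, and the greedy dictionary lookup is a deliberately simple stand-in for the data-driven methods the abstract describes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """A typed entity mention expressed as a token span (hypothetical structure)."""
    start: int        # index of the first token in the span
    end: int          # index one past the last token in the span
    text: str         # surface form of the span
    type_label: str   # entity type assigned to the span


def match_mentions(tokens, seed_types, max_len=3):
    """Greedy longest-match lookup of token spans in a toy seed dictionary."""
    mentions = []
    i = 0
    while i < len(tokens):
        found = None
        # Try the longest candidate span first so multi-token names win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            surface = " ".join(tokens[i:i + n])
            if surface.lower() in seed_types:
                found = Mention(i, i + n, surface, seed_types[surface.lower()])
                break
        if found:
            mentions.append(found)
            i = found.end  # continue after the matched span
        else:
            i += 1
    return mentions


# Assumed toy dictionary; the type names mirror the abstract's examples.
SEED_TYPES = {"jiawei han": "person", "iphone": "product", "sushi": "food"}

tokens = "Jiawei Han reviewed the iPhone while eating sushi".split()
for mention in match_mentions(tokens, SEED_TYPES):
    print(mention)
```

Each result records where the mention sits in the token sequence and which type it received, which is the kind of structured output that downstream knowledge discovery can build on.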

Index Terms

Information systems

Published In

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data

June 2016, 2300 pages

ISBN: 9781450335317

DOI: 10.1145/2882903

General Chairs: Fatma Özcan (IBM Research, USA), Georgia Koutrika (HP Labs, USA)

Program Chair: Sam Madden (Massachusetts Institute of Technology, USA)

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States


Funding Sources

IIS-1017362
IIS-1354329
HDTRA1-10-1-0120
IIS-1320617
1U54GM114838
W911NF-09-2-0053

Conference

SIGMOD/PODS'16: International Conference on Management of Data

Sponsor: SIGMOD

June 26 - July 1, 2016

San Francisco, California, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%
