Unsupervised named-entity extraction from the Web: An experimental study

Authors:

Michael Cafarella,

Ana-Maria Popescu,

Stephen Soderland,

Daniel S. Weld,

Alexander Yates

Artificial Intelligence, Volume 165, Issue 1

Pages 91 - 134

Published: 01 June 2005

Abstract

The KnowItAll system aims to automate the tedious process of extracting large collections of facts (e.g., names of scientists or politicians) from the Web in an unsupervised, domain-independent, and scalable manner. The paper presents an overview of KnowItAll's novel architecture and design principles, emphasizing its distinctive ability to extract information without any hand-labeled training examples. In its first major run, KnowItAll extracted over 50,000 class instances, but suggested a challenge: How can we improve KnowItAll's recall and extraction rate without sacrificing precision? This paper presents three distinct ways to address this challenge and evaluates their performance. Pattern Learning learns domain-specific extraction rules, which enable additional extractions. Subclass Extraction automatically identifies sub-classes in order to boost recall (e.g., "chemist" and "biologist" are identified as sub-classes of "scientist"). List Extraction locates lists of class instances, learns a "wrapper" for each list, and extracts elements of each list. Since each method bootstraps from KnowItAll's domain-independent methods, the methods also obviate hand-labeled training examples. The paper reports on experiments, focused on building lists of named entities, that measure the relative efficacy of each method and demonstrate their synergy. In concert, our methods gave KnowItAll a 4-fold to 8-fold increase in recall at precision of 0.90, and discovered over 10,000 cities missing from the Tipster Gazetteer.
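
To make the List Extraction idea concrete, the following is a minimal sketch of learning a "wrapper" from a few known instances and applying it to the rest of a page. It is not KnowItAll's algorithm, which the abstract does not specify. It assumes that each list item sits directly between one opening and one closing HTML tag, and the names `seed_context`, `learn_wrapper`, and `apply_wrapper` are hypothetical.

```python
import re
from collections import Counter


def seed_context(html, seed):
    """Return the (left, right) tag text immediately around the first occurrence of seed."""
    i = html.find(seed)
    if i < 0:
        return None  # this seed does not appear on the page
    start = html.rfind("<", 0, i)          # start of the tag just before the seed
    end = html.find(">", i + len(seed))    # end of the tag just after the seed
    if start < 0 or end < 0:
        return None
    return html[start:i], html[i + len(seed):end + 1]


def learn_wrapper(html, seeds, min_support=2):
    """Pick the delimiter pair shared by at least min_support seeds (assumed heuristic)."""
    contexts = [c for c in (seed_context(html, s) for s in seeds) if c is not None]
    if not contexts:
        return None
    wrapper, support = Counter(contexts).most_common(1)[0]
    return wrapper if support >= min_support else None


def apply_wrapper(html, wrapper):
    """Extract every tag-free string enclosed by the learned delimiters."""
    left, right = wrapper
    pattern = re.escape(left) + r"([^<>]+)" + re.escape(right)
    return re.findall(pattern, html)


page = ("<h1>Capitals</h1><ul><li>Paris</li><li>Berlin</li>"
        "<li>Madrid</li><li>Oslo</li></ul>")
wrapper = learn_wrapper(page, ["Paris", "Berlin", "Tokyo"])
cities = apply_wrapper(page, wrapper) if wrapper else []
print(wrapper, cities)
```

In this toy page, two already-known cities are enough to recover the delimiters, and the wrapper then yields list elements that were not among the seeds. A real system would still need to assess each extracted candidate before accepting it.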

Index Terms

Unsupervised named-entity extraction from the Web: An experimental study
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
  2. Machine learning
    1. Learning paradigms
    2. Learning settings
2. Information systems
  1. Information retrieval

Published In

Artificial Intelligence, Volume 165, Issue 1 (June 2005), 135 pages

ISSN: 0004-3702

Copyright © 2005 Elsevier B.V.

Publisher: Elsevier Science Publishers Ltd., United Kingdom
