Learning Information Extraction Rules for Semi-Structured and Free Text

Soderland, Stephen

doi:10.1023/A:1007562322031

Published: February 1999

Machine Learning, volume 34, pages 233–272 (1999)

Abstract

A wealth of on-line text information can be made available to automatic processing by information extraction (IE) systems. Each IE application needs a separate set of rules tuned to the domain and writing style. WHISK helps to overcome this knowledge-engineering bottleneck by learning text extraction rules automatically.

WHISK is designed to handle text styles ranging from highly structured to free text, including text that is neither rigidly formatted nor composed of grammatical sentences. Such semi-structured text has largely been beyond the scope of previous systems. When used in conjunction with a syntactic analyzer and semantic tagging, WHISK can also handle extraction from free text such as news stories.
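
To make the idea of an extraction rule for semi-structured text concrete, the sketch below applies one hand-written pattern to a terse rental listing and fills a two-slot record. Everything in it is an assumption for illustration: the sample ad, the slot names `bedrooms` and `price`, and the use of a Python regular expression. It is not WHISK's rule syntax, and it does not show the learning step; it only shows the kind of rule such a system would otherwise need a knowledge engineer to write by hand.

```python
import re

# Hand-written rule (an assumption for illustration, not a learned WHISK rule):
# capture a number before "br", skip intervening text lazily, then capture
# the number that follows the next "$".
RENTAL_RULE = re.compile(r"(\d+)\s*br\b.*?\$\s*(\d+)", re.IGNORECASE)


def extract_rentals(text: str) -> list[dict[str, int]]:
    """Return one {bedrooms, price} record per match of the rule."""
    return [
        {"bedrooms": int(m.group(1)), "price": int(m.group(2))}
        for m in RENTAL_RULE.finditer(text)
    ]


if __name__ == "__main__":
    # Hypothetical listing: neither rigidly formatted nor grammatical.
    ad = "Downtown - 1 br apt, fplc, pkg incl $675. 3 BR house, gar, grt loc $995."
    for record in extract_rentals(ad):
        print(record)
```

A single rule here yields two records from one listing, which is typical of semi-structured sources where several cases share one text. The rule is brittle by design: it would miss a listing that spells out "bedroom", which is the kind of domain and style tuning the abstract describes as a knowledge-engineering bottleneck.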

Author information

Authors and Affiliations

Department of Computer Science and Engineering, University of Washington, Seattle, WA, 98195-2350
Stephen Soderland

About this article

Cite this article

Soderland, S. Learning Information Extraction Rules for Semi-Structured and Free Text. Machine Learning 34, 233–272 (1999). https://doi.org/10.1023/A:1007562322031

Issue Date: February 1999
DOI: https://doi.org/10.1023/A:1007562322031
