Abstract
A wealth of on-line text information can be made available to automatic processing by information extraction (IE) systems. Each IE application needs a separate set of rules tuned to the domain and writing style. WHISK helps to overcome this knowledge-engineering bottleneck by learning text extraction rules automatically.
WHISK is designed to handle text styles ranging from highly structured to free text, including text that is neither rigidly formatted nor composed of grammatical sentences. Such semi-structured text has largely been beyond the scope of previous systems. When used in conjunction with a syntactic analyzer and semantic tagging, WHISK can also handle extraction from free text such as news stories.
Cite this article
Soderland, S. Learning Information Extraction Rules for Semi-Structured and Free Text. Machine Learning 34, 233–272 (1999). https://doi.org/10.1023/A:1007562322031
DOI: https://doi.org/10.1023/A:1007562322031