Abstract
This paper describes an Information Extraction (IE) approach for applications that automatically fill templates from HTML documents. We developed a complete system for extracting information from Web sites. The system can use a number of algorithms to learn the document structure, the rules and keywords that locate specific information, and the spatial relations between different information items. Experiments with a well-known data set show a substantial performance improvement over standard wrapper systems.
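As a rough illustration of what template filling from HTML involves, the sketch below fills a two-slot template using keyword rules. It is not the system described in the paper: the rules here are written by hand rather than learned, the slot names and keywords (`title`, `price`, `Title:`, `Price:`) are hypothetical, and the "spatial relation" is reduced to the assumption that a value sits either right after its keyword or in the next text chunk.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the non-empty text chunks of an HTML document in order."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


# Hypothetical hand-written rules: slot name -> keyword that precedes the value.
RULES = {
    "title": "Title:",
    "price": "Price:",
}


def fill_template(html, rules=RULES):
    """Return a dict with one entry per slot; None when no rule fires."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    chunks = collector.chunks
    template = {slot: None for slot in rules}
    for i, chunk in enumerate(chunks):
        for slot, keyword in rules.items():
            # Keep the first match per slot and skip chunks without the keyword.
            if template[slot] is not None or not chunk.startswith(keyword):
                continue
            rest = chunk[len(keyword):].strip()
            if rest:
                # Value follows the keyword inside the same chunk.
                template[slot] = rest
            elif i + 1 < len(chunks):
                # Assumed spatial relation: value is the neighbouring chunk,
                # e.g. the next table cell.
                template[slot] = chunks[i + 1]
    return template


if __name__ == "__main__":
    page = (
        "<table><tr><td>Title:</td><td>Intro to IE</td></tr>"
        "<tr><td>Price: 30 EUR</td></tr></table>"
    )
    print(fill_template(page))
```

A learning-based system would induce the keywords and the positional relation from annotated example pages instead of fixing them in `RULES`.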
Copyright information
© 2001 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Xiao, L., Wissmann, D., Brown, M., Jablonski, S. (2001). Information Extraction from HTML: Combining XML and Standard Techniques for IE from the Web. In: Monostori, L., Váncza, J., Ali, M. (eds) Engineering of Intelligent Systems. IEA/AIE 2001. Lecture Notes in Computer Science, vol 2070. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45517-5_20
DOI: https://doi.org/10.1007/3-540-45517-5_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42219-8
Online ISBN: 978-3-540-45517-2
eBook Packages: Springer Book Archive