Information Extraction from HTML: Combining XML and Standard Techniques for IE from the Web

Xiao, Luo; Wissmann, Dieter; Brown, Michael; Jablonski, Stefan

doi:10.1007/3-540-45517-5_20

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2070)

Included in the following conference series:

International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems

Abstract

This paper describes Information Extraction (IE) for applications that automatically fill templates from HTML documents. We developed a complete system for extracting information from Web sites. The system can use a number of algorithms to learn the document structure, the rules and keywords that locate specific information, and the spatial relations between different information items. Experiments with a well-known data set show a substantial performance improvement over standard wrapper systems.
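
To make the idea of template filling from HTML concrete, the following is a minimal sketch and not the system described in the paper. It assumes that each template slot is identified by a hand-written keyword and that the value is the text segment immediately following that keyword in document order, a crude stand-in for the learned rules, keywords, and spatial relations mentioned above. The names `TextCollector`, `fill_template`, and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the non-empty text segments of an HTML document in order."""

    def __init__(self):
        super().__init__()
        self.segments = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.segments.append(text)


def fill_template(html, slot_keywords):
    """Fill each slot with the text segment that follows its keyword.

    slot_keywords maps a slot name to a keyword (both are illustrative
    assumptions, not taken from the paper). Slots whose keyword is not
    found keep the value None.
    """
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    segments = collector.segments

    template = {slot: None for slot in slot_keywords}
    for slot, keyword in slot_keywords.items():
        # The last segment has no successor, so it cannot act as a label.
        for i, segment in enumerate(segments[:-1]):
            if keyword.lower() in segment.lower():
                template[slot] = segments[i + 1]
                break
    return template


# Hypothetical input page and template definition.
page = """
<table>
  <tr><td>Product:</td><td>Laser printer</td></tr>
  <tr><td>Price:</td><td>199 EUR</td></tr>
</table>
"""
result = fill_template(
    page, {"product": "product", "price": "price", "vendor": "vendor"}
)
print(result)
# {'product': 'Laser printer', 'price': '199 EUR', 'vendor': None}
```

A fixed "next segment" rule like this breaks as soon as the page layout changes, which is the kind of brittleness that learning the document structure and the relations between items is meant to reduce.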

Author information

Authors and Affiliations

Interprice Technologies GmbH, Berlin, Germany
Luo Xiao & Dieter Wissmann
Dept. of Computer Sciences VI (IMMD VI), University of Erlangen-Nuremberg, Germany
Michael Brown
Siemens AG, CT SE 5, Erlangen, Germany
Stefan Jablonski

Editor information

Editors and Affiliations

Hungarian Academy of Sciences, Intelligent Manufacturing and Business Processes Computer and Automation Research Institute, Kende utca 13-17, 1111, Budapest, Hungary
László Monostori & József Váncza
Department of Computer Science 601 University Drive, Southwest Texas State University, San Marcos, TX, 78666-4616, USA
Moonis Ali

About this paper

Cite this paper

Xiao, L., Wissmann, D., Brown, M., Jablonski, S. (2001). Information Extraction from HTML: Combining XML and Standard Techniques for IE from the Web. In: Monostori, L., Váncza, J., Ali, M. (eds) Engineering of Intelligent Systems. IEA/AIE 2001. Lecture Notes in Computer Science, vol 2070. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45517-5_20

DOI: https://doi.org/10.1007/3-540-45517-5_20
Published: 18 June 2001
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42219-8
Online ISBN: 978-3-540-45517-2
eBook Packages: Springer Book Archive
