Abstract
Extracting and processing information from web pages is an important task in areas such as search engine construction, information retrieval, and Web data mining. A common approach is to represent a page as a “bag of words” and then perform additional processing on this flat representation. In this paper we propose a new, hierarchical representation that includes the browser screen coordinates of every HTML object in a page. Using this spatial information, one can define heuristics for recognizing common page areas such as the header, left and right menus, footer, and center of a page. Initial experiments show that, using our heuristics, the defined objects are recognized properly in 73% of cases.
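To make the idea of coordinate-based heuristics concrete, the following is a minimal sketch, not the heuristics used in the paper. It assumes each HTML object has already been rendered to a bounding box in screen pixels and assigns it to an area by where its centre falls. The names `Box` and `classify_area`, and the 20% `band` threshold, are hypothetical choices made only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Rendered bounding box of an HTML object, in browser screen pixels."""
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float


def classify_area(box: Box, page_width: float, page_height: float,
                  band: float = 0.2) -> str:
    """Assign a box to a page area based on the position of its centre.

    `band` is an assumed fraction of the page treated as a border region;
    it is an illustrative value, not one taken from the paper.
    """
    cx = box.x + box.width / 2
    cy = box.y + box.height / 2

    # Vertical bands are checked first, so a wide banner at the top
    # counts as a header even though it also spans the side regions.
    if cy < band * page_height:
        return "header"
    if cy > (1 - band) * page_height:
        return "footer"
    if cx < band * page_width:
        return "left menu"
    if cx > (1 - band) * page_width:
        return "right menu"
    return "center"


if __name__ == "__main__":
    page_w, page_h = 1000, 2000
    for name, box in [
        ("banner", Box(0, 0, 1000, 100)),
        ("navigation", Box(0, 500, 150, 800)),
        ("article", Box(300, 600, 400, 400)),
    ]:
        print(name, "->", classify_area(box, page_w, page_h))
```

A rule of this kind needs only the rendered position of each object, which is what distinguishes it from a flat bag-of-words view; a hierarchical representation would additionally allow a label to be applied to a whole subtree of objects rather than to each box separately.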
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kovačević, M., Dilligenti, M., Gori, M., Milutinović, V. (2002). Recognition of Common Areas in a Web Page Using a Visualization Approach. In: Scott, D. (eds) Artificial Intelligence: Methodology, Systems, and Applications. AIMSA 2002. Lecture Notes in Computer Science, vol 2443. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-46148-5_21
DOI: https://doi.org/10.1007/3-540-46148-5_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44127-4
Online ISBN: 978-3-540-46148-7
eBook Packages: Springer Book Archive