Abstract
Data-intensive information is often published on the internet as HTML tables. Extracting the information that is of interest to a user is time-consuming, especially when a large number of web pages must be accessed. To automate information extraction, this paper proposes an XML-based approach to semantically analyzing HTML tables for the data of interest. It first introduces a mini language in XML syntax for specifying ontologies that represent the data of interest. It then defines algorithms that parse HTML tables into a specially defined type of XML tree. The XML trees are compared with the ontologies to semantically analyze and locate the part of a table, or of nested tables, that contains the data of interest. Finally, the data of interest, once identified, is output as XML documents.
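The pipeline outlined in the abstract (ontology in XML, HTML table parsed to an XML tree, tree compared with the ontology, matches output as XML) can be illustrated with a minimal Python sketch. It is not the paper's algorithm or its mini language: the ontology syntax (`ontology`, `concept`, `keyword`), the tree vocabulary (`page`, `table`, `row`, `cell`), the matching rule (exact match of header-cell text against keywords), and all function names are assumptions made for illustration. It also assumes well-formed HTML with explicit closing tags.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical ontology syntax invented for this sketch (not the paper's mini language).
ONTOLOGY = """
<ontology name="car">
  <concept name="make"><keyword>make</keyword><keyword>brand</keyword></concept>
  <concept name="price"><keyword>price</keyword><keyword>cost</keyword></concept>
</ontology>
"""

class TableTreeBuilder(HTMLParser):
    """Map table/tr/td/th tags to an XML tree; nested tables become child nodes.

    Assumes every table, tr, td and th tag is explicitly closed.
    """
    TAGS = {"table": "table", "tr": "row", "td": "cell", "th": "cell"}

    def __init__(self):
        super().__init__()
        self.root = ET.Element("page")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        if tag in self.TAGS:
            self.stack.append(ET.SubElement(self.stack[-1], self.TAGS[tag]))

    def handle_endtag(self, tag):
        if tag in self.TAGS and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        node = self.stack[-1]
        if node.tag == "cell" and data.strip():
            node.text = ((node.text or "") + " " + data.strip()).strip()

def load_ontology(xml_text):
    """Return a dict from lower-cased keyword to concept name."""
    root = ET.fromstring(xml_text.strip())
    return {kw.text.strip().lower(): concept.get("name")
            for concept in root.iter("concept")
            for kw in concept.iter("keyword")}

def extract(html, ontology_xml=ONTOLOGY):
    """Return a <records> element holding the columns whose headers match the ontology."""
    keywords = load_ontology(ontology_xml)
    builder = TableTreeBuilder()
    builder.feed(html)
    builder.close()
    out = ET.Element("records")
    for table in builder.root.iter("table"):      # visits nested tables too
        rows = table.findall("row")               # direct child rows only
        if len(rows) < 2:
            continue
        header = [(c.text or "").lower() for c in rows[0].findall("cell")]
        columns = {i: keywords[h] for i, h in enumerate(header) if h in keywords}
        if not columns:
            continue                              # no data of interest in this table
        for row in rows[1:]:
            cells = row.findall("cell")
            record = ET.SubElement(out, "record")
            for i, concept in columns.items():
                if i < len(cells):
                    ET.SubElement(record, concept).text = cells[i].text or ""
    return out

# Sample input: a layout table wrapping the table that holds the data of interest.
SAMPLE_HTML = """
<table><tr><td>
  <table>
    <tr><th>Make</th><th>Year</th><th>Price</th></tr>
    <tr><td>Toyota</td><td>2003</td><td>8000</td></tr>
    <tr><td>Honda</td><td>2001</td><td>6500</td></tr>
  </table>
</td></tr></table>
"""

result = extract(SAMPLE_HTML)
print(ET.tostring(result, encoding="unicode"))
```

In this sketch the outer layout table is skipped because it has a single row, while the nested table is selected because two of its header cells match ontology keywords. The unmatched `Year` column is dropped from the output. A real system would need looser matching than exact keyword equality and would have to cope with tables that have no header row.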
This research was supported by the international joint research grant of the IITA (Institute of Information Technology Assessment) foreign professor invitation program of the MIC (Ministry of Information and Communication), Korea.
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Liu, J., Ao, Z., Park, HH., Chen, Y. (2005). An XML Approach to Semantically Extract Data from HTML Tables. In: Andersen, K.V., Debenham, J., Wagner, R. (eds) Database and Expert Systems Applications. DEXA 2005. Lecture Notes in Computer Science, vol 3588. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11546924_68
DOI: https://doi.org/10.1007/11546924_68
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28566-3
Online ISBN: 978-3-540-31729-6
eBook Packages: Computer Science, Computer Science (R0)