Abstract
We consider the problem of finding descriptive labels for anonymous, structured datasets, such as those produced by state-of-the-art Web wrappers. We give a probabilistic model that estimates the affinity between attributes and labels, and describe a method that uses a Web search engine to populate the model. We also discuss a method for finding good candidate labels for unlabeled datasets. Ours is the first unsupervised labeling method that does not rely on mining the HTML pages containing the data. Experimental results with data from eight different domains show that our methods achieve high accuracy even with very few search engine accesses.
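The abstract does not spell out the probabilistic model, so the following is only a minimal sketch of the general idea of scoring a candidate label against an attribute's values with search-engine hit counts. Everything in it is an assumption made for illustration: the "<label> such as <value>" query pattern, the ratio used as the affinity estimate, the function names, and the hit counts, which are invented toy numbers rather than real search results. It should not be read as the authors' model.

```python
from typing import Callable, Dict, List

# Hypothetical hit counts standing in for a real search engine (invented numbers).
TOY_HITS: Dict[str, int] = {
    '"Paris"': 1000,
    '"Tokyo"': 800,
    '"cities such as Paris"': 50,
    '"cities such as Tokyo"': 40,
    '"authors such as Paris"': 1,
    '"authors such as Tokyo"': 0,
}


def toy_hits(query: str) -> int:
    """Return the (made-up) number of hits for a query; unknown queries get 0."""
    return TOY_HITS.get(query, 0)


def affinity(label: str, values: List[str], hits: Callable[[str], int]) -> float:
    """Illustrative affinity between a label and an attribute.

    For each value, take hits('<label> such as <value>') / hits('<value>'),
    then average over the attribute's values. This ratio is an assumed
    estimator, not the one defined in the paper.
    """
    scores = []
    for value in values:
        total = hits(f'"{value}"')
        joint = hits(f'"{label} such as {value}"')
        scores.append(joint / total if total else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


def best_label(candidates: List[str], values: List[str],
               hits: Callable[[str], int]) -> str:
    """Pick the candidate label with the highest estimated affinity."""
    return max(candidates, key=lambda label: affinity(label, values, hits))


if __name__ == "__main__":
    column = ["Paris", "Tokyo"]  # values of one unlabeled attribute
    print(best_label(["authors", "cities"], column, toy_hits))
```

In this sketch each (label, value) pair costs one extra query beyond the per-value count, so sampling only a few values per attribute is one plausible way to keep the number of search engine accesses small.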
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
da Silva, A.S., Barbosa, D., Cavalcanti, J.M.B., Sevalho, M.A.S. (2007). Labeling Data Extracted from the Web. In: Meersman, R., Tari, Z. (eds) On the Move to Meaningful Internet Systems 2007: CoopIS, DOA, ODBASE, GADA, and IS. OTM 2007. Lecture Notes in Computer Science, vol 4803. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-76848-7_72
DOI: https://doi.org/10.1007/978-3-540-76848-7_72
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-76846-3
Online ISBN: 978-3-540-76848-7
eBook Packages: Computer Science (R0)