Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4803)

Abstract

We consider finding descriptive labels for anonymous, structured datasets, such as those produced by state-of-the-art Web wrappers. We give a probabilistic model to estimate the affinity between attributes and labels, and describe a method that uses a Web search engine to populate the model. We discuss a method for finding good candidate labels for unlabeled datasets. Ours is the first unsupervised labeling method that does not rely on mining the HTML pages containing the data. Experimental results with data from 8 different domains show that our methods achieve high accuracy even with very few search engine accesses.
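
The abstract does not spell out the probabilistic model, so the following is only a rough sketch of the general idea of scoring label/attribute affinity from search-engine hit counts. It assumes a pointwise-mutual-information style score, which is a common choice for hit-count statistics and not necessarily the model used in the paper. The names `affinity`, `best_label`, `hits`, and `total_pages`, as well as the sample counts, are hypothetical; `hits` stands in for whatever search engine interface is available.

```python
import math
from typing import Callable, Dict, List


def affinity(label: str, values: List[str],
             hits: Callable[[str], int],
             total_pages: float = 1e10) -> float:
    """Average PMI-style score between a candidate label and sample values.

    ASSUMPTION: PMI over hit counts is an illustrative stand-in, not the
    paper's model. `total_pages` is a guessed index size used to turn
    hit counts into probabilities.
    """
    label_hits = hits(label)
    if label_hits == 0 or not values:
        return float("-inf")
    scores = []
    for value in values:
        value_hits = hits(value)
        joint_hits = hits(f"{label} {value}")
        if value_hits == 0 or joint_hits == 0:
            continue  # no evidence for this value; skip it
        # log( P(label, value) / (P(label) * P(value)) )
        scores.append(
            math.log(joint_hits * total_pages / (label_hits * value_hits))
        )
    return sum(scores) / len(scores) if scores else float("-inf")


def best_label(candidates: List[str], values: List[str],
               hits: Callable[[str], int]) -> str:
    """Pick the candidate label with the highest affinity to the values."""
    return max(candidates, key=lambda c: affinity(c, values, hits))


# Hypothetical hit counts standing in for real search engine responses.
FAKE_COUNTS: Dict[str, int] = {
    "city": 1000, "color": 1000,
    "Paris": 500, "Toronto": 400,
    "city Paris": 200, "city Toronto": 150,
    "color Paris": 2, "color Toronto": 1,
}


def fake_hits(query: str) -> int:
    """Look up a made-up hit count; unknown queries return 0."""
    return FAKE_COUNTS.get(query, 0)


if __name__ == "__main__":
    print(best_label(["city", "color"], ["Paris", "Toronto"], fake_hits))
```

Scoring only a small sample of values per attribute keeps the number of search queries low, in the spirit of the abstract's remark about needing very few search engine accesses.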

Editor information

Robert Meersman, Zahir Tari

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

da Silva, A.S., Barbosa, D., Cavalcanti, J.M.B., Sevalho, M.A.S. (2007). Labeling Data Extracted from the Web. In: Meersman, R., Tari, Z. (eds) On the Move to Meaningful Internet Systems 2007: CoopIS, DOA, ODBASE, GADA, and IS. OTM 2007. Lecture Notes in Computer Science, vol 4803. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-76848-7_72

  • DOI: https://doi.org/10.1007/978-3-540-76848-7_72

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-76846-3

  • Online ISBN: 978-3-540-76848-7

  • eBook Packages: Computer Science, Computer Science (R0)
