Labeling Data Extracted from the Web

da Silva, Altigran S.; Barbosa, Denilson; Cavalcanti, João M. B.; Sevalho, Marco A. S.

doi:10.1007/978-3-540-76848-7_72

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4803)

Included in the following conference series:

OTM Confederated International Conferences "On the Move to Meaningful Internet Systems"

Abstract

We consider finding descriptive labels for anonymous, structured datasets, such as those produced by state-of-the-art Web wrappers. We give a probabilistic model to estimate the affinity between attributes and labels, and describe a method that uses a Web search engine to populate the model. We discuss a method for finding good candidate labels for unlabeled datasets. Ours is the first unsupervised labeling method that does not rely on mining the HTML pages containing the data. Experimental results with data from 8 different domains show that our methods achieve high accuracy even with very few search engine accesses.
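
The abstract does not spell out the probabilistic model, so the following is only an illustrative sketch of the general idea of scoring label/attribute affinity from search-engine hit counts. It uses pointwise mutual information as a stand-in scoring function, which is an assumption and not necessarily the authors' model. The hit counts, the corpus size, and the names `HITS`, `TOTAL_PAGES`, `pmi`, and `best_label` are all hypothetical; a real system would obtain the counts by querying a search engine.

```python
import math

# Hypothetical page counts standing in for search-engine results.
TOTAL_PAGES = 1_000_000
HITS = {
    "author": 50_000,
    "price": 80_000,
    "Tolkien": 2_000,
    "Asimov": 1_500,
    ("author", "Tolkien"): 900,
    ("author", "Asimov"): 600,
    ("price", "Tolkien"): 100,
    ("price", "Asimov"): 90,
}


def pmi(label, value, hits=HITS, total=TOTAL_PAGES):
    """Pointwise mutual information between a label and one attribute value."""
    # Additive smoothing keeps the logarithm defined when the pair never co-occurs.
    joint = hits.get((label, value), 0) + 0.5
    return math.log(joint * total / (hits[label] * hits[value]))


def best_label(values, candidates, hits=HITS, total=TOTAL_PAGES):
    """Return the candidate with the highest mean PMI over the sample values."""
    scores = {
        c: sum(pmi(c, v, hits, total) for v in values) / len(values)
        for c in candidates
    }
    return max(scores, key=scores.get), scores


label, scores = best_label(["Tolkien", "Asimov"], ["author", "price"])
print(label, scores)
```

Scoring only a handful of sample values per attribute keeps the number of queries small, which is in the spirit of the abstract's remark about needing very few search engine accesses.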

Author information

Authors and Affiliations

Universidade Federal do Amazonas, Manaus, AM, Brazil
Altigran S. da Silva, João M. B. Cavalcanti & Marco A. S. Sevalho
University of Calgary, Calgary, AB, Canada
Denilson Barbosa

Editor information

Robert Meersman, Zahir Tari

About this paper

Cite this paper

da Silva, A.S., Barbosa, D., Cavalcanti, J.M.B., Sevalho, M.A.S. (2007). Labeling Data Extracted from the Web. In: Meersman, R., Tari, Z. (eds) On the Move to Meaningful Internet Systems 2007: CoopIS, DOA, ODBASE, GADA, and IS. OTM 2007. Lecture Notes in Computer Science, vol 4803. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-76848-7_72

DOI: https://doi.org/10.1007/978-3-540-76848-7_72
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-76846-3
Online ISBN: 978-3-540-76848-7
eBook Packages: Computer Science, Computer Science (R0)
