DOI: 10.1145/2433396.2433499

Exploring structure and content on the web: extraction and integration of the semi-structured web

Published: 04 February 2013

Abstract

In this tutorial, we view the World Wide Web as a massive, decentralized database. At present, this "Web database" is presented in a manner largely devoid of consistent meaning or schema. That is not to say that Web data lacks an underlying organization; in fact, most Web content is generated from an underlying schema-bound or otherwise structured database. Information extraction is generally concerned with reconciling unstructured or semi-structured Web content with the neatly structured database paradigm. With this Web database in hand, researchers and practitioners have recently begun developing mechanisms that return structured results in response to an unstructured query. These developments are the product of (1) record, list, and table extraction from large numbers of semi-structured Web pages, (2) integration of these disparate extraction results into a consistent form, and (3) analysis of the newly extracted and integrated Web data.
Among the many fruits of this line of work is the ability of semi-structured Web data to enhance the search capabilities of a schema-bound database. Conversely, structured database records have been used to augment the Web page collections typically used by Web search engines. We will cover several key technologies and principles explored so far in the area of Web information extraction, search, and exploration.
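
The three stages named in the abstract (extraction, integration, analysis) can be illustrated with a small sketch. This is not a method from the tutorial; it is a minimal toy pipeline, assuming well-formed, non-nested HTML tables whose first row is a header. The class TableExtractor, the HEADER_SYNONYMS mapping, the helper functions, and the two sample pages are all hypothetical names and toy data introduced only for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Stage 1 (extraction): collect each <table> as a list of rows of cell text.

    Assumption: tables are not nested and tags are properly closed.
    """

    def __init__(self):
        super().__init__()
        self.tables = []   # finished tables; each is a list of rows
        self._rows = None  # rows of the table currently being parsed
        self._row = None   # cells of the row currently being parsed
        self._cell = None  # text fragments of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._rows = []
        elif tag == "tr" and self._rows is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self._rows.append(self._row)
            self._row = self._cell = None
        elif tag == "table" and self._rows is not None:
            self.tables.append(self._rows)
            self._rows = self._row = self._cell = None


# Hypothetical mapping from page-specific column headers to one shared schema.
HEADER_SYNONYMS = {
    "title": "title", "paper": "title",
    "year": "year", "published": "year",
}


def extract_records(html):
    """Turn every table on a page into dicts keyed by the shared schema."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    records = []
    for rows in parser.tables:
        if len(rows) < 2:
            continue  # no data rows below the header
        header = [HEADER_SYNONYMS.get(h.lower()) for h in rows[0]]
        for row in rows[1:]:
            # Columns with no known schema name are dropped.
            records.append({k: v for k, v in zip(header, row) if k is not None})
    return records


def integrate(pages):
    """Stage 2 (integration): merge records from all pages, keyed by title."""
    merged = {}
    for html in pages:
        for record in extract_records(html):
            key = record.get("title", "").lower()
            if key:
                merged.setdefault(key, {}).update(record)
    return list(merged.values())


# Toy pages with different header vocabularies for the same kind of data.
PAGE_A = ("<table><tr><th>Title</th><th>Year</th></tr>"
          "<tr><td>RoadRunner</td><td>2001</td></tr>"
          "<tr><td>WebTables</td><td>2008</td></tr></table>")
PAGE_B = ("<table><tr><th>Paper</th><th>Published</th></tr>"
          "<tr><td>WebTables</td><td>2008</td></tr>"
          "<tr><td>Flint</td><td>2008</td></tr></table>")

records = integrate([PAGE_A, PAGE_B])

# Stage 3 (analysis): a trivial aggregate over the integrated records.
records_per_year = Counter(r["year"] for r in records)
print(records)
print(records_per_year)
```

Real pages are rarely this clean. Schema matching by a fixed synonym table and deduplication by exact title are deliberate simplifications of the integration step.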

      Published In

      WSDM '13: Proceedings of the sixth ACM international conference on Web search and data mining
      February 2013
      816 pages
      ISBN:9781450318693
      DOI:10.1145/2433396
      Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. information extraction
      2. information integration
      3. semi-structured data

      Qualifiers

      • Abstract

      Conference

      WSDM 2013

      Acceptance Rates

      Overall Acceptance Rate 498 of 2,863 submissions, 17%

      Cited By

      • (2020) Closed sequential pattern mining for sitemap generation. World Wide Web, 24(1):175--203. DOI: 10.1007/s11280-020-00839-2. Online publication date: 27-Sep-2020.
      • (2018) Profiling Web users using big data. Social Network Analysis and Mining, 8(1). DOI: 10.1007/s13278-018-0495-0. Online publication date: 22-Mar-2018.
      • (2016) BayesWipe. Journal of Data and Information Quality, 8(1):1--30. DOI: 10.1145/2992787. Online publication date: 25-Oct-2016.
      • (2013) Extracting the semantic content of web pages via repeated structures. 2013 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), pages 1--6. DOI: 10.1109/ICMEW.2013.6618450. Online publication date: Jul-2013.
