Exploring structure and content on the web: extraction and integration of the semi-structured web

Authors:

Tim Weninger,

Jiawei Han

WSDM '13: Proceedings of the sixth ACM international conference on Web search and data mining

Pages 779 - 780

https://doi.org/10.1145/2433396.2433499

Published: 04 February 2013

Abstract

In this tutorial we view the World Wide Web as a massive, decentralized database. At present, this "Web database" is presented in a manner largely devoid of any consistent meaning or schema. That is not to say that Web data lacks an underlying organization; in fact, most Web content is generated from an underlying schema-bound or otherwise structured database. Information extraction is generally concerned with reconciling unstructured or semi-structured Web content with the neatly structured database paradigm. With this Web database in hand, researchers and practitioners have recently begun developing mechanisms that return structured results in response to an unstructured query. These new developments are a product of (1) record, list, and table extraction from large numbers of semi-structured Web pages, (2) integration of these disparate extraction results into a consistent form, and (3) analysis of the newly extracted and integrated Web data.

Among the many fruits of this line of work is the ability of semi-structured Web data to enhance the search capabilities of a schema-bound database. Conversely, structured database records have also been used to augment the Web page collections typically used by Web search engines. We will cover several key technologies and principles explored so far in the area of Web information extraction, search, and exploration.
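
As a rough illustration of step (1) above, table extraction, the following minimal sketch turns an HTML table into schema-bound records using only Python's standard `html.parser` module. It is not the method of any system covered in the tutorial. The names `TableExtractor`, `extract_records`, and `PAGE` are hypothetical, the sample page is toy data, and the sketch assumes well-formed tables whose cells are explicitly closed and whose first row carries the column names.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collects the cell text of every table row on a page (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the open cell, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Keep text only while a cell is open; markup between cells is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # discard any cell left open by malformed markup


def extract_records(html):
    """Assume the first row is the schema; map each later row onto it."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Toy input, invented for this example.
PAGE = """
<table>
  <tr><th>City</th><th>Country</th></tr>
  <tr><td>Rome</td><td>Italy</td></tr>
  <tr><td>Barcelona</td><td>Spain</td></tr>
</table>
"""

records = extract_records(PAGE)
for record in records:
    print(record)
```

Real Web pages rarely label their schema this cleanly, which is why steps (2) and (3), integration and analysis, are needed on top of raw extraction.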

Index Terms

Exploring structure and content on the web: extraction and integration of the semi-structured web
1. Information systems
  1. Information retrieval
  2. Information storage systems

Published In

WSDM '13: Proceedings of the sixth ACM international conference on Web search and data mining

February 2013

816 pages

ISBN:9781450318693

DOI:10.1145/2433396

General Chairs:
Stefano Leonardi, Sapienza University of Rome, Italy
Alessandro Panconesi, Sapienza University of Rome, Italy

Program Chairs:
Paolo Ferragina, University of Pisa, Italy
Aristides Gionis, Yahoo! Research, Barcelona, Spain

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Abstract

Conference

WSDM 2013

WSDM 2013: Sixth ACM International Conference on Web Search and Data Mining

February 4 - 8, 2013

Rome, Italy

Acceptance Rates

Overall Acceptance Rate 498 of 2,863 submissions, 17%
