Article

Using the structure of Web sites for automatic segmentation of tables

Authors:

Kristina Lerman,

Craig Knoblock

SIGMOD '04: Proceedings of the 2004 ACM SIGMOD international conference on Management of data

Pages 119 - 130

https://doi.org/10.1145/1007568.1007584

Published: 13 June 2004

Abstract

Many Web sites, especially those that dynamically generate HTML pages to display the results of a user's query, present information in the form of lists or tables. Current tools that allow applications to programmatically extract this information rely heavily on user input, often in the form of labeled extracted records. The sheer size and rate of growth of the Web make any solution that relies primarily on user input infeasible in the long term. Fortunately, many Web sites contain much explicit and implicit structure, both in layout and content, that we can exploit for the purpose of information extraction. This paper describes an approach to automatic extraction and segmentation of records from Web tables. Automatic methods do not require any user input, but rely solely on the layout and content of the Web source. Our approach relies on the common structure of many Web sites, which present information as a list or a table, with a link in each entry leading to a detail page containing additional information about that item. We describe two algorithms that use redundancies in the content of table and detail pages to aid in information extraction. The first algorithm encodes additional information provided by detail pages as constraints and finds the segmentation by solving a constraint satisfaction problem. The second algorithm uses probabilistic inference to find the record segmentation. We show how each approach can exploit the Web site structure in a general, domain-independent manner, and we demonstrate the effectiveness of each algorithm on a set of twelve Web sites.
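
The idea of using detail pages as constraints can be illustrated with a minimal sketch. This is not the authors' algorithm or constraint encoding: the function name `segment_records`, the plain substring test, and the three constraints listed in the docstring are assumptions made only for illustration, and a small backtracking search stands in for the constraint solver used in the paper.

```python
from typing import List, Optional


def segment_records(extracts: List[str],
                    detail_pages: List[str]) -> Optional[List[int]]:
    """Assign each table extract to a record (one record per detail page).

    Illustrative constraints (assumptions, not the paper's formulation):
      * an extract may belong to record r only if it occurs on detail page r
        (extracts found on no detail page are left unconstrained);
      * record indices never decrease along the table, so records are
        contiguous runs of extracts;
      * every record receives at least one extract.
    Returns one record index per extract, or None if no assignment exists.
    """
    n, m = len(extracts), len(detail_pages)
    all_records = set(range(m))
    # allowed[i]: records whose detail page contains extract i (substring test)
    allowed = []
    for text in extracts:
        hits = {r for r, page in enumerate(detail_pages) if text in page}
        allowed.append(hits or all_records)

    def search(i: int, current: int, assignment: List[int]) -> Optional[List[int]]:
        if i == n:
            # valid only if the last record was reached
            return assignment if current == m - 1 else None
        # extract i either stays in the current record or starts the next one
        for r in (current, current + 1):
            if 0 <= r < m and r in allowed[i]:
                result = search(i + 1, r, assignment + [r])
                if result is not None:
                    return result
        return None

    return search(0, -1, [])


# Hypothetical data: "Boston" appears on both detail pages, so position and
# the neighbouring unambiguous extracts decide which record it belongs to.
extracts = ["Ann Lee", "Boston", "555-0101", "Bob Ray", "Boston", "555-0202"]
detail_pages = [
    "Ann Lee, 12 Elm St, Boston, 555-0101",
    "Bob Ray, 9 Oak Ave, Boston, 555-0202",
]
print(segment_records(extracts, detail_pages))
```

In this sketch, values shared by several detail pages are ambiguous on their own, and the ordering and coverage constraints resolve them. Real pages would also need tokenization and tolerance for formatting differences between the table and the detail pages.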

Published In

SIGMOD '04: Proceedings of the 2004 ACM SIGMOD international conference on Management of data

June 2004

988 pages

ISBN:1581138598

DOI:10.1145/1007568

Conference Chairs: Arnd Christian König (Microsoft Research), Stefan Dessloch (University of Kaiserslautern, Germany)
General Chair: Patrick Valduriez (INRIA, France)
Program Chair: Gerhard Weikum (University of the Saarland)

Copyright © 2004 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

SIGMOD/PODS04

Sponsor:

SIGMOD

SIGMOD/PODS04: International Conference on Management of Data and Symposium on Principles of Database Systems

June 13 - 18, 2004

Paris, France

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%
