DOI: 10.1145/2063576.2063761
Research article

Towards a unified solution: data record region detection and segmentation

Published: 24 October 2011

Abstract

Although data record extraction from Web pages has been studied extensively, existing methods still fail on many pages because of the complexity of their format or layout. In this paper, we propose a unified method that addresses several key issues of this task in a uniform manner. We design a new search structure, the Record Segmentation Tree (RST), and propose several efficient search pruning strategies on it to identify the records in a given Web page. Unlike previous work, our method can effectively handle complicated and challenging data record regions; it achieves this by generating subtree groups dynamically from the RST structure during the search process. Furthermore, instead of string edit distance or tree edit distance, we propose a token-based edit distance that takes each DOM node as the basic unit in the cost calculation. Extensive experiments are conducted on four data sets that include flat, nested, and intertwined records. The experimental results demonstrate that our method achieves higher accuracy than three state-of-the-art methods.
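To make the idea of a token-based edit distance concrete, here is a minimal sketch in Python. It is not the paper's algorithm: the abstract does not give the cost function, so the sketch assumes unit costs for insertion, deletion, and substitution, and it assumes that a record can be flattened into a sequence of element tag names (text nodes are ignored). The names `TagTokenizer`, `dom_tokens`, and `token_edit_distance` are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser


class TagTokenizer(HTMLParser):
    """Collects one token (the tag name) per element, in document order."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Each element node contributes exactly one token; text is skipped.
        self.tokens.append(tag)


def dom_tokens(html):
    """Flatten an HTML fragment into a list of tag-name tokens."""
    parser = TagTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def token_edit_distance(a, b):
    """Levenshtein distance over token sequences.

    Unit costs are an assumption of this sketch, not taken from the paper.
    """
    prev = list(range(len(b) + 1))
    for i, token_a in enumerate(a, start=1):
        curr = [i]
        for j, token_b in enumerate(b, start=1):
            substitution = 0 if token_a == token_b else 1
            curr.append(min(prev[j] + 1,                  # delete token_a
                            curr[j - 1] + 1,              # insert token_b
                            prev[j - 1] + substitution))  # match or replace
        prev = curr
    return prev[-1]


record_a = dom_tokens("<li><a>Title</a><span>9.99</span></li>")
record_b = dom_tokens("<li><a>Title</a><span>9.99</span><img></li>")
print(token_edit_distance(record_a, record_b))  # → 1
```

Because whole nodes are compared rather than characters, the two candidate records above differ by a single inserted `img` token, regardless of how long the tag names or text contents are.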

      Published In

      CIKM '11: Proceedings of the 20th ACM International Conference on Information and Knowledge Management
      October 2011
      2712 pages
      ISBN: 9781450307178
      DOI: 10.1145/2063576
      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. RST structure
      2. web data record extraction
      3. web information integration

      Acceptance Rates

      Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
