
Record-boundary discovery in Web documents

Published: 01 June 1999

Abstract

Extracting information from unstructured or semistructured Web documents often requires recognizing and delimiting records. (By “record” we mean a group of information relevant to some entity.) Unless documents that contain multiple records are first chunked according to record boundaries, extraction of record information is unlikely to succeed. In this paper we describe a heuristic approach to discovering record boundaries in Web documents. In our approach, we capture the structure of a document as a tree of nested HTML tags, locate the subtree containing the records of interest, identify candidate separator tags within the subtree using five independent heuristics, and select a consensus separator tag based on a combined heuristic. Our approach is fast (it runs in linear time for practical cases within the context of the larger data-extraction problem) and accurate (100% in the experiments we conducted).
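
The pipeline described in the abstract (tag tree, record subtree, candidate separator tags, consensus) can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions. The abstract does not say how the record subtree is located, so the sketch simply takes the node with the most children. The five heuristics are not named in the abstract, so two stand-in heuristics are used (child-tag frequency and a hypothetical fixed preference list `PREFERRED`). The combined heuristic is replaced by a simple rank-sum vote. All class and function names are illustrative.

```python
from collections import Counter
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}  # tags with no end tag
PREFERRED = ["hr", "tr", "td", "p", "br"]  # hypothetical preference order


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TagTreeBuilder(HTMLParser):
    """Builds a tree of nested tags; text content is ignored in this sketch."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend only into tags that can nest

    def handle_endtag(self, tag):
        # Close the nearest open ancestor with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def highest_fanout(root):
    """Assumption: the record subtree is rooted at the node with most children."""
    best, stack = root, [root]
    while stack:
        node = stack.pop()
        if len(node.children) > len(best.children):
            best = node
        stack.extend(node.children)
    return best


def rank_by_count(subtree):
    """Stand-in heuristic 1: more frequent child tags rank higher."""
    counts = Counter(child.tag for child in subtree.children)
    return [tag for tag, _ in counts.most_common()]


def rank_by_preference(subtree):
    """Stand-in heuristic 2: rank child tags by the hypothetical PREFERRED list."""
    present = {child.tag for child in subtree.children}
    return [tag for tag in PREFERRED if tag in present]


def consensus_separator(subtree, heuristics):
    """Rank-sum vote: each heuristic awards more points to higher-ranked tags."""
    scores = Counter()
    for heuristic in heuristics:
        ranking = heuristic(subtree)
        for position, tag in enumerate(ranking):
            scores[tag] += len(ranking) - position
    return max(scores, key=scores.get) if scores else None


def find_separator(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    subtree = highest_fanout(builder.root)
    return consensus_separator(subtree, [rank_by_count, rank_by_preference])


SAMPLE = """<html><body><h1>Listings</h1>
<div><hr><b>A</b><p>first</p><hr><b>B</b><p>second</p><hr><b>C</b><p>third</p><hr></div>
</body></html>"""

if __name__ == "__main__":
    print(find_separator(SAMPLE))  # hr
```

On the sample page, the `div` has the highest fan-out and both stand-in heuristics rank `hr` first, so the vote selects `hr` as the separator. Real pages often omit end tags, which this sketch does not repair.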



Published In

ACM SIGMOD Record, Volume 28, Issue 2
June 1999
599 pages
ISSN: 0163-5808
DOI: 10.1145/304181

SIGMOD '99: Proceedings of the 1999 ACM SIGMOD international conference on Management of data
June 1999
604 pages
ISBN: 1581130848
DOI: 10.1145/304182
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 June 1999
Published in SIGMOD Volume 28, Issue 2

Qualifiers

  • Article

Article Metrics

  • Downloads (last 12 months): 73
  • Downloads (last 6 weeks): 16
Reflects downloads up to 03 Oct 2024

Cited By

  • (2022) Advanced Metasearch Engine Technology. Online publication date: 24-Feb-2022
  • (2017) CMDR: Classifying nodes for mining data records with different HTML structures. TENCON 2017 - 2017 IEEE Region 10 Conference, pp. 1862-1862. DOI: 10.1109/TENCON.2017.8228162. Online publication date: Nov-2017
  • (2016) A survey of methods for the extraction of information from Web resources. Programming and Computing Software, 42:5, pp. 279-291. DOI: 10.1134/S0361768816050078. Online publication date: 1-Sep-2016
  • (2016) The BBC News Hunter: A Novel Crawler for BBC News. Social Computing, pp. 217-225. DOI: 10.1007/978-981-10-2098-8_26. Online publication date: 31-Jul-2016
  • (2015) Annotating Needles in the Haystack without Looking. Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 2257-2266. DOI: 10.1145/2783258.2788580. Online publication date: 10-Aug-2015
  • (2013) An Approach of Web Page Information Extraction. Applied Mechanics and Materials, 347-350, pp. 2479-2482. DOI: 10.4028/www.scientific.net/AMM.347-350.2479. Online publication date: Aug-2013
  • (2013) Locating Discharge Medications in Natural Language Summaries. Proceedings of the International Conference on Bioinformatics, Computational Biology and Biomedical Informatics, pp. 917-925. DOI: 10.1145/2506583.2512364. Online publication date: 22-Sep-2013
  • (2013) SCOPAS — SEMANTIC COMPUTATION OF PAGE SCORE. International Journal of Information Technology & Decision Making, 12:06, pp. 1309-1331. DOI: 10.1142/S0219622013500387. Online publication date: 12-Dec-2013
  • (2013) Related Work. Unsupervised Information Extraction by Text Segmentation, pp. 9-17. DOI: 10.1007/978-3-319-02597-1_2. Online publication date: 24-Oct-2013
  • (2012) A Visual Based Page Segmentation for Deep Web Data Extraction. Proceedings of the International Conference on Soft Computing for Problem Solving (SocProS 2011) December 20-22, 2011, pp. 791-804. DOI: 10.1007/978-81-322-0491-6_72. Online publication date: 2012
