DOI: 10.1145/276304.276330

NoDoSE—a tool for semi-automatically extracting structured and semistructured data from text documents

Published: 01 June 1998

Abstract

Often interesting structured or semistructured data is not in database systems but in HTML pages, text files, or on paper. The data in these formats is not usable by standard query processing engines and hence users need a way of extracting data from these sources into a DBMS or of writing wrappers around the sources. This paper describes NoDoSE, the Northwestern Document Structure Extractor, which is an interactive tool for semi-automatically determining the structure of such documents and then extracting their data. Using a GUI, the user hierarchically decomposes the file, outlining its interesting regions and then describing their semantics. This task is expedited by a mining component that attempts to infer the grammar of the file from the information the user has input so far. Once the format of a document has been determined, its data can be extracted into a number of useful forms. This paper describes both the NoDoSE architecture, which can be used as a test bed for structure mining algorithms in general, and the mining algorithms that have been developed by the author. The prototype, which is written in Java, is described and experiences parsing a variety of documents are reported.
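
The workflow in the abstract (the user outlines a few regions, a mining component guesses the format, and the data is then extracted) can be illustrated with a small sketch. This is not NoDoSE's algorithm or its Java code: the `Node` class, the function names, and the single-delimiter heuristic are all assumptions made for illustration, and the sketch handles only flat, line-oriented records.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A labeled region of the document, with optional nested regions (hypothetical)."""
    label: str
    start: int
    end: int
    children: list["Node"] = field(default_factory=list)


def infer_delimiter(line: str, field_spans: list[tuple[int, int]]) -> str:
    """Guess the field separator from the spans a user marked on one example line."""
    # Collect the text that lies between each pair of consecutive marked fields.
    gaps = {
        line[prev_end:next_start]
        for (_, prev_end), (next_start, _) in zip(field_spans, field_spans[1:])
    }
    if len(gaps) != 1 or "" in gaps:
        raise ValueError("marked fields are not separated by one consistent delimiter")
    return gaps.pop()


def extract(text: str, labels: list[str], field_spans: list[tuple[int, int]]) -> Node:
    """Build a document -> record -> field tree, generalizing from the first line."""
    lines = text.splitlines(keepends=True)
    delim = infer_delimiter(lines[0], field_spans)
    root = Node("document", 0, len(text))
    offset = 0  # character offset of the current line within the whole text
    for line in lines:
        body = line.rstrip("\n")
        if body:
            record = Node("record", offset, offset + len(body))
            pos = offset
            for label, value in zip(labels, body.split(delim)):
                record.children.append(Node(label, pos, pos + len(value)))
                pos += len(value) + len(delim)
            root.children.append(record)
        offset += len(line)
    return root


def to_rows(text: str, root: Node) -> list[dict[str, str]]:
    """Flatten the tree into rows that could be loaded into a DBMS table."""
    return [
        {child.label: text[child.start:child.end] for child in record.children}
        for record in root.children
    ]
```

For example, given a pipe-separated file, the user would mark the three fields of the first line as character spans such as `(0, 8)`, `(9, 13)`, and `(14, 20)`; `extract` then applies the inferred separator to every remaining line, and `to_rows` returns one dictionary per record. A real structure-mining component would need to handle nested and irregular formats, which this sketch does not attempt.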

Published In

SIGMOD '98: Proceedings of the 1998 ACM SIGMOD international conference on Management of data
June 1998
599 pages
ISBN:0897919955
DOI:10.1145/276304

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SIGMOD/PODS98: Special Interest Group on Management of Data
June 1 - 4, 1998
Seattle, Washington, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%
