Structure interpretation of text formats

Authors:

Sumit Gulwani, Vu Le, Arjun Radhakrishna, Ivan Radiček, Mohammad Raza

Proceedings of the ACM on Programming Languages, Volume 4, Issue OOPSLA

Article No.: 212, Pages 1 - 29

https://doi.org/10.1145/3428280

Published: 13 November 2020

Abstract

We present Unravel, an extensible framework for structure interpretation of ad-hoc formats. Unravel can automatically, with no user input, extract tabular data from a diverse range of standard, ad-hoc and mixed format files. The framework is also easily extensible to add support for previously unseen formats, and also supports interactivity from the user in terms of examples to guide the system when specialized data extraction is desired. Our key insight is to allow arbitrary combination of extraction and parsing techniques through a concept called partial structures. Partial structures act as a common language through which the file structure can be shared and refined by different techniques. This makes Unravel more powerful than applying the individual techniques in parallel or sequentially. Further, with this rule-based extensible approach, we introduce the novel notion of re-interpretation where the variety of techniques supported by our system can be exploited to improve accuracy while optimizing for particular quality measures or restricted environments. On our benchmark of 617 text files gathered from a variety of sources, Unravel is able to extract the intended table in many more cases compared to state-of-the-art techniques.

Supplementary Material

Auxiliary Presentation Video (oopsla20main-p421-p-video.mp4)

Data repositories often consist of text files in a wide variety of standard formats, ad-hoc formats, as well as mixtures of formats where data in one format is embedded into a different format. It is therefore a significant challenge to parse these files into a structured tabular form, which is important to enable any downstream data processing.
