research-article
Open access

Structure interpretation of text formats

Published: 13 November 2020
  • Abstract

    Data repositories often consist of text files in a wide variety of standard formats, ad-hoc formats, as well as mixtures of formats where data in one format is embedded into a different format. It is therefore a significant challenge to parse these files into a structured tabular form, which is important to enable any downstream data processing.
    We present Unravel, an extensible framework for structure interpretation of ad-hoc formats. Unravel can automatically, with no user input, extract tabular data from a diverse range of standard, ad-hoc and mixed format files. The framework is easily extended to support previously unseen formats, and it accepts user-provided examples to guide the system when specialized data extraction is desired. Our key insight is to allow arbitrary combinations of extraction and parsing techniques through a concept called partial structures. Partial structures act as a common language through which different techniques share and refine the file structure. This makes Unravel more powerful than applying the individual techniques in parallel or sequentially. Further, with this rule-based extensible approach, we introduce the novel notion of re-interpretation, in which the variety of techniques supported by our system is exploited to improve accuracy while optimizing for particular quality measures or restricted environments. On our benchmark of 617 text files gathered from a variety of sources, Unravel extracts the intended table in many more cases than state-of-the-art techniques.
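
    The idea of techniques sharing and refining one intermediate structure can be illustrated with a small sketch. This is not Unravel's implementation or API: the `Partial` class, the three rules, the `|` delimiter and the sample log text are all hypothetical, chosen only to show how independent rules might cooperate on a mixed-format file (delimited lines with an embedded JSON field).

```python
import json
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Partial:
    """Hypothetical partial structure: a table whose cells may still hold raw text."""
    columns: List[str]
    rows: List[List[str]]


# A rule returns a refined structure, or None when it does not apply.
Rule = Callable[[Partial], Optional[Partial]]


def split_lines(p: Partial) -> Optional[Partial]:
    """Refine a single raw blob into one row per non-empty line."""
    if p.columns != ["raw"] or len(p.rows) != 1:
        return None
    lines = [ln for ln in p.rows[0][0].splitlines() if ln.strip()]
    return Partial(["line"], [[ln] for ln in lines])


def split_delimited(p: Partial, delim: str = "|") -> Optional[Partial]:
    """Refine whole lines into fields when every line splits to the same width."""
    if p.columns != ["line"] or not p.rows:
        return None
    parts = [row[0].split(delim) for row in p.rows]
    width = len(parts[0])
    if width < 2 or any(len(x) != width for x in parts):
        return None
    return Partial([f"c{i}" for i in range(width)], parts)


def expand_json(p: Partial) -> Optional[Partial]:
    """Refine the first column whose cells are all JSON objects with equal keys."""
    for i, name in enumerate(p.columns):
        objs = []
        for row in p.rows:
            try:
                obj = json.loads(row[i])
            except ValueError:
                break
            if not isinstance(obj, dict):
                break
            objs.append(obj)
        else:
            if objs and all(o.keys() == objs[0].keys() for o in objs):
                keys = list(objs[0])
                cols = p.columns[:i] + [f"{name}.{k}" for k in keys] + p.columns[i + 1:]
                rows = [
                    row[:i] + [str(o[k]) for k in keys] + row[i + 1:]
                    for row, o in zip(p.rows, objs)
                ]
                return Partial(cols, rows)
    return None


def interpret(text: str, rules: List[Rule]) -> Partial:
    """Apply rules repeatedly until no rule refines the structure further."""
    p = Partial(["raw"], [[text]])
    changed = True
    while changed:
        changed = False
        for rule in rules:
            q = rule(p)
            if q is not None:
                p, changed = q, True
    return p


# Hypothetical mixed-format input: pipe-delimited lines with a JSON payload.
SAMPLE = (
    '2020-11-13|INFO|{"user": "ann", "ms": 12}\n'
    '2020-11-14|WARN|{"user": "bob", "ms": 40}\n'
)
RULES: List[Rule] = [split_lines, split_delimited, expand_json]
table = interpret(SAMPLE, RULES)
```

    In this sketch no rule knows about the others. Each one inspects the current structure and refines it only if it applies, so supporting a new format amounts to appending another rule to `RULES`.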

    Supplementary Material

    Auxiliary Presentation Video (oopsla20main-p421-p-video.mp4)

    Information

    Published In

    Proceedings of the ACM on Programming Languages  Volume 4, Issue OOPSLA
    November 2020
    3108 pages
    EISSN:2475-1421
    DOI:10.1145/3436718
    This work is licensed under a Creative Commons Attribution International 4.0 License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. data extraction
    2. format diversity
    3. program synthesis

    Qualifiers

    • Research-article
