research-article
DOI: 10.1145/1328438.1328488

From dirt to shovels: fully automatic tool generation from ad hoc data

Published: 07 January 2008

Abstract

An ad hoc data source is any semistructured data source for which useful data analysis and transformation tools are not readily available. Such data must be queried, transformed and displayed by systems administrators, computational biologists, financial analysts and hosts of others on a regular basis. In this paper, we demonstrate that it is possible to generate a suite of useful data processing tools, including a semistructured query engine, several format converters, a statistical analyzer and data visualization routines, directly from the ad hoc data itself, without any human intervention. The key technical contribution of the work is a multi-phase algorithm that automatically infers the structure of an ad hoc data source and produces a format specification in the PADS data description language. Programmers wishing to implement custom data analysis tools can use such descriptions to generate printing and parsing libraries for the data. Alternatively, our software infrastructure will push these descriptions through the PADS compiler, creating format-dependent modules that, when linked with format-independent algorithms for analysis and transformation, result in fully functional tools. We evaluate the performance of our inference algorithm, showing that it scales linearly in the size of the training data and completes in seconds, as opposed to the hours or days it takes to write a description by hand. We also evaluate the correctness of the algorithm, demonstrating that generating accurate descriptions often requires less than 5% of the available data.
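
To make the idea of inferring a format description from raw records concrete, the following is a toy sketch only. It is not the paper's multi-phase algorithm and it does not emit PADS syntax. The token classes (Int, Word, Space, Punct), the function names, and the struct/union output notation are all assumptions made for illustration. Each record is reduced to a sequence of token classes, and records that share a sequence are summarized as one struct.

```python
import re
from collections import Counter

# Hypothetical token classes for this illustration (not the PADS token set).
TOKEN_PATTERNS = [
    ("Int", r"\d+"),
    ("Word", r"[A-Za-z]+"),
    ("Space", r"\s+"),
    ("Punct", r"[^A-Za-z0-9\s]"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_PATTERNS))


def tokenize(line):
    """Return the tuple of token-class names for one record."""
    return tuple(m.lastgroup for m in TOKEN_RE.finditer(line))


def infer_description(lines):
    """Group records by token-class sequence and describe each group.

    A single shared shape yields one struct; several shapes yield a union
    of structs, most frequent shape first. The output notation is made up
    for this sketch.
    """
    shapes = Counter(tokenize(line) for line in lines)
    structs = ["struct(" + ", ".join(shape) + ")" for shape, _ in shapes.most_common()]
    return structs[0] if len(structs) == 1 else "union(" + " | ".join(structs) + ")"


if __name__ == "__main__":
    sample = ["alice 42", "bob 7", "carol 1234"]
    print(infer_description(sample))
```

For the three sample records, every line tokenizes to the same shape, so the sketch reports a single struct of Word, Space, Int. The sketch makes one pass over the records, so its cost grows with the amount of input. This says nothing about how the paper achieves the linear scaling it reports.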

    Published In

    POPL '08: Proceedings of the 35th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages
    January 2008
    448 pages
    ISBN: 9781595936899
    DOI: 10.1145/1328438

    ACM SIGPLAN Notices, Volume 43, Issue 1 (POPL '08)
    January 2008
    420 pages
    ISSN: 0362-1340
    EISSN: 1558-1160
    DOI: 10.1145/1328897
    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. ad hoc data
    2. data description languages
    3. grammar induction
    4. tool generation

    Qualifiers

    • Research-article

    Conference

    POPL08

    Acceptance Rates

    Overall Acceptance Rate 824 of 4,130 submissions, 20%

    Article Metrics

    • Downloads (Last 12 months): 39
    • Downloads (Last 6 weeks): 4
    Reflects downloads up to 13 Jan 2025

    Cited By

    • (2024) Diffy: Data-Driven Bug Finding for Configurations. Proceedings of the ACM on Programming Languages 8:PLDI (199-222). DOI: 10.1145/3656385. Online publication date: 20-Jun-2024
    • (2024) Exploiting Data-pattern-aware Vertical Partitioning to Achieve Fast and Low-cost Cloud Log Storage. ACM Transactions on Storage 20:2 (1-35). DOI: 10.1145/3643641. Online publication date: 19-Feb-2024
    • (2023) Saggitarius: A DSL for Specifying Grammatical Domains. Proceedings of the ACM on Programming Languages 7:OOPSLA2 (2023-2051). DOI: 10.1145/3622869. Online publication date: 16-Oct-2023
    • (2020) ptype: probabilistic type inference. Data Mining and Knowledge Discovery. DOI: 10.1007/s10618-020-00680-1. Online publication date: 16-Mar-2020
    • (2020) Finding Performance Patterns from Logs with High Confidence. Web Services – ICWS 2020 (164-178). DOI: 10.1007/978-3-030-59618-7_11. Online publication date: 19-Sep-2020
    • (2019) Synthesizing symmetric lenses. Proceedings of the ACM on Programming Languages 3:ICFP (1-28). DOI: 10.1145/3341699. Online publication date: 26-Jul-2019
    • (2019) A Method and Tool for Automated Induction of Relations from Quantitative Performance Logs. Cloud Computing – CLOUD 2019 (11-25). DOI: 10.1007/978-3-030-23502-4_2. Online publication date: 25-Jun-2019
    • (2018) Navigating the Data Lake with DATAMARAN. Proceedings of the 2018 International Conference on Management of Data (943-958). DOI: 10.1145/3183713.3183746. Online publication date: 27-May-2018
    • (2018) SmartInspect: solidity smart contract inspector. 2018 International Workshop on Blockchain Oriented Software Engineering (IWBOSE) (9-18). DOI: 10.1109/IWBOSE.2018.8327566. Online publication date: 20-Mar-2018
    • (2016) PipeGen. Proceedings of the Seventh ACM Symposium on Cloud Computing (470-483). DOI: 10.1145/2987550.2987567. Online publication date: 5-Oct-2016
