DOI:10.1145/2737924.2737952
research-article

FlashRelate: extracting relational data from semi-structured spreadsheets using examples

Published: 03 June 2015

    Abstract

    With hundreds of millions of users, spreadsheets are one of the most important end-user applications. Spreadsheets are easy to use and allow users great flexibility in storing data. This flexibility comes at a price: users often treat spreadsheets as a poor man's database, leading to creative solutions for storing high-dimensional data. The trouble arises when users need to answer queries with their data. Data manipulation tools make strong assumptions about data layouts and cannot read these ad-hoc databases. Converting data into the appropriate layout requires programming skills or a major investment in manual reformatting. The effect is that a vast amount of real-world data is "locked-in" to a proliferation of one-off formats. We introduce FlashRelate, a synthesis engine that lets ordinary users extract structured relational data from spreadsheets without programming. Instead, users extract data by supplying examples of output relational tuples. FlashRelate uses these examples to synthesize a program in Flare. Flare is a novel extraction language that extends regular expressions with geometric constructs. An interactive user interface on top of FlashRelate lets end users extract data by point-and-click. We demonstrate that correct Flare programs can be synthesized in seconds from a small set of examples for 43 real-world scenarios. Finally, our case study demonstrates FlashRelate's usefulness in addressing the widespread problem of data trapped in corporate and government formats.
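
    The idea of combining regular expressions on cell contents with geometric relations between cells can be illustrated with a small Python sketch. This is not Flare syntax and not the FlashRelate synthesis algorithm; it is a hand-written extractor of the kind such a tool might produce. The helper names `walk` and `extract`, the sample grid, and the three regular expressions are all hypothetical and chosen only for illustration.

```python
import re


def walk(grid, row, col, d_row, d_col, pattern):
    """Step away from (row, col) in direction (d_row, d_col) and return the
    first cell whose whole text matches `pattern`, or None if the edge of
    the grid is reached first."""
    r, c = row + d_row, col + d_col
    while 0 <= r < len(grid) and 0 <= c < len(grid[r]):
        if re.fullmatch(pattern, grid[r][c]):
            return grid[r][c]
        r, c = r + d_row, c + d_col
    return None


def extract(grid):
    """Turn a cross-tab layout into (country, year, value) tuples.

    Cell constraint: a value cell looks like a decimal number.
    Spatial constraints: its country is the nearest name to the left,
    and its year is the nearest four-digit cell above.
    """
    tuples = []
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            if not re.fullmatch(r"\d+\.\d+", cell):
                continue  # not a value cell
            country = walk(grid, r, c, 0, -1, r"[A-Z][a-z]+")
            year = walk(grid, r, c, -1, 0, r"\d{4}")
            if country is not None and year is not None:
                tuples.append((country, year, cell))
    return tuples


# Hypothetical semi-structured sheet: a header row of years, a label
# column of countries, and a blank spacer row in the middle.
grid = [
    ["",        "1990", "2000"],
    ["Albania", "3.1",  "3.3"],
    ["",        "",     ""],
    ["Austria", "7.7",  "8.0"],
]

for t in extract(grid):
    print(t)
```

    On this sample grid the sketch yields one tuple per value cell, for example ("Austria", "1990", "7.7"), even though a blank row separates that cell from its header. In the workflow the abstract describes, the user would supply a few such output tuples as examples instead of writing the extractor by hand.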

    Table of Contents

    • Introduction
    • Flare Language
    • FlashRelate Synthesis Algorithm
    • Definitions
    • Algorithm
    • Step 1: Determine Cell Constraints
    • Step 2: Determine Spatial Constraints
    • Step 3: Find a Satisfying Set of Constraints
    • Implementation Details
    • Complexity Analysis
    • Flare Run-Time
    • FlashRelate Run-Time
    • Evaluation
    • Benchmark Spreadsheets and Tasks
    • Experimental Setup
    • Results
    • Case Study
    • Related Work
    • Conclusion

    Published In

    PLDI '15: Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation
    June 2015
    630 pages
    ISBN:9781450334686
    DOI:10.1145/2737924
    • ACM SIGPLAN Notices, Volume 50, Issue 6
      PLDI '15
      June 2015
      630 pages
      ISSN:0362-1340
      EISSN:1558-1160
      DOI:10.1145/2813885
      • Editor: Andy Gill
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. data extraction
    2. program synthesis
    3. relational data
    4. spreadsheets

    Qualifiers

    • Research-article

    Conference

    PLDI '15

    Acceptance Rates

    Overall Acceptance Rate 406 of 2,067 submissions, 20%

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)48
    • Downloads (Last 6 weeks)11
    Reflects downloads up to 12 Aug 2024

    Cited By

    • (2024) Interactive Table Synthesis With Natural Language. IEEE Transactions on Visualization and Computer Graphics, 30:9 (6130-6145). DOI:10.1109/TVCG.2023.3329120. Online publication date: Sep-2024
    • (2023) Saggitarius: A DSL for Specifying Grammatical Domains. Proceedings of the ACM on Programming Languages, 7:OOPSLA2 (2023-2051). DOI:10.1145/3622869. Online publication date: 16-Oct-2023
    • (2023) Trace-Guided Inductive Synthesis of Recursive Functional Programs. Proceedings of the ACM on Programming Languages, 7:PLDI (860-883). DOI:10.1145/3591255. Online publication date: 6-Jun-2023
    • (2023) TableProcessor: The Tool for the Analysis and the Interpretation of Web Tables to Create the Geo Knowledge Base of Kazakhstan. Artificial Intelligence in Models, Methods and Applications (219-229). DOI:10.1007/978-3-031-22938-1_15. Online publication date: 25-Apr-2023
    • (2022) Spine: Scaling up Programming-by-Negative-Example for String Filtering and Transformation. Proceedings of the 2022 International Conference on Management of Data (521-530). DOI:10.1145/3514221.3517908. Online publication date: 10-Jun-2022
    • (2022) Rigel: Transforming Tabular Data by Declarative Mapping. IEEE Transactions on Visualization and Computer Graphics (1-11). DOI:10.1109/TVCG.2022.3209385. Online publication date: 2022
    • (2022) HiTailor: Interactive Transformation and Visualization for Hierarchical Tabular Data. IEEE Transactions on Visualization and Computer Graphics (1-10). DOI:10.1109/TVCG.2022.3209354. Online publication date: 2022
    • (2022) Efficiently Transforming Tables for Joinability. 2022 IEEE 38th International Conference on Data Engineering (ICDE) (1649-1661). DOI:10.1109/ICDE53745.2022.00169. Online publication date: May-2022
    • (2022) Robotic Process Mining. Process Mining Handbook (468-491). DOI:10.1007/978-3-031-08848-3_16. Online publication date: 27-Jun-2022
    • (2022) Table understanding: Problem overview. WIREs Data Mining and Knowledge Discovery, 13:1. DOI:10.1002/widm.1482. Online publication date: 21-Nov-2022
