Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
research-article

FlashRelate: extracting relational data from semi-structured spreadsheets using examples

Published: 03 June 2015 Publication History

Abstract

With hundreds of millions of users, spreadsheets are one of the most important end-user applications. Spreadsheets are easy to use and allow users great flexibility in storing data. This flexibility comes at a price: users often treat spreadsheets as a poor man's database, leading to creative solutions for storing high-dimensional data. The trouble arises when users need to answer queries with their data. Data manipulation tools make strong assumptions about data layouts and cannot read these ad-hoc databases. Converting data into the appropriate layout requires programming skills or a major investment in manual reformatting. The effect is that a vast amount of real-world data is "locked-in" to a proliferation of one-off formats. We introduce FlashRelate, a synthesis engine that lets ordinary users extract structured relational data from spreadsheets without programming. Instead, users extract data by supplying examples of output relational tuples. FlashRelate uses these examples to synthesize a program in Flare. Flare is a novel extraction language that extends regular expressions with geometric constructs. An interactive user interface on top of FlashRelate lets end users extract data by point-and-click. We demonstrate that correct Flare programs can be synthesized in seconds from a small set of examples for 43 real-world scenarios. Finally, our case study demonstrates FlashRelate's usefulness addressing the widespread problem of data trapped in corporate and government formats.

References

[1]
D. Angluin. Learning regular sets from queries and counterexamples. Inf. Comput., 75(2):87–106, 1987.
[2]
M. J. Cafarella, A. Halevy, and J. Madhavan. Structured data on the web. CACM, 54(2):72–79, 2011.
[3]
C.-H. Chang and S.-C. Lui. Iepad: information extraction based on pattern discovery. In WWW, 2001.
[4]
Z. Chen and M. Cafarella. Automatic web spreadsheet dat extraction. In SSW’13, 2013.
[5]
Z. Chen, M. Cafarella, J. Chen, D. Prevo, and J. Zhuang. Senbazuru: a prototype spreadsheet database management system. PVLDB, 6(12):1202–1205, 2013.
[6]
V. Crescenzi, G. Mecca, and P. Merialdo. Roadrunner: Towards automatic data extraction from large web sites. In VLDB, 2001.
[7]
J. Cunha, J. Saraiva, and J. Visser. From spreadsheets to relational databases and back. In PEPM 2009, pp. 179–188. ACM, 2009.
[8]
E. Ferrara, P. De Meo, G. Fiumara, and R. Baumgartner. Web data extraction, applications and techniques: a survey. arXiv preprint arXiv:1207.0246, 2012.
[9]
K. Fisher and D. Walker. The PADS project: an overview. In ICDT, 2011.
[10]
M. I. Fisher and G. Rothermel. The EUSES Spreadsheet Corpus: A shared resource for supporting experimentation with spreadsheet dependability mechanisms. In 1st WEUSE, pp. 47–51, 2005.
[11]
S. Gulwani. Automating string processing in spreadsheets using inputoutput examples. In POPL, 2011.
[12]
S. Gulwani. Synthesis from examples: Interaction models and algorithms. In SYNASC, 2012.
[13]
W. R. Harris and S. Gulwani. Spreadsheet table transformations from examples. In PLDI, 2011.
[14]
F. Hermans, M. Pinzger, and A. van Deursen. Automatically extracting class diagrams from spreadsheets. In ECOOP 2010 - Object-Oriented Programming, 24th European Conference, Maribor, Slovenia, June 21- 25, 2010. Proceedings, pp. 52–75, 2010.
[15]
C.-N. Hsu and M.-T. Dung. Generating finite-state transducers for semi-structured data extraction from the web. Inf. Syst., 23(9), 1998.
[16]
S. Kandel, A. Paepcke, J. Hellerstein, and J. Heer. Wrangler: Interactive visual specification of data transformation scripts. In CHI, 2011.
[17]
J. B. Kruskal. On the Shortest Spanning Subtree of a Graph and the Traveling Salesman Problem. Proceedings of the American Mathematical Society, 7(1):48–50, Feb. 1956.
[18]
N. Kushmerick, D. S. Weld, and R. B. Doorenbos. Wrapper induction for information extraction. In IJCAI (1), 1997.
[19]
V. Le and S. Gulwani. FlashExtract: A Framework for Data Extraction by Examples. In PLDI, pp. 542–553, 2014.
[20]
H. Lieberman. Your Wish Is My Command: Programming by Example. Morgan Kaufmann, 2001.
[21]
E. Lu, R. Bodik, and B. Hartmann. Quicksilver: Automatic Synthesis of Relational Queries. Tech. Rep. UCB/EECS-2013-68, UC-Berkeley, May 2013.
[22]
I. Muslea, S. Minton, and C. A. Knoblock. A hierarchical approach to wrapper induction. In Agents, 1999.
[23]
E. Oro and M. Ruffolo. Sila: a spatial instance learning approach for deep webpages. In CIKM 2011, pp. 2329–2332. ACM, 2011.
[24]
E. Oro, M. Ruffolo, and S. Staab. Sxpath: extending xpath towards spatial querying on web documents. PVLDB, 4(2):129–140, 2010.
[25]
ProPublica. Tabula: Extract tables from pdfs, 2014.
[26]
T. Register. Microsoft feeds excel to supercomputer, Nov. 2009.
[27]
R. Verborgh and M. De Wilde. Using OpenRefine. Packt Publishing, Sept. 2013.
[28]
P. Wegner. A technique for counting ones in a binary computer. Commun. ACM, 3(5):322–, May 1960.
[29]
J. Weiss. How news organizations are using tabula for data journalism, Sept. 2013.
[30]
K. Q. Zhu, K. Fisher, and D. Walker. Learnpads++: Incremental inference of ad hoc data formats. In Proceedings of the 14th International Conference on Practical Aspects of Declarative Languages, PADL’12, pp. 168–182, Berlin, Heidelberg, 2012. Springer-Verlag. Introduction Flare Language FlashRelate Synthesis Algorithm Definitions Algorithm Step 1: Determine Cell Constraints Step 2: Determine Spatial Constraints Step 3: Find a Satisfying Set of Constraints Implementation Details Complexity Analysis Flare Run-Time FlashRelate Run-Time Evaluation Benchmark Spreadsheets and Tasks Experimental Setup Results Case Study Related Work Conclusion

Cited By

View all
  • (2024)Anomaly Detection in IoT: Recent Advances, AI and ML Perspectives and ApplicationsAnomaly Detection - Recent Advances, AI and ML Perspectives and Applications10.5772/intechopen.111944Online publication date: 17-Jan-2024
  • (2024)Contextualized Data-Wrangling Code Generation in Computational NotebooksProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering10.1145/3691620.3695503(1282-1294)Online publication date: 27-Oct-2024
  • (2024)Auto-Tables: Relationalize Tables without Using ExamplesACM SIGMOD Record10.1145/3665252.366526953:1(76-85)Online publication date: 14-May-2024
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM SIGPLAN Notices
ACM SIGPLAN Notices  Volume 50, Issue 6
PLDI '15
June 2015
630 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/2813885
  • Editor:
  • Andy Gill
Issue’s Table of Contents
  • cover image ACM Conferences
    PLDI '15: Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation
    June 2015
    630 pages
    ISBN:9781450334686
    DOI:10.1145/2737924
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 03 June 2015
Published in SIGPLAN Volume 50, Issue 6

Check for updates

Author Tags

  1. data extraction
  2. program synthesis
  3. relational data
  4. spreadsheets

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)49
  • Downloads (Last 6 weeks)0
Reflects downloads up to 28 Jan 2025

Other Metrics

Citations

Cited By

View all
  • (2024)Anomaly Detection in IoT: Recent Advances, AI and ML Perspectives and ApplicationsAnomaly Detection - Recent Advances, AI and ML Perspectives and Applications10.5772/intechopen.111944Online publication date: 17-Jan-2024
  • (2024)Contextualized Data-Wrangling Code Generation in Computational NotebooksProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering10.1145/3691620.3695503(1282-1294)Online publication date: 27-Oct-2024
  • (2024)Auto-Tables: Relationalize Tables without Using ExamplesACM SIGMOD Record10.1145/3665252.366526953:1(76-85)Online publication date: 14-May-2024
  • (2024)Visual Cues for Data Analysis Features Amplify Challenges for Blind Spreadsheet UsersProceedings of the 2024 CHI Conference on Human Factors in Computing Systems10.1145/3613904.3642753(1-16)Online publication date: 11-May-2024
  • (2023)Auto-Tables: Synthesizing Multi-Step Transformations to Relationalize Tables without Using ExamplesProceedings of the VLDB Endowment10.14778/3611479.361153416:11(3391-3403)Online publication date: 24-Aug-2023
  • (2023)Cross Modal Data Discovery over Structured and Unstructured Data LakesProceedings of the VLDB Endowment10.14778/3611479.361153316:11(3377-3390)Online publication date: 24-Aug-2023
  • (2023)How Large Language Models Will Disrupt Data ManagementProceedings of the VLDB Endowment10.14778/3611479.361152716:11(3302-3309)Online publication date: 24-Aug-2023
  • (2023)Check Out the Big Brain on BRAD: Simplifying Cloud Data Processing with Learned Automated Data MeshesProceedings of the VLDB Endowment10.14778/3611479.361152616:11(3293-3301)Online publication date: 24-Aug-2023
  • (2023)Through the Fairness Lens: Experimental Analysis and Evaluation of Entity MatchingProceedings of the VLDB Endowment10.14778/3611479.361152516:11(3279-3292)Online publication date: 24-Aug-2023
  • (2023)FlashAlloc: Dedicating Flash Blocks by ObjectsProceedings of the VLDB Endowment10.14778/3611479.361152416:11(3266-3278)Online publication date: 24-Aug-2023
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media