research-article

Auto-Suggest: Learning-to-Recommend Data Preparation Steps Using Data Science Notebooks

Authors:

Yeye HeAuthors Info & Claims

SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data

Pages 1539 - 1554

https://doi.org/10.1145/3318464.3389738

Published: 31 May 2020 Publication History

Abstract

Data preparation is widely recognized as the most time-consuming process in modern business intelligence (BI) and machine learning (ML) projects. Automating complex data preparation steps (e.g., Pivot, Unpivot, Normalize-JSON, etc.)holds the potential to greatly improve user productivity, and has therefore become a central focus of research. We propose a novel approach to "auto-suggest" contextualized data preparation steps, by "learning" from how data scientists would manipulate data, which are documented by data science notebooks widely available today. Specifically, we crawled over 4M Jupyter notebooks on GitHub, and replayed them step-by-step, to observe not only full input/output tables (data-frames) at each step, but also the exact data-preparation choices data scientists make that they believe are best suited to the input data (e.g., how input tables are Joined/Pivoted/Unpivoted, etc.). By essentially "logging" how data scientists interact with diverse tables, and using the resulting logs as a proxy of "ground truth", we can learn-to-recommend data preparation steps best suited to given user data, just like how search engines (Google or Bing) leverage their click-through logs to learn-to-rank documents. This data-driven and log-driven approach leverages the "collective wisdom" of data scientists embodied in the notebooks, and is shown to significantly outperform strong baselines including commercial systems in terms of accuracy.

Supplementary Material

MP4 File (3318464.3389738.mp4)

Presentation Video

Download
102.52 MB

References

[1]

Add-Column-From-Examples (in Power BI). http://powerbi.microsoft.com/en-us/blog/power-bi-desktop-june-feature-summary/#addColumn.

[2]

Dresden Web Table Corpus. https://help.tableau.com/current/pro/desktop/en-us/datafields_typesandroles.htm.

[3]

Excel Forum: Questions tagged with Pivot. https://techcommunity.microsoft.com/t5/tag/pivot table/tg-p/board-id/ExcelGeneral.

[4]

FlashFill in Excel. https://www.microsoft.com/en-us/microsoft-365/blog/2012/08/09/flash-fill/.

[5]

Github api. https://developer.github.com/v3/.

[6]

Informatica Enterprise Data Prep. https://www.informatica.com/products/data-catalog/enterprise-data-prep.html.

[7]

Informatica Enterprise Data Prep. https://github.com/enterprise.

[8]

Jupyter notebooks. https://jupyter.org/.

[9]

Kaggle. https://www.kaggle.com/.

[10]

Kaggle dataset api. https://www.kaggle.com/docs/apiinteracting-with-datasets.

[11]

Keras. https://keras.io/.

[12]

Market guide for data preparation - gartner, 2017.

[13]

Pandas data frame list of apis. https://pandas.pydata.org/pandas-docs/stable/reference/frame.html.

[14]

Pandas groupby operator api. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html.

[15]

Pandas json normalize operator api. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.io.json.json_normalize.html.

[16]

Pandas melt operator api. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.melt.html.

[17]

Pandas merge operator api. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html.

[18]

Pandas pivot operator api. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pivot_table.html.

[19]

Paxata data preparation. https://www.paxata.com/.

[20]

Power bi. https://docs.microsoft.com/en-us/power-bi/desktop-data-types.

[21]

Power BI Forum: How To Create Pivot Tables. https://community.powerbi.com/t5/Desktop/Create-Pivot-table/td-p/627586.

[22]

python tracing library. https://pymotw.com/2/trace/.

[23]

Salesforce Einstein Data Prep Recipe: Smart Suggestions. https://help.salesforce.com/articleView?id=bi_integrate_recipe_column_profile_suggestions.htm&type=5.

[24]

StackOverflow: How To Create Pivot. https://stackoverflow.com/questions/47152691/how-to-pivot-a-dataframe.

[25]

StackOverflow: Questions tagged with Pivot. https://stackoverflow.com/questions/tagged/pivot.

[26]

Tableau prep. https://www.tableau.com/products/prep.

[27]

Transform Data by Example (from Microsoft Office Store). https://store.office.com/en-us/app.aspx?assetid=WA104380727&ui=en-US&rs=en-US&ad=US&appredirect=false.

[28]

Trifacta. https://www.trifacta.com/.

[29]

Trifacta Forum: How to create Pivot. https://community.trifacta.com/s/question/0D51L0000745dOdSAI/how-to-use-pivot-table-please-provide-one-example.

[30]

Trifacta: Predictive Transformations. https://docs.trifacta.com/display/SS/Overview+of+Predictive+Transformation.

[31]

Z. Abedjan, X. Chu, D. Deng, R. C. Fernandez, I. F. Ilyas, M. Ouzzani, P. Papotti, M. Stonebraker, and N. Tang. Detecting data errors: Where are we and what needs to be done? VLDB, 9(12), 2016.

Digital Library

[32]

Z. Abedjan, J. Morcos, I. F. Ilyas, M. Ouzzani, P. Papotti, and M. Stonebraker. Dataxformer: A robust transformation discovery system. In 2016 IEEE 32nd International Conference on Data Engineering (ICDE), pages 1134--1145. IEEE, 2016.

[33]

A. Arasu, M. Cherniack, E. Galvez, D. Maier, A. S. Maskey, E. Ryvkina, M. Stonebraker, and R. Tibbetts. Linear road: a stream data management benchmark. In Proceedings of the Thirtieth international conference on Very large data bases-Volume 30, pages 480--491, 2004.

Digital Library

[34]

Y. Asahiro, R. Hassin, and K. Iwama. Complexity of finding dense subgraphs. Discrete Applied Mathematics, 121(1--3):15--26, 2002.

[35]

D. W. Barowy, S. Gulwani, T. Hart, and B. Zorn. Flashrelate: extracting relational data from semi-structured spreadsheets using examples. ACM SIGPLAN Notices, 50(6):218--228, 2015.

Digital Library

[36]

Z. Chen, V. Narasayya, and S. Chaudhuri. Fast foreign-key detection in microsoft sql server powerpivot for excel. Proceedings of the VLDB Endowment, 7(13):1417--1428, 2014.

Digital Library

[37]

X. Chu, Y. He, K. Chakrabarti, and K. Ganjam. Tegra: Table extraction by global record alignment. In SIGMOD, 2015.

Digital Library

[38]

R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning, pages 160--167. ACM, 2008.

Digital Library

[39]

T. Dasu, T. Johnson, S. Muthukrishnan, and V. Shkapenyuk. Mining database structure; or, how to build a data quality browser. In SIGMOD, 2002.

Digital Library

[40]

D. Deng, R. C. Fernandez, Z. Abedjan, S. Wang, M. Stonebraker, A. K. Elmagarmid, I. F. Ilyas, S. Madden, M. Ouzzani, and N. Tang. The data civilizer system. In Cidr, 2017.

[41]

H. Elmeleegy, J. Madhavan, and A. Halevy. Harvesting relational tables from lists on the web. Proceedings of the VLDB Endowment, 2(1):1078--1089, 2009.

Digital Library

[42]

V. Ercegovac, D. J. DeWitt, and R. Ramakrishnan. The texture benchmark: measuring performance of text queries on a relational dbms. In Proceedings of the 31st international conference on Very large data bases, pages 313--324, 2005.

[43]

A. Folting, K. Tupaj, R. C. Collie, and A. V. Grabar. Automated placement of fields in a data summary table, us patent 7,480,675, filed by microsoft, 2009. US Patent 7,480,675.

[44]

F. Funke, A. Kemper, and T. Neumann. Benchmarking hybrid oltp&olap database systems. Datenbanksysteme für Business, Technologie und Web (BTW), 2011.

[45]

Y. Gao, S. Huang, and A. Parameswaran. Navigating the data lake with datamaran: Automatically extracting structure from log datasets. In Proceedings of the 2018 International Conference on Management of Data, pages 943--958, 2018.

Digital Library

[46]

J. Gennick. SQL Pocket Guide: A Guide to SQL Usage. " O'Reilly Media, Inc.", 2010.

[47]

O. Goldschmidt and D. S. Hochbaum. A polynomial algorithm for the k-cut problem for fixed k. Mathematics of operations research, 19(1):24--37, 1994.

[48]

I. Goodfellow, Y. Bengio, and A. Courville. Deep learning. MIT press, 2016.

Digital Library

[49]

S. Gulwani, W. R. Harris, and R. Singh. Spreadsheet data manipulation using examples. Communications of the ACM, 55(8), 2012.

[50]

W. R. Harris and S. Gulwani. Spreadsheet table transformations from examples. In ACM SIGPLAN Notices, volume 46. ACM, 2011.

Digital Library

[51]

Y. He, X. Chu, K. Ganjam, Y. Zheng, V. Narasayya, and S. Chaudhuri. Transform-data-by-example (TDE): an extensible search engine for data transformations. VLDB, 11(10), 2018.

Digital Library

[52]

Y. He, K. Ganjam, and X. Chu. SEMA-JOIN: joining semantically-related tables using big table corpora. VLDB, 8(12), 2015.

Digital Library

[53]

J. Heer, J. M. Hellerstein, and S. Kandel. Predictive interaction for data transformation. In CIDR, 2015.

[54]

A. Heidari, J. McGrath, I. F. Ilyas, and T. Rekatsinas. Holodetect: Few-shot learning for error detection. In Proceedings of the 2019 International Conference on Management of Data, pages 829--846, 2019.

Digital Library

[55]

Z. Huang and Y. He. Auto-Detect: Data-Driven Error Detection in Tables. In SIGMOD, 2018.

Digital Library

[56]

L. Jiang and F. Naumann. Holistic primary key and foreign key detection. Journal of Intelligent Information Systems, pages 1--23, 2019.

Digital Library

[57]

Z. Jin, M. R. Anderson, M. Cafarella, and H. Jagadish. Foofah: Transforming data by example. In Proceedings of the 2017 ACM International Conference on Management of Data, pages 683--698. ACM, 2017.

Digital Library

[58]

N. Kabra and Y. Saillet. Data de-duplication, 2019. US Patent 10,387,389.

[59]

S. Kandel, A. Paepcke, J. Hellerstein, and J. Heer. Wrangler: Interactive visual specification of data transformation scripts. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 3363--3372. ACM, 2011.

Digital Library

[60]

N. Khoussainova, Y. Kwon, M. Balazinska, and D. Suciu. Snipsuggest: Context-aware autocompletion for sql. Proceedings of the VLDB Endowment, 4(1):22--33, 2010.

Digital Library

[61]

V. Le and S. Gulwani. Flashextract: a framework for data extraction by examples. In ACM SIGPLAN Notices, volume 49. ACM, 2014.

Digital Library

[62]

S. H. Lee, S. J. Kim, and W. Kim. The bord benchmark for object-relational databases. In International Conference on Database and Expert Systems Applications, pages 6--20. Springer, 2000.

[63]

T.-Y. Liu et al. Learning to rank for information retrieval. Foundations and Trends® in Information Retrieval, 3(3):225--331, 2009.

Digital Library

[64]

E. Loper and S. Bird. Nltk: the natural language toolkit. arXiv preprint cs/0205028, 2002.

Digital Library

[65]

J. Mackinlay, P. Hanrahan, and C. Stolte. Show me: Automatic presentation for visual analysis. IEEE transactions on visualization and computer graphics, 13(6):1137--1144, 2007.

[66]

C. D. Manning, C. D. Manning, and H. Schütze. Foundations of statistical natural language processing. MIT press, 1999.

Digital Library

[67]

T. Mikolov, M. Karafiát, L. Burget, J. vC ernockỳ, and S. Khudanpur. Recurrent neural network based language model. In Eleventh annual conference of the international speech communication association, 2010.

[68]

C. Ordonez. Horizontal aggregations for building tabular data sets. In Proceedings of the 9th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery, pages 35--42, 2004.

Digital Library

[69]

S. Ortona, G. Orsi, M. Buoncristiano, and T. Furche. Wadar: Joint wrapper and data repair. Proceedings of the VLDB Endowment, 8(12):1996--1999, 2015.

Digital Library

[70]

W. M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical association, 66(336):846--850, 1971.

[71]

A. Rostin, O. Albrecht, J. Bauckmann, F. Naumann, and U. Leser. A machine learning approach to foreign key discovery. In WebDB, 2009.

[72]

S. Schelter, D. Lange, P. Schmidt, M. Celikel, and F. Biessmann. Automating largescale data quality verification. In VLDB, 2018.

[73]

A. Schmidt, F. Waas, M. Kersten, M. J. Carey, I. Manolescu, and R. Busse. Xmark: A benchmark for xml data management. In VLDB'02: Proceedings of the 28th International Conference on Very Large Databases, pages 974--985. Elsevier, 2002.

[74]

H. Schütze, C. D. Manning, and P. Raghavan. Introduction to information retrieval. In Proceedings of the international communication of association for computing machinery conference, page 260, 2008.

[75]

A. O. Shigarov and A. A. Mikhailov. Rule-based spreadsheet data transformation from arbitrary to relational tables. Information Systems, 71:123--136, 2017.

Digital Library

[76]

R. Singh. Blinkfill: Semi-supervised programming by example for syntactic string transformations. VLDB, 9(10), 2016.

Digital Library

[77]

R. Singh, B. Livshits, and B. Zorn. Melford: Using neural networks to find spreadsheet errors. MSR technical report.

[78]

M. Stoer and F. Wagner. A simple min-cut algorithm. Journal of the ACM (JACM), 44(4):585--591, 1997.

[79]

G. Verbruggen and L. De Raedt. Automatically wrangling spreadsheets into machine learning data formats. In International Symposium on Intelligent Data Analysis, pages 367--379. Springer, 2018.

[80]

P. Wang and Y. He. Uni-detect: A unified approach to automated error detection in tables. In Proceedings of the 2019 International Conference on Management of Data, pages 811--828, 2019.

Digital Library

[81]

Y. Wang and Y. He. Synthesizing mapping relationships using table corpus. In SIGMOD, 2017.

Digital Library

[82]

J. N. Yan, O. Schulte, M. Zhang, J. Wang, and R. Cheng. Scoded: Statistical constraint oriented data error detection, 2020.

[83]

M. Zhang, M. Hadjieleftheriou, B. C. Ooi, C. M. Procopiuc, and D. Srivastava. On multi-column foreign key discovery. Proceedings of the VLDB Endowment, 3(1--2):805--814, 2010.

Digital Library

[84]

E. Zhu, Y. He, and S. Chaudhuri. Auto-join: Joining tables by leveraging transformations. VLDB, 10(10), 2017.

Digital Library

Cited By

Zhang YChen PIves Z(2024)Searching Data Lakes for Nested and Joined DataProceedings of the VLDB Endowment10.14778/3681954.368200517:11(3346-3359)Online publication date: 30-Aug-2024
https://doi.org/10.14778/3681954.3682005
Kayali MLykov AFountalis IVasiloglou NOlteanu DSuciu D(2024)Chorus: Foundation Models for Unified Data Discovery and ExplorationProceedings of the VLDB Endowment10.14778/3659437.365946117:8(2104-2114)Online publication date: 1-Apr-2024
https://dl.acm.org/doi/10.14778/3659437.3659461
Li PHe YYan CWang YChaudhuri S(2024)Auto-Tables: Relationalize Tables without Using ExamplesACM SIGMOD Record10.1145/3665252.366526953:1(76-85)Online publication date: 14-May-2024
https://dl.acm.org/doi/10.1145/3665252.3665269
Show More Cited By

Index Terms

Auto-Suggest: Learning-to-Recommend Data Preparation Steps Using Data Science Notebooks
1. Information systems
  1. Data management systems
    1. Information integration
      1. Data cleaning
      2. Extraction, transformation and loading
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Data integration

Recommendations

Data Preparation: A Survey of Commercial Tools

Raw data are often messy: they follow different encodings, records are not well structured, values do not adhere to patterns, etc. Such data are in general not fit to be ingested by downstream applications, such as data analytics tools, or even by data ...
Dirty Data in the Newsroom: Comparing Data Preparation in Journalism and Data Science
CHI '23: Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems

The work involved in gathering, wrangling, cleaning, and otherwise preparing data for analysis is often the most time consuming and tedious aspect of data work. Although many studies describe data preparation within the context of data science workflows,...
Data Preparation: A Technological Perspective and Review
Abstract
Data analysis often uses data sets that were collected for different purposes. Indeed, new insights are often obtained by combining data sets that were produced independently of each other, for example by combining data from outside an ...

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences

SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data

June 2020

2925 pages

ISBN:9781450367356

DOI:10.1145/3318464

General Chairs:
David Maier
Portland State University, USA
,
Rachel Pottinger
University of British Columbia, Canada
,
Program Chairs:
AnHai Doan
University of Wisconsin, USA
,
Wang-Chiew Tan
Megagon Labs, USA
,
Publications Chairs:
Abdussalam Alawini
University of Illinois at Urbana-Champaign, USA
,
Hung Q. Ngo
RelationalAI, USA

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 31 May 2020

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Conference

SIGMOD/PODS '20

Sponsor:

SIGMOD

SIGMOD/PODS '20: International Conference on Management of Data

June 14 - 19, 2020

OR, Portland, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

39
Total Citations
View Citations
1,264
Total Downloads

Downloads (Last 12 months)250
Downloads (Last 6 weeks)26

Reflects downloads up to 30 Aug 2024

Other Metrics

View Author Metrics

Citations

Cited By

Zhang YChen PIves Z(2024)Searching Data Lakes for Nested and Joined DataProceedings of the VLDB Endowment10.14778/3681954.368200517:11(3346-3359)Online publication date: 30-Aug-2024
https://doi.org/10.14778/3681954.3682005
Kayali MLykov AFountalis IVasiloglou NOlteanu DSuciu D(2024)Chorus: Foundation Models for Unified Data Discovery and ExplorationProceedings of the VLDB Endowment10.14778/3659437.365946117:8(2104-2114)Online publication date: 1-Apr-2024
https://dl.acm.org/doi/10.14778/3659437.3659461
Li PHe YYan CWang YChaudhuri S(2024)Auto-Tables: Relationalize Tables without Using ExamplesACM SIGMOD Record10.1145/3665252.366526953:1(76-85)Online publication date: 14-May-2024
https://dl.acm.org/doi/10.1145/3665252.3665269
Chen SHe YCui WFan JGe SZhang HZhang DChaudhuri S(2024)Auto-Formula: Recommend Formulas in Spreadsheets using Contrastive Learning for Table RepresentationsProceedings of the ACM on Management of Data10.1145/36549252:3(1-27)Online publication date: 30-May-2024
https://dl.acm.org/doi/10.1145/3654925
Wang CThompson JLee B(2024)Data Formulator: AI-Powered Concept-Driven Visualization AuthoringIEEE Transactions on Visualization and Computer Graphics10.1109/TVCG.2023.332658530:1(1128-1138)Online publication date: 1-Jan-2024
https://dl.acm.org/doi/10.1109/TVCG.2023.3326585
Tripura SLanka S(2024)Architecture of Data Recognition System Using Machine Learning and TensorFlow2024 2nd International Conference on Intelligent Data Communication Technologies and Internet of Things (IDCIoT)10.1109/IDCIoT59759.2024.10467371(1694-1699)Online publication date: 4-Jan-2024
https://doi.org/10.1109/IDCIoT59759.2024.10467371
Lu WHe KQin XLi CWang ZYuan TLiao XZhang FChen YDu X(2024)Xorbits: Automating Operator Tiling for Distributed Data Science2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00392(5211-5223)Online publication date: 13-May-2024
https://doi.org/10.1109/ICDE60146.2024.00392
Helali MMonjazeb NVashisth SCarrier PHelal ACavalcante AAmmar KHose KMansour E(2024)KGLiDS: A Platform for Semantic Abstraction, Linking, and Automation of Data Science2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00021(179-192)Online publication date: 13-May-2024
https://doi.org/10.1109/ICDE60146.2024.00021
Foidl HGolendukhina VRamler RFelderer M(2024)Data pipeline quality: Influencing factors, root causes of data-related issues, and processing problem areas for developersJournal of Systems and Software10.1016/j.jss.2023.111855207(111855)Online publication date: Jan-2024
https://doi.org/10.1016/j.jss.2023.111855
Pucci ESancricca CAndolina SCappiello CMatera MBarberio A(2024)Improving Understandability and Control in Data Preparation: A Human-Centered ApproachAdvanced Information Systems Engineering10.1007/978-3-031-61057-8_17(284-299)Online publication date: 3-Jun-2024
https://doi.org/10.1007/978-3-031-61057-8_17
Show More Cited By

View Options

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents