Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
10.1145/3318464.3389738acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

Auto-Suggest: Learning-to-Recommend Data Preparation Steps Using Data Science Notebooks

Published: 31 May 2020 Publication History

Abstract

Data preparation is widely recognized as the most time-consuming process in modern business intelligence (BI) and machine learning (ML) projects. Automating complex data preparation steps (e.g., Pivot, Unpivot, Normalize-JSON, etc.)holds the potential to greatly improve user productivity, and has therefore become a central focus of research. We propose a novel approach to "auto-suggest" contextualized data preparation steps, by "learning" from how data scientists would manipulate data, which are documented by data science notebooks widely available today. Specifically, we crawled over 4M Jupyter notebooks on GitHub, and replayed them step-by-step, to observe not only full input/output tables (data-frames) at each step, but also the exact data-preparation choices data scientists make that they believe are best suited to the input data (e.g., how input tables are Joined/Pivoted/Unpivoted, etc.). By essentially "logging" how data scientists interact with diverse tables, and using the resulting logs as a proxy of "ground truth", we can learn-to-recommend data preparation steps best suited to given user data, just like how search engines (Google or Bing) leverage their click-through logs to learn-to-rank documents. This data-driven and log-driven approach leverages the "collective wisdom" of data scientists embodied in the notebooks, and is shown to significantly outperform strong baselines including commercial systems in terms of accuracy.

Supplementary Material

MP4 File (3318464.3389738.mp4)
Presentation Video

References

[1]
Add-Column-From-Examples (in Power BI). http://powerbi.microsoft.com/en-us/blog/power-bi-desktop-june-feature-summary/#addColumn.
[2]
Dresden Web Table Corpus. https://help.tableau.com/current/pro/desktop/en-us/datafields_typesandroles.htm.
[3]
Excel Forum: Questions tagged with Pivot. https://techcommunity.microsoft.com/t5/tag/pivot table/tg-p/board-id/ExcelGeneral.
[4]
FlashFill in Excel. https://www.microsoft.com/en-us/microsoft-365/blog/2012/08/09/flash-fill/.
[5]
Github api. https://developer.github.com/v3/.
[6]
Informatica Enterprise Data Prep. https://www.informatica.com/products/data-catalog/enterprise-data-prep.html.
[7]
Informatica Enterprise Data Prep. https://github.com/enterprise.
[8]
Jupyter notebooks. https://jupyter.org/.
[9]
Kaggle. https://www.kaggle.com/.
[10]
Kaggle dataset api. https://www.kaggle.com/docs/apiinteracting-with-datasets.
[11]
Keras. https://keras.io/.
[12]
Market guide for data preparation - gartner, 2017.
[13]
Pandas data frame list of apis. https://pandas.pydata.org/pandas-docs/stable/reference/frame.html.
[14]
Pandas groupby operator api. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html.
[15]
Pandas json normalize operator api. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.io.json.json_normalize.html.
[16]
Pandas melt operator api. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.melt.html.
[17]
Pandas merge operator api. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html.
[18]
Pandas pivot operator api. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pivot_table.html.
[19]
Paxata data preparation. https://www.paxata.com/.
[20]
Power bi. https://docs.microsoft.com/en-us/power-bi/desktop-data-types.
[21]
Power BI Forum: How To Create Pivot Tables. https://community.powerbi.com/t5/Desktop/Create-Pivot-table/td-p/627586.
[22]
python tracing library. https://pymotw.com/2/trace/.
[23]
Salesforce Einstein Data Prep Recipe: Smart Suggestions. https://help.salesforce.com/articleView?id=bi_integrate_recipe_column_profile_suggestions.htm&type=5.
[24]
StackOverflow: How To Create Pivot. https://stackoverflow.com/questions/47152691/how-to-pivot-a-dataframe.
[25]
StackOverflow: Questions tagged with Pivot. https://stackoverflow.com/questions/tagged/pivot.
[26]
Tableau prep. https://www.tableau.com/products/prep.
[27]
Transform Data by Example (from Microsoft Office Store). https://store.office.com/en-us/app.aspx?assetid=WA104380727&ui=en-US&rs=en-US&ad=US&appredirect=false.
[28]
Trifacta. https://www.trifacta.com/.
[29]
Trifacta Forum: How to create Pivot. https://community.trifacta.com/s/question/0D51L0000745dOdSAI/how-to-use-pivot-table-please-provide-one-example.
[30]
Trifacta: Predictive Transformations. https://docs.trifacta.com/display/SS/Overview+of+Predictive+Transformation.
[31]
Z. Abedjan, X. Chu, D. Deng, R. C. Fernandez, I. F. Ilyas, M. Ouzzani, P. Papotti, M. Stonebraker, and N. Tang. Detecting data errors: Where are we and what needs to be done? VLDB, 9(12), 2016.
[32]
Z. Abedjan, J. Morcos, I. F. Ilyas, M. Ouzzani, P. Papotti, and M. Stonebraker. Dataxformer: A robust transformation discovery system. In 2016 IEEE 32nd International Conference on Data Engineering (ICDE), pages 1134--1145. IEEE, 2016.
[33]
A. Arasu, M. Cherniack, E. Galvez, D. Maier, A. S. Maskey, E. Ryvkina, M. Stonebraker, and R. Tibbetts. Linear road: a stream data management benchmark. In Proceedings of the Thirtieth international conference on Very large data bases-Volume 30, pages 480--491, 2004.
[34]
Y. Asahiro, R. Hassin, and K. Iwama. Complexity of finding dense subgraphs. Discrete Applied Mathematics, 121(1--3):15--26, 2002.
[35]
D. W. Barowy, S. Gulwani, T. Hart, and B. Zorn. Flashrelate: extracting relational data from semi-structured spreadsheets using examples. ACM SIGPLAN Notices, 50(6):218--228, 2015.
[36]
Z. Chen, V. Narasayya, and S. Chaudhuri. Fast foreign-key detection in microsoft sql server powerpivot for excel. Proceedings of the VLDB Endowment, 7(13):1417--1428, 2014.
[37]
X. Chu, Y. He, K. Chakrabarti, and K. Ganjam. Tegra: Table extraction by global record alignment. In SIGMOD, 2015.
[38]
R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning, pages 160--167. ACM, 2008.
[39]
T. Dasu, T. Johnson, S. Muthukrishnan, and V. Shkapenyuk. Mining database structure; or, how to build a data quality browser. In SIGMOD, 2002.
[40]
D. Deng, R. C. Fernandez, Z. Abedjan, S. Wang, M. Stonebraker, A. K. Elmagarmid, I. F. Ilyas, S. Madden, M. Ouzzani, and N. Tang. The data civilizer system. In Cidr, 2017.
[41]
H. Elmeleegy, J. Madhavan, and A. Halevy. Harvesting relational tables from lists on the web. Proceedings of the VLDB Endowment, 2(1):1078--1089, 2009.
[42]
V. Ercegovac, D. J. DeWitt, and R. Ramakrishnan. The texture benchmark: measuring performance of text queries on a relational dbms. In Proceedings of the 31st international conference on Very large data bases, pages 313--324, 2005.
[43]
A. Folting, K. Tupaj, R. C. Collie, and A. V. Grabar. Automated placement of fields in a data summary table, us patent 7,480,675, filed by microsoft, 2009. US Patent 7,480,675.
[44]
F. Funke, A. Kemper, and T. Neumann. Benchmarking hybrid oltp&olap database systems. Datenbanksysteme für Business, Technologie und Web (BTW), 2011.
[45]
Y. Gao, S. Huang, and A. Parameswaran. Navigating the data lake with datamaran: Automatically extracting structure from log datasets. In Proceedings of the 2018 International Conference on Management of Data, pages 943--958, 2018.
[46]
J. Gennick. SQL Pocket Guide: A Guide to SQL Usage. " O'Reilly Media, Inc.", 2010.
[47]
O. Goldschmidt and D. S. Hochbaum. A polynomial algorithm for the k-cut problem for fixed k. Mathematics of operations research, 19(1):24--37, 1994.
[48]
I. Goodfellow, Y. Bengio, and A. Courville. Deep learning. MIT press, 2016.
[49]
S. Gulwani, W. R. Harris, and R. Singh. Spreadsheet data manipulation using examples. Communications of the ACM, 55(8), 2012.
[50]
W. R. Harris and S. Gulwani. Spreadsheet table transformations from examples. In ACM SIGPLAN Notices, volume 46. ACM, 2011.
[51]
Y. He, X. Chu, K. Ganjam, Y. Zheng, V. Narasayya, and S. Chaudhuri. Transform-data-by-example (TDE): an extensible search engine for data transformations. VLDB, 11(10), 2018.
[52]
Y. He, K. Ganjam, and X. Chu. SEMA-JOIN: joining semantically-related tables using big table corpora. VLDB, 8(12), 2015.
[53]
J. Heer, J. M. Hellerstein, and S. Kandel. Predictive interaction for data transformation. In CIDR, 2015.
[54]
A. Heidari, J. McGrath, I. F. Ilyas, and T. Rekatsinas. Holodetect: Few-shot learning for error detection. In Proceedings of the 2019 International Conference on Management of Data, pages 829--846, 2019.
[55]
Z. Huang and Y. He. Auto-Detect: Data-Driven Error Detection in Tables. In SIGMOD, 2018.
[56]
L. Jiang and F. Naumann. Holistic primary key and foreign key detection. Journal of Intelligent Information Systems, pages 1--23, 2019.
[57]
Z. Jin, M. R. Anderson, M. Cafarella, and H. Jagadish. Foofah: Transforming data by example. In Proceedings of the 2017 ACM International Conference on Management of Data, pages 683--698. ACM, 2017.
[58]
N. Kabra and Y. Saillet. Data de-duplication, 2019. US Patent 10,387,389.
[59]
S. Kandel, A. Paepcke, J. Hellerstein, and J. Heer. Wrangler: Interactive visual specification of data transformation scripts. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 3363--3372. ACM, 2011.
[60]
N. Khoussainova, Y. Kwon, M. Balazinska, and D. Suciu. Snipsuggest: Context-aware autocompletion for sql. Proceedings of the VLDB Endowment, 4(1):22--33, 2010.
[61]
V. Le and S. Gulwani. Flashextract: a framework for data extraction by examples. In ACM SIGPLAN Notices, volume 49. ACM, 2014.
[62]
S. H. Lee, S. J. Kim, and W. Kim. The bord benchmark for object-relational databases. In International Conference on Database and Expert Systems Applications, pages 6--20. Springer, 2000.
[63]
T.-Y. Liu et al. Learning to rank for information retrieval. Foundations and Trends® in Information Retrieval, 3(3):225--331, 2009.
[64]
E. Loper and S. Bird. Nltk: the natural language toolkit. arXiv preprint cs/0205028, 2002.
[65]
J. Mackinlay, P. Hanrahan, and C. Stolte. Show me: Automatic presentation for visual analysis. IEEE transactions on visualization and computer graphics, 13(6):1137--1144, 2007.
[66]
C. D. Manning, C. D. Manning, and H. Schütze. Foundations of statistical natural language processing. MIT press, 1999.
[67]
T. Mikolov, M. Karafiát, L. Burget, J. vC ernockỳ, and S. Khudanpur. Recurrent neural network based language model. In Eleventh annual conference of the international speech communication association, 2010.
[68]
C. Ordonez. Horizontal aggregations for building tabular data sets. In Proceedings of the 9th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery, pages 35--42, 2004.
[69]
S. Ortona, G. Orsi, M. Buoncristiano, and T. Furche. Wadar: Joint wrapper and data repair. Proceedings of the VLDB Endowment, 8(12):1996--1999, 2015.
[70]
W. M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical association, 66(336):846--850, 1971.
[71]
A. Rostin, O. Albrecht, J. Bauckmann, F. Naumann, and U. Leser. A machine learning approach to foreign key discovery. In WebDB, 2009.
[72]
S. Schelter, D. Lange, P. Schmidt, M. Celikel, and F. Biessmann. Automating largescale data quality verification. In VLDB, 2018.
[73]
A. Schmidt, F. Waas, M. Kersten, M. J. Carey, I. Manolescu, and R. Busse. Xmark: A benchmark for xml data management. In VLDB'02: Proceedings of the 28th International Conference on Very Large Databases, pages 974--985. Elsevier, 2002.
[74]
H. Schütze, C. D. Manning, and P. Raghavan. Introduction to information retrieval. In Proceedings of the international communication of association for computing machinery conference, page 260, 2008.
[75]
A. O. Shigarov and A. A. Mikhailov. Rule-based spreadsheet data transformation from arbitrary to relational tables. Information Systems, 71:123--136, 2017.
[76]
R. Singh. Blinkfill: Semi-supervised programming by example for syntactic string transformations. VLDB, 9(10), 2016.
[77]
R. Singh, B. Livshits, and B. Zorn. Melford: Using neural networks to find spreadsheet errors. MSR technical report.
[78]
M. Stoer and F. Wagner. A simple min-cut algorithm. Journal of the ACM (JACM), 44(4):585--591, 1997.
[79]
G. Verbruggen and L. De Raedt. Automatically wrangling spreadsheets into machine learning data formats. In International Symposium on Intelligent Data Analysis, pages 367--379. Springer, 2018.
[80]
P. Wang and Y. He. Uni-detect: A unified approach to automated error detection in tables. In Proceedings of the 2019 International Conference on Management of Data, pages 811--828, 2019.
[81]
Y. Wang and Y. He. Synthesizing mapping relationships using table corpus. In SIGMOD, 2017.
[82]
J. N. Yan, O. Schulte, M. Zhang, J. Wang, and R. Cheng. Scoded: Statistical constraint oriented data error detection, 2020.
[83]
M. Zhang, M. Hadjieleftheriou, B. C. Ooi, C. M. Procopiuc, and D. Srivastava. On multi-column foreign key discovery. Proceedings of the VLDB Endowment, 3(1--2):805--814, 2010.
[84]
E. Zhu, Y. He, and S. Chaudhuri. Auto-join: Joining tables by leveraging transformations. VLDB, 10(10), 2017.

Cited By

View all

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
June 2020
2925 pages
ISBN:9781450367356
DOI:10.1145/3318464
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 31 May 2020

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. data preparation
  2. data wrangling
  3. learning-to-recommend
  4. pivot and unpivot

Qualifiers

  • Research-article

Conference

SIGMOD/PODS '20
Sponsor:

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)250
  • Downloads (Last 6 weeks)26
Reflects downloads up to 30 Aug 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Searching Data Lakes for Nested and Joined DataProceedings of the VLDB Endowment10.14778/3681954.368200517:11(3346-3359)Online publication date: 30-Aug-2024
  • (2024)Chorus: Foundation Models for Unified Data Discovery and ExplorationProceedings of the VLDB Endowment10.14778/3659437.365946117:8(2104-2114)Online publication date: 1-Apr-2024
  • (2024)Auto-Tables: Relationalize Tables without Using ExamplesACM SIGMOD Record10.1145/3665252.366526953:1(76-85)Online publication date: 14-May-2024
  • (2024)Auto-Formula: Recommend Formulas in Spreadsheets using Contrastive Learning for Table RepresentationsProceedings of the ACM on Management of Data10.1145/36549252:3(1-27)Online publication date: 30-May-2024
  • (2024)Data Formulator: AI-Powered Concept-Driven Visualization AuthoringIEEE Transactions on Visualization and Computer Graphics10.1109/TVCG.2023.332658530:1(1128-1138)Online publication date: 1-Jan-2024
  • (2024)Architecture of Data Recognition System Using Machine Learning and TensorFlow2024 2nd International Conference on Intelligent Data Communication Technologies and Internet of Things (IDCIoT)10.1109/IDCIoT59759.2024.10467371(1694-1699)Online publication date: 4-Jan-2024
  • (2024)Xorbits: Automating Operator Tiling for Distributed Data Science2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00392(5211-5223)Online publication date: 13-May-2024
  • (2024)KGLiDS: A Platform for Semantic Abstraction, Linking, and Automation of Data Science2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00021(179-192)Online publication date: 13-May-2024
  • (2024)Data pipeline quality: Influencing factors, root causes of data-related issues, and processing problem areas for developersJournal of Systems and Software10.1016/j.jss.2023.111855207(111855)Online publication date: Jan-2024
  • (2024)Improving Understandability and Control in Data Preparation: A Human-Centered ApproachAdvanced Information Systems Engineering10.1007/978-3-031-61057-8_17(284-299)Online publication date: 3-Jun-2024
  • Show More Cited By

View Options

Get Access

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media