Auto-Formula: Recommend Formulas in Spreadsheets using Contrastive Learning for Table Representations

Authors:

Dongmei Zhang, and

Surajit Chaudhuri

Proceedings of the ACM on Management of Data, Volume 2, Issue 3

Article No.: 122, Pages 1 - 27

https://doi.org/10.1145/3654925

Published: 30 May 2024

Abstract

Spreadsheets are widely recognized as the most popular end-user programming tools, blending the power of formula-based computation with an intuitive table-based interface. Today, spreadsheets are used by billions of users to manipulate tables, most of whom are neither database experts nor professional programmers.

Despite the success of spreadsheets, authoring complex formulas remains challenging, as non-technical users need to look up and understand non-trivial formula syntax. To address this pain point, we leverage the observation that there is often an abundance of similar-looking spreadsheets in the same organization, which not only have similar data but also share similar computation logic encoded as formulas. We develop an Auto-Formula system that can accurately predict the formulas users want to author in a target spreadsheet cell by learning and adapting formulas that already exist in similar spreadsheets, using contrastive-learning techniques inspired by "similar-face recognition" from computer vision. Extensive evaluations on over 2K test formulas extracted from real enterprise spreadsheets show the effectiveness of Auto-Formula over alternatives. Our benchmark data is available at https://github.com/microsoft/Auto-Formula to facilitate future research.
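The retrieval idea described above can be illustrated with a minimal sketch. This is not the Auto-Formula implementation: the embeddings are hand-written placeholder vectors rather than learned table representations, the file names are hypothetical, and the loss is the generic triplet formulation familiar from face-recognition work, used here only to show what "pull similar sheets together, push dissimilar sheets apart" means numerically.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on cosine distance (1 - cosine similarity).

    The loss is zero once the positive is closer to the anchor than the
    negative by at least `margin` (an assumed, illustrative value).
    """
    d_pos = 1.0 - cosine(anchor, positive)
    d_neg = 1.0 - cosine(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


def most_similar(target, corpus):
    """Return the name of the corpus entry whose embedding is closest to target."""
    return max(corpus, key=lambda name: cosine(target, corpus[name]))


# Hypothetical embeddings standing in for learned sheet representations.
corpus = {
    "sales_q1.xlsx": [0.9, 0.1, 0.0],
    "hr_roster.xlsx": [0.0, 0.2, 0.9],
}
target = [0.8, 0.2, 0.1]

best = most_similar(target, corpus)
loss = triplet_loss(target, corpus["sales_q1.xlsx"], corpus["hr_roster.xlsx"])
print(best, loss)
```

In a retrieval setting of this kind, the formulas found in the nearest sheet would then serve as candidates to adapt for the target cell; that adaptation step is not shown here.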

Index Terms

Auto-Formula: Recommend Formulas in Spreadsheets using Contrastive Learning for Table Representations
1. Information systems
  1. Data management systems

Published In

Proceedings of the ACM on Management of Data Volume 2, Issue 3

SIGMOD

June 2024

1953 pages

EISSN:2836-6573

DOI:10.1145/3670010

Editor:
Divyakant Agrawal
UC Santa Barbara, United States

Copyright © 2024 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States
