research-article
Open access

Auto-Formula: Recommend Formulas in Spreadsheets using Contrastive Learning for Table Representations

Published: 30 May 2024

    Abstract

    Spreadsheets are widely recognized as the most popular end-user programming tools, blending the power of formula-based computation with an intuitive table-based interface. Today, spreadsheets are used by billions of users to manipulate tables, most of whom are neither database experts nor professional programmers.
    Despite the success of spreadsheets, authoring complex formulas remains challenging, as non-technical users need to look up and understand non-trivial formula syntax. To address this pain point, we leverage the observation that there is often an abundance of similar-looking spreadsheets in the same organization, which not only have similar data but also share similar computation logic encoded as formulas. We develop an Auto-Formula system that can accurately predict the formulas that users want to author in a target spreadsheet cell, by learning and adapting formulas that already exist in similar spreadsheets, using contrastive-learning techniques inspired by "similar-face recognition" from computer vision. Extensive evaluations on over 2K test formulas extracted from real enterprise spreadsheets show the effectiveness of Auto-Formula over alternatives. Our benchmark data is available at https://github.com/microsoft/Auto-Formula to facilitate future research.
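
    The abstract gives no implementation details, so the following is only a minimal sketch of the general idea it describes: embed spreadsheets as vectors, train so that similar sheets lie close together, and reuse a formula from the nearest known sheet. The toy embeddings, the margin value, and the names `cosine`, `triplet_loss`, and `recommend_formula` are all assumptions made for illustration and are not taken from the Auto-Formula system.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on cosine distance (margin is an assumed value).

    The loss is zero once the anchor is closer to the positive than to the
    negative by at least `margin`; otherwise it is positive.
    """
    d_pos = 1.0 - cosine(anchor, positive)
    d_neg = 1.0 - cosine(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

def recommend_formula(target_vec, corpus):
    """Return the formula attached to the most similar embedding in `corpus`.

    `corpus` is a list of (embedding, formula) pairs; this is a hypothetical
    structure chosen for the sketch.
    """
    best = max(corpus, key=lambda item: cosine(target_vec, item[0]))
    return best[1]

# Toy data: hand-written vectors standing in for learned sheet embeddings.
corpus = [
    ([0.9, 0.1, 0.0], "=SUM(B2:B10)"),
    ([0.0, 0.2, 0.9], "=AVERAGE(C2:C10)"),
]
target = [0.8, 0.2, 0.1]

print(recommend_formula(target, corpus))
```

    In this toy setup the target vector is closest to the first corpus entry, so its formula is the one returned. A real system would also need to adapt cell references to the target sheet, which this sketch does not attempt.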

      Published In

      Proceedings of the ACM on Management of Data  Volume 2, Issue 3
      SIGMOD
      June 2024
      1953 pages
      EISSN:2836-6573
      DOI:10.1145/3670010
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 30 May 2024
      Published in PACMMOD Volume 2, Issue 3

      Author Tags

      1. contextual recommendation
      2. contrastive learning
      3. formula prediction
      4. similar spreadsheets
      5. similar tables
      6. spreadsheet tables
      7. table embedding
      8. table representation learning

      Qualifiers

      • Research-article
