Cornet: Learning Table Formatting Rules By Example

Published: 01 June 2023

Abstract

Spreadsheets are widely used for table manipulation and presentation. Stylistic formatting of these tables is an important property for presentation and analysis. As a result, popular spreadsheet software, such as Excel, supports automatically formatting tables based on rules. Unfortunately, writing such formatting rules can be challenging for users as it requires knowledge of the underlying rule language and data logic. We present Cornet, a system that tackles the novel problem of automatically learning such formatting rules from user-provided formatted cells. Cornet takes inspiration from advances in inductive programming and combines symbolic rule enumeration with a neural ranker to learn conditional formatting rules. To motivate and evaluate our approach, we extracted tables with over 450K unique formatting rules from a corpus of over 1.8M real worksheets. Since we are the first to introduce the task of automatically learning conditional formatting rules, we compare Cornet to a wide range of symbolic and neural baselines adapted from related domains. Our results show that Cornet accurately learns rules across varying setups. Additionally, we show that in some cases Cornet can find rules that are shorter than those written by users and can also discover rules in spreadsheets that users have manually formatted. Furthermore, we present two case studies investigating the generality of our approach by extending Cornet to related data tasks (e.g., filtering) and generalizing to conditional formatting over multiple columns.
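
To make the idea of learning a formatting rule from a few formatted cells concrete, here is a minimal sketch. It is not Cornet's implementation: the abstract describes symbolic rule enumeration combined with a neural ranker, whereas this toy enumerates only single-threshold rules over one numeric column and replaces the neural ranker with a hand-written heuristic. All names (Rule, enumerate_rules, consistent, rank, learn_rule) are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Set


@dataclass(frozen=True)
class Rule:
    name: str                      # readable form, e.g. "value > 76"
    test: Callable[[float], bool]  # predicate applied to a cell value
    size: int                      # crude complexity measure used in ranking


def enumerate_rules(column: Sequence[float]) -> List[Rule]:
    """Enumerate simple comparison rules using the column's own values as constants."""
    rules = []
    for v in sorted(set(column)):
        rules.append(Rule(f"value > {v}", lambda x, v=v: x > v, 1))
        rules.append(Rule(f"value < {v}", lambda x, v=v: x < v, 1))
        rules.append(Rule(f"value == {v}", lambda x, v=v: x == v, 1))
    return rules


def consistent(rule: Rule, column: Sequence[float], formatted: Set[int]) -> bool:
    """A rule is kept only if it fires on every cell the user formatted."""
    return all(rule.test(column[i]) for i in formatted)


def rank(rule: Rule, column: Sequence[float], formatted: Set[int]):
    """Heuristic stand-in for a learned ranker (an assumption, not the paper's model):
    prefer rules that format the fewest cells the user left unformatted,
    then smaller rules, then break ties by name for determinism."""
    extra = sum(1 for i, x in enumerate(column)
                if rule.test(x) and i not in formatted)
    return (extra, rule.size, rule.name)


def learn_rule(column: Sequence[float], formatted: Set[int]) -> Optional[Rule]:
    """Return the best-ranked rule consistent with the formatted cells, if any."""
    candidates = [r for r in enumerate_rules(column)
                  if consistent(r, column, formatted)]
    if not candidates:
        return None
    return min(candidates, key=lambda r: rank(r, column, formatted))


# The user highlighted the cells holding 87 and 91 (indices 1 and 3).
scores = [12, 87, 45, 91, 30, 76]
best = learn_rule(scores, {1, 3})
print(best.name)  # value > 76
```

The ranking step is where a learned model would matter: several rules can agree with the same handful of examples, and the heuristic above simply picks the tightest one, which a real system would not necessarily want.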

Cited By

  • FormaT5: Abstention and Examples for Conditional Table Formatting with Natural Language. 2023. Proceedings of the VLDB Endowment 17, 3 (497-510). DOI: 10.14778/3632093.3632111. Online publication date: 1-Nov-2023.

Published In

Proceedings of the VLDB Endowment, Volume 16, Issue 10
June 2023
295 pages
ISSN: 2150-8097

Publisher

VLDB Endowment

Publication History

Published: 01 June 2023
Published in PVLDB Volume 16, Issue 10

Qualifiers

  • Research-article
