research-article

Cornet: Learning Table Formatting Rules By Example

Authors:

José Cambronero Sánchez,

Carina Negreanu,

Gust Verbruggen

Proceedings of the VLDB Endowment, Volume 16, Issue 10

Pages 2632 - 2644

https://doi.org/10.14778/3603581.3603600

Published: 01 June 2023

Abstract

Spreadsheets are widely used for table manipulation and presentation. Stylistic formatting of these tables is an important property for presentation and analysis. As a result, popular spreadsheet software, such as Excel, supports automatically formatting tables based on rules. Unfortunately, writing such formatting rules can be challenging for users as it requires knowledge of the underlying rule language and data logic. We present Cornet, a system that tackles the novel problem of automatically learning such formatting rules from user-provided formatted cells. Cornet takes inspiration from advances in inductive programming and combines symbolic rule enumeration with a neural ranker to learn conditional formatting rules. To motivate and evaluate our approach, we extracted tables with over 450K unique formatting rules from a corpus of over 1.8M real worksheets. Since we are the first to introduce the task of automatically learning conditional formatting rules, we compare Cornet to a wide range of symbolic and neural baselines adapted from related domains. Our results show that Cornet accurately learns rules across varying setups. Additionally, we show that in some cases Cornet can find rules that are shorter than those written by users and can also discover rules in spreadsheets that users have manually formatted. Furthermore, we present two case studies investigating the generality of our approach by extending Cornet to related data tasks (e.g., filtering) and generalizing to conditional formatting over multiple columns.
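The learning setup described in the abstract can be illustrated with a toy sketch. It is not Cornet's implementation: the rule grammar (four threshold comparisons), the use of an explicitly unformatted cell as a negative example, and the ranking heuristic (a hand-written stand-in for the neural ranker) are all assumptions made for illustration, as are the names `enumerate_rules` and `learn_rules`.

```python
from typing import Callable, Dict, List, Tuple

Rule = Tuple[str, Callable[[float], bool]]  # (description, predicate)


def enumerate_rules(values: List[float]) -> List[Rule]:
    """Enumerate threshold predicates, using the column's own values as constants."""
    rules: List[Rule] = []
    for c in sorted(set(values)):
        # Bind c as a default argument so each lambda keeps its own constant.
        rules.append((f"x > {c}", lambda x, c=c: x > c))
        rules.append((f"x >= {c}", lambda x, c=c: x >= c))
        rules.append((f"x < {c}", lambda x, c=c: x < c))
        rules.append((f"x <= {c}", lambda x, c=c: x <= c))
    return rules


def learn_rules(values: List[float], examples: Dict[int, bool]) -> List[Rule]:
    """Return the candidate rules that agree with every labeled cell, best first."""
    consistent = [
        (desc, pred)
        for desc, pred in enumerate_rules(values)
        if all(pred(values[i]) == label for i, label in examples.items())
    ]
    # Placeholder ranker (assumption): prefer rules that format the fewest cells.
    # A learned ranker would score candidates here instead.
    return sorted(consistent, key=lambda rule: sum(rule[1](v) for v in values))


# One column of values; the user formatted cells 1 and 3 and left cell 2 plain.
values = [12, 45, 7, 88, 63, 30]
examples = {1: True, 3: True, 2: False}

ranked = learn_rules(values, examples)
best_desc, best_pred = ranked[0]
formatted = [i for i, v in enumerate(values) if best_pred(v)]
print(best_desc, formatted)
```

With three labeled cells, several candidate rules remain consistent, which is why a ranking step is needed to pick one to propose to the user.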

Cited By

Singh M, Cambronero J, Gulwani S, Le V, Negreanu C, Nouri E, Raza M, Verbruggen G. (2023). FormaT5: Abstention and Examples for Conditional Table Formatting with Natural Language. Proceedings of the VLDB Endowment 17:3 (497-510). DOI: 10.14778/3632093.3632111. Online publication date: 1-Nov-2023.
https://dl.acm.org/doi/10.14778/3632093.3632111

Published In

Proceedings of the VLDB Endowment, Volume 16, Issue 10
June 2023, 295 pages
ISSN: 2150-8097
Editors: Georgia Koutrika (Athena Research Center), Jun Yang (Duke University)

Publisher

VLDB Endowment

Badges

Artifacts Available / v1.1

Qualifiers

Research-article
