Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
research-article
Open access

Multi-modal program inference: a marriage of pre-trained language models and component-based synthesis

Published: 15 October 2021 Publication History

Abstract

Multi-modal program synthesis refers to the task of synthesizing programs (code) from their specification given in different forms, such as a combination of natural language and examples. Examples provide a precise but incomplete specification, and natural language provides an ambiguous but more "complete" task description. Machine-learned pre-trained models (PTMs) are adept at handling ambiguous natural language, but struggle with generating syntactically and semantically precise code. Program synthesis techniques can generate correct code, often even from incomplete but precise specifications, such as examples, but they are unable to work with the ambiguity of natural languages. We present an approach that combines PTMs with component-based synthesis (CBS): PTMs are used to generate candidates programs from the natural language description of the task, which are then used to guide the CBS procedure to find the program that matches the precise examples-based specification. We use our combination approach to instantiate multi-modal synthesis systems for two programming domains: the domain of regular expressions and the domain of CSS selectors. Our evaluation demonstrates the effectiveness of our domain-agnostic approach in comparison to a state-of-the-art specialized system, and the generality of our approach in providing multi-modal program synthesis from natural language and examples in different programming domains.

Supplementary Material

Auxiliary Presentation Video (oopsla21main-p410-p-video.mp4)
Multi-modal program synthesis refers to the task of synthesizing programs (code) from their specification given in different forms, such as a combination of natural language and examples. Examples provide a precise but incomplete specification, and natural language provides an ambiguous but more "complete" task description. Machine-learned pre-trained models (PTMs) are adept at handling ambiguous natural language, but struggle with generating syntactically and semantically precise code. Program synthesis techniques can generate correct code, often even from incomplete but precise specifications, such as examples, but they are unable to work with the ambiguity of natural languages. We present an approach that combines PTMs with component-based synthesis (CBS): PTMs are used to generate candidates programs from the natural language description of the task, which are then used to guide the CBS procedure to find the program that matches the precise examples-based specification.

References

[1]
R. Agrawal, T. Imielinski, and A. Swami. 1993. Mining Association Rules Between Sets of Items in Large Databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data (SIGMOD ’93, Vol. 22). ACM, New York, NY, USA. 207–216. isbn:0-89791-592-5 https://doi.org/10.1145/170035.170072
[2]
R. Alur, R. Bodik, G. Juniwal, M. M. K. Martin, M. Raghothaman, S. A. Seshia, R. Singh, A. Solar-Lezama, E. Torlak, and A. Udupa. 2013. Syntax-guided synthesis. In 2013 Formal Methods in Computer-Aided Design. 1–8. https://doi.org/10.1109/FMCAD.2013.6679385
[3]
Rajeev Alur, Pavol Cerny, and Arjun Radhakrishna. 2015. Synthesis Through Unification. In Computer Aided Verification (CAV). https://www.microsoft.com/en-us/research/publication/synthesis-through-unification/
[4]
Rajeev Alur, Arjun Radhakrishna, and Abhishek Udupa. 2017. Scaling Enumerative Program Synthesis via Divide and Conquer. In Tools and Algorithms for the Construction and Analysis of Systems, Axel Legay and Tiziana Margaria (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg. 319–336. isbn:978-3-662-54577-5
[5]
Dana Angluin. 1978. On the complexity of minimum inference of regular sets. Information and Control, 39, 3 (1978), 337–350. issn:0019-9958 https://doi.org/10.1016/S0019-9958(78)90683-6
[6]
Dana Angluin. 1987. Learning Regular Sets from Queries and Counterexamples. Inf. Comput., 75, 2 (1987), Nov., 87–106. issn:0890-5401 https://doi.org/10.1016/0890-5401(87)90052-6
[7]
Matej Balog, Alexander Gaunt, Marc Brockschmidt, Sebastian Nowozin, and Daniel Tarlow. 2017. DeepCoder: Learning to Write Programs. In Proceedings of ICLR’17 (proceedings of iclr’17 ed.). https://www.microsoft.com/en-us/research/publication/deepcoder-learning-write-programs/
[8]
Paul E. Black. 1999. Dictionary of Algorithms and Data Structures [online]. https://www.nist.gov/dads/HTML/Levenshtein.html (Accessed, March 2021)
[9]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.). 33, Curran Associates, Inc., 1877–1901. https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
[10]
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Joshua Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating Large Language Models Trained on Code. CoRR, abs/2107.03374 (2021), arxiv:2107.03374. arxiv:2107.03374
[11]
Qiaochu Chen, Xinyu Wang, Xi Ye, Greg Durrett, and Isil Dillig. 2020. Multi-Modal Synthesis of Regular Expressions. In Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2020). Association for Computing Machinery, New York, NY, USA. 487–502. isbn:9781450376136 https://doi.org/10.1145/3385412.3385988
[12]
Yanju Chen, Ruben Martins, and Yu Feng. 2019. Maximal multi-layer specification synthesis. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 602–612.
[13]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota. 4171–4186. https://doi.org/10.18653/v1/N19-1423
[14]
Dana Drachsler-Cohen, Sharon Shoham, and Eran Yahav. 2017. Synthesis with Abstract Examples. In Computer Aided Verification, Rupak Majumdar and Viktor Kunčak (Eds.). Springer International Publishing, Cham. 254–278. isbn:978-3-319-63387-9
[15]
Yu Feng, Ruben Martins, Jacob Van Geffen, Isil Dillig, and Swarat Chaudhuri. 2017. Component-Based Synthesis of Table Consolidation and Transformation Tasks from Examples. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2017). Association for Computing Machinery, New York, NY, USA. 422–436. isbn:9781450349888 https://doi.org/10.1145/3062341.3062351
[16]
Yu Feng, Ruben Martins, Yuepeng Wang, Isil Dillig, and Thomas W. Reps. 2017. Component-based synthesis for complex APIs. In Proceedings of the 44th ACM SIGPLAN Symposium on Principles of Programming Languages. ACM. https://doi.org/10.1145/3009837.3009851
[17]
Sumit Gulwani. 2011. Automating String Processing in Spreadsheets using Input-Output Examples. In PoPL’11, January 26-28, 2011, Austin, Texas, USA. https://www.microsoft.com/en-us/research/publication/automating-string-processing-spreadsheets-using-input-output-examples/
[18]
Sumit Gulwani, Susmit Jha, Ashish Tiwari, and Ramarathnam Venkatesan. 2011. Synthesis of Loop-Free Programs. In Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’11). Association for Computing Machinery, New York, NY, USA. 62–73. isbn:9781450306638 https://doi.org/10.1145/1993498.1993506
[19]
Sumit Gulwani and Mark Marron. 2014. NLyze: Interactive Programming by Natural Language for SpreadSheet Data Analysis and Manipulation. In SIGMOD ’14 Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (sigmod ’14 proceedings of the 2014 acm sigmod international conference on management of data ed.). Association for Computing Machinery, 803–814. https://www.microsoft.com/en-us/research/publication/nlyze-interactive-programming-by-natural-language-for-spreadsheet-data-analysis-and-manipulation/
[20]
Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. 2021. Measuring Coding Challenge Competence With APPS. CoRR, abs/2105.09938 (2021), arxiv:2105.09938. arxiv:2105.09938
[21]
Kangjing Huang, Xiaokang Qiu, Peiyuan Shen, and Yanjun Wang. 2020. Reconciling Enumerative and Deductive Program Synthesis. In Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2020). Association for Computing Machinery, New York, NY, USA. 1159–1174. isbn:9781450376136 https://doi.org/10.1145/3385412.3386027
[22]
Po-Sen Huang, Chenglong Wang, Rishabh Singh, Wen-tau Yih, and Xiaodong He. 2018. Natural Language to Structured Query Generation via Meta-Learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). Association for Computational Linguistics, New Orleans, Louisiana. 732–738. https://doi.org/10.18653/v1/N18-2115
[23]
Susmit Jha, Sumit Gulwani, Sanjit A. Seshia, and Ashish Tiwari. 2010. Oracle-Guided Component-Based Program Synthesis. In ICSE ’10, May 2-8 2010, Cape Town, South Africa (icse ’10, may 2-8 2010, cape town, south africa ed.). https://www.microsoft.com/en-us/research/publication/oracle-guided-component-based-program-synthesis/
[24]
K. Jones. 1972. A statistical interpretation of term specificity and its application in retrieval. J. Documentation, 60 (1972), 493–502. https://doi.org/10.1108/eb026526
[25]
Nate Kushman and Regina Barzilay. 2013. Using Semantic Unification to Generate Regular Expressions from Natural Language. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Atlanta, Georgia. 826–836. https://www.aclweb.org/anthology/N13-1103
[26]
Vu Le, Sumit Gulwani, and Zhendong Su. 2013. SmartSynth: Synthesizing Smartphone Automation Scripts from Natural Language. In MobiSys’13, June 25-28, 2013, Taipei, Taiwan (mobisys’13, june 25-28, 2013, taipei, taiwan ed.). https://www.microsoft.com/en-us/research/publication/smartsynth-synthesizing-smartphone-automation-scripts-natural-language/
[27]
Mina Lee, Sunbeom So, and Hakjoo Oh. 2016. Synthesizing Regular Expressions from Examples for Introductory Automata Assignments. SIGPLAN Not., 52, 3 (2016), Oct., 70–80. issn:0362-1340 https://doi.org/10.1145/3093335.2993244
[28]
Yeting Li, Shuaimin Li, Zhiwu Xu, Jialun Cao, Zixuan Chen, Yun Hu, Haiming Chen, and Shing-Chi Cheung. 2020. TransRegex: Multi-modal Regular Expression Synthesis by Generate-and-Repair. CoRR, abs/2012.15489 (2020), arxiv:2012.15489. arxiv:2012.15489
[29]
Xi Victoria Lin, Chenglong Wang, Luke Zettlemoyer, and Michael D. Ernst. 2018. NL2Bash: A Corpus and Semantic Parser for Natural Language Interface to the Linux Operating System. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association (ELRA), Miyazaki, Japan. https://www.aclweb.org/anthology/L18-1491
[30]
Nicholas Locascio, Karthik Narasimhan, Eduardo DeLeon, Nate Kushman, and Regina Barzilay. 2016. Neural Generation of Regular Expressions from Natural Language with Minimal Domain Knowledge. In EMNLP (emnlp ed.). https://www.microsoft.com/en-us/research/publication/neural-generation-regular-expressions-natural-language-minimal-domain-knowledge/
[31]
Mehdi Manshadi, Daniel Gildea, and James Allen. 2013. Integrating programming by example and natural language programming. In Proceedings of the AAAI Conference on Artificial Intelligence. 27.
[32]
Richard Meyes, Melanie Lu, Constantin Waubert de Puiseau, and Tobias Meisen. 2019. Ablation Studies in Artificial Neural Networks. CoRR, abs/1901.08644 (2019), arxiv:1901.08644. arxiv:1901.08644
[33]
OpenAI. 2021. GPT-3 powers the next generation of apps. https://openai.com/blog/gpt-3-apps/
[34]
Rong Pan, Qinheping Hu, Gaowei Xu, and Loris D’Antoni. 2019. Automatic Repair of Regular Expressions. Proc. ACM Program. Lang., 3, OOPSLA (2019), Article 139, Oct., 29 pages. https://doi.org/10.1145/3360565
[35]
Jian Pei, Jiawei Han, and Runying Mao. 2000. CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets. In ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery. 21–30. http://dblp.uni-trier.de/db/conf/dmkd/dmkd2000.html##PeiHM00
[36]
Nadia Polikarpova, Ivan Kuraj, and Armando Solar-Lezama. 2016. Program Synthesis from Polymorphic Refinement Types. SIGPLAN Not., 51, 6 (2016), June, 522–538. issn:0362-1340 https://doi.org/10.1145/2980983.2908093
[37]
Kia Rahmani, Mohammad Raza, Sumit Gulwani, Vu Le, Daniel Morris, Arjun Radhakrishna, Gustavo Soares, and Ashish Tiwari. 2021. Multi-modal Program Inference: a Marriage of Pre-trainedLanguage Models and Component-based Synthesis. arxiv:2109.02445.
[38]
Mohammad Raza and Sumit Gulwani. 2017. Automated Data Extraction Using Predictive Program Synthesis. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI’17). AAAI Press, 882–890.
[39]
Mohammad Raza and Sumit Gulwani. 2020. Web Data Extraction Using Hybrid Program Synthesis: A Combination of Top-down and Bottom-up Inference. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (SIGMOD ’20). Association for Computing Machinery, New York, NY, USA. 1967–1978. isbn:9781450367356 https://doi.org/10.1145/3318464.3380608
[40]
Mohammad Raza, Sumit Gulwani, and Natasa Milic-Frayling. 2015. Compositional Program Synthesis from Natural Language and Examples. In IJCAI 2015 (ijcai 2015 ed.). https://www.microsoft.com/en-us/research/publication/compositional-program-synthesis-natural-language-examples/
[41]
Armando Solar-Lezama. 2008. Program Synthesis by Sketching. Ph.D. Dissertation. USA. isbn:9781109097450 AAI3353225.
[42]
Saurabh Srivastava, Sumit Gulwani, and Jeffrey S. Foster. 2010. From Program Verification to Program Synthesis. In Proceedings of the 37th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL ’10). Association for Computing Machinery, New York, NY, USA. 313–326. isbn:9781605584799 https://doi.org/10.1145/1706299.1706337
[43]
W3C. 2020. CSS Snapshot 2020 [online]. https://www.w3.org/TR/CSS/
[44]
Bailin Wang, Richard Shin, Xiaodong Liu, Oleksandr Polozov, and Matthew Richardson. 2020. RAT-SQL: Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online. 7567–7578. https://doi.org/10.18653/v1/2020.acl-main.677
[45]
Navid Yaghmazadeh, Yuepeng Wang, Isil Dillig, and Thomas Dillig. 2017. SQLizer: Query Synthesis from Natural Language. Proc. ACM Program. Lang., 1, OOPSLA (2017), Article 63, Oct., 26 pages. https://doi.org/10.1145/3133887
[46]
Zexuan Zhong, Jiaqi Guo, Wei Yang, Jian Peng, Tao Xie, Jian-Guang Lou, Ting Liu, and Dongmei Zhang. 2018. SemRegex: A Semantics-Based Approach for Generating Regular Expressions from Natural Language Specifications. In EMNP’18. ACL. https://www.microsoft.com/en-us/research/publication/semregex-a-semantics-based-approach-for-generating-regular-expressions-from-natural-language-specifications-2/

Cited By

View all
  • (2024)PyDex: Repairing Bugs in Introductory Python Assignments using LLMsProceedings of the ACM on Programming Languages10.1145/36498508:OOPSLA1(1100-1124)Online publication date: 29-Apr-2024
  • (2024)Facilitating Function Application in Code Building Genetic ProgrammingProceedings of the Genetic and Evolutionary Computation Conference10.1145/3638529.3654068(887-895)Online publication date: 14-Jul-2024
  • (2024)Programming-by-Demonstration for Long-Horizon Robot TasksProceedings of the ACM on Programming Languages10.1145/36328608:POPL(512-545)Online publication date: 5-Jan-2024
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image Proceedings of the ACM on Programming Languages
Proceedings of the ACM on Programming Languages  Volume 5, Issue OOPSLA
October 2021
2001 pages
EISSN:2475-1421
DOI:10.1145/3492349
Issue’s Table of Contents
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 15 October 2021
Published in PACMPL Volume 5, Issue OOPSLA

Check for updates

Author Tags

  1. GPT-3
  2. Natural Language Models
  3. Program Inference

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)292
  • Downloads (Last 6 weeks)41
Reflects downloads up to 01 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)PyDex: Repairing Bugs in Introductory Python Assignments using LLMsProceedings of the ACM on Programming Languages10.1145/36498508:OOPSLA1(1100-1124)Online publication date: 29-Apr-2024
  • (2024)Facilitating Function Application in Code Building Genetic ProgrammingProceedings of the Genetic and Evolutionary Computation Conference10.1145/3638529.3654068(887-895)Online publication date: 14-Jul-2024
  • (2024)Programming-by-Demonstration for Long-Horizon Robot TasksProceedings of the ACM on Programming Languages10.1145/36328608:POPL(512-545)Online publication date: 5-Jan-2024
  • (2024)Generating Function Names to Improve Comprehension of Synthesized Programs2024 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC)10.1109/VL/HCC60511.2024.00035(248-259)Online publication date: 2-Sep-2024
  • (2024)LLM-Based Test-Driven Interactive Code Generation: User Study and Empirical EvaluationIEEE Transactions on Software Engineering10.1109/TSE.2024.342897250:9(2254-2268)Online publication date: 22-Jul-2024
  • (2024)Structure and design of multimodal dataset for automatic regex synthesis methods in Roman UrduInternational Journal of Data Science and Analytics10.1007/s41060-024-00612-yOnline publication date: 23-Jul-2024
  • (2024)Automatic regex synthesis methods for english: a comparative analysisKnowledge and Information Systems10.1007/s10115-024-02232-1Online publication date: 3-Oct-2024
  • (2023)FormaT5: Abstention and Examples for Conditional Table Formatting with Natural LanguageProceedings of the VLDB Endowment10.14778/3632093.363211117:3(497-510)Online publication date: 1-Nov-2023
  • (2023)Data Extraction via Semantic Regular Expression SynthesisProceedings of the ACM on Programming Languages10.1145/36228637:OOPSLA2(1848-1877)Online publication date: 16-Oct-2023
  • (2023)FlashFill++: Scaling Programming by Example by Cutting to the ChaseProceedings of the ACM on Programming Languages10.1145/35712267:POPL(952-981)Online publication date: 11-Jan-2023
  • Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Get Access

Login options

Full Access

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media