Multi-modal synthesis of regular expressions

Authors:

Qiaochu Chen, Xinyu Wang, Xi Ye, Greg Durrett, Isil Dillig

PLDI 2020: Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation

Pages 487 - 502

https://doi.org/10.1145/3385412.3385988

Published: 11 June 2020

Abstract

In this paper, we propose a multi-modal synthesis technique for automatically constructing regular expressions (regexes) from a combination of examples and natural language. Using multiple modalities is useful in this context because natural language alone is often highly ambiguous, whereas examples in isolation are often insufficient to convey user intent. Our proposed technique first parses the English description into a so-called hierarchical sketch that guides our programming-by-example (PBE) engine. Because the hierarchical sketch captures crucial hints, the PBE engine can use this information both to prioritize the search and to make useful deductions that prune the search space.

We have implemented the proposed technique in a tool called Regel and evaluated it on over three hundred regexes. Our evaluation shows that Regel achieves 80% accuracy, whereas the NLP-only and PBE-only baselines achieve 43% and 26%, respectively. We also compare our PBE engine against an adaptation of AlphaRegex, a state-of-the-art regex synthesis tool, and show that our engine is an order of magnitude faster, even when the search algorithm of AlphaRegex is adapted to leverage the sketch. Finally, we conduct a user study involving 20 participants and show that users are twice as likely to come up with the desired regex when using Regel as when working without it.
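To make the idea of combining a natural-language hint with examples concrete, the following is a minimal Python sketch. It is not Regel's algorithm: the flat list `hints` is a hand-written, hypothetical stand-in for the hierarchical sketch that the paper derives from the English description, and the function names `consistent` and `synthesize` are illustrative assumptions. The search simply enumerates concatenations of the hinted components, smallest first, and returns the first candidate that accepts every positive example and rejects every negative one.

```python
import itertools
import re


def consistent(pattern, positives, negatives):
    """Return True if pattern fully matches all positives and no negatives."""
    compiled = re.compile(pattern)
    return (all(compiled.fullmatch(s) for s in positives)
            and not any(compiled.fullmatch(s) for s in negatives))


def synthesize(hint_components, positives, negatives, max_parts=3):
    """Enumerate concatenations of hinted components, smallest first.

    hint_components is a hypothetical, simplified stand-in for a sketch:
    sub-regexes suggested by the natural-language description.
    """
    # Each hinted component may be used as-is or under a '+' repetition.
    atoms = []
    for component in hint_components:
        atoms.append(component)
        atoms.append(f"(?:{component})+")

    for size in range(1, max_parts + 1):
        for combo in itertools.product(atoms, repeat=size):
            candidate = "".join(combo)
            if consistent(candidate, positives, negatives):
                return candidate
    return None  # no candidate within the size bound fits the examples


# Description (assumed): "a capital letter followed by one or more digits"
hints = ["[A-Z]", "[0-9]"]
result = synthesize(hints,
                    positives=["A1", "B42"],
                    negatives=["1A", "a1", "AB"])
print(result)  # [A-Z](?:[0-9])+
```

The hints shrink the search space to combinations of two character classes, while the examples pick out which combination the user meant. The same description without examples would leave the ordering and repetition ambiguous.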

Index Terms

Multi-modal synthesis of regular expressions
1. Software and its engineering
  1. Software creation and management
    1. Software development techniques
      1. Automatic programming
2. Theory of computation
  1. Formal languages and automata theory
    1. Regular languages

Published In

PLDI 2020: Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation

June 2020

1174 pages

ISBN:9781450376136

DOI:10.1145/3385412

General Chair: Alastair F. Donaldson, Imperial College London, UK

Program Chair: Emina Torlak, University of Washington, USA

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGPLAN: ACM Special Interest Group on Programming Languages

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

National Science Foundation

Conference

PLDI '20

Sponsor: SIGPLAN

PLDI '20: 41st ACM SIGPLAN International Conference on Programming Language Design and Implementation

June 15 - 20, 2020

London, UK

Acceptance Rates

Overall Acceptance Rate 406 of 2,067 submissions, 20%

Bibliometrics

Total Citations: 45
Total Downloads: 606
Downloads (Last 12 months): 79
Downloads (Last 6 weeks): 8

Reflects downloads up to 13 Jan 2025

Cited By

Souza M, Ribeiro S, Lima V, Cardoso F, Gomes R. (2024). Combining Regular Expressions and Machine Learning for SQL Injection Detection in Urban Computing. Journal of Internet Services and Applications 15:1 (103-111). Online publication date: 2-Jul-2024.
https://doi.org/10.5753/jisa.2024.3799
Xia J, Liu J, Brown N, Chen Y, Feng Y, Filkov V, Ray B, Zhou M. (2024). Refinement Types for Visualization. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (1871-1881). Online publication date: 27-Oct-2024.
https://dl.acm.org/doi/10.1145/3691620.3695550
Chida N, Terauchi T, Filkov V, Ray B, Zhou M. (2024). Repairing Regex-Dependent String Functions. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (294-305). Online publication date: 27-Oct-2024.
https://dl.acm.org/doi/10.1145/3691620.3695005
Petridis S, Wedin B, Wexler J, Pushkarna M, Donsbach A, Goyal N, Cai C, Terry M. (2024). ConstitutionMaker: Interactively Critiquing Large Language Models by Converting Feedback into Principles. Proceedings of the 29th International Conference on Intelligent User Interfaces (853-868). Online publication date: 18-Mar-2024.
https://dl.acm.org/doi/10.1145/3640543.3645144
Li X, Zhou X, Dong R, Zhang Y, Wang X. (2024). Efficient Bottom-Up Synthesis for Programs with Local Variables. Proceedings of the ACM on Programming Languages 8:POPL (1540-1568). Online publication date: 5-Jan-2024.
https://dl.acm.org/doi/10.1145/3632894
Tariq S, Rana T. (2024). Structure and design of multimodal dataset for automatic regex synthesis methods in Roman Urdu. International Journal of Data Science and Analytics. Online publication date: 23-Jul-2024.
https://doi.org/10.1007/s41060-024-00612-y
Tariq S, Rana T. (2024). Automatic regex synthesis methods for english: a comparative analysis. Knowledge and Information Systems. Online publication date: 3-Oct-2024.
https://doi.org/10.1007/s10115-024-02232-1
Tang Z, Yan Y, Li R, Dong H, Chen H, Gao H. (2024). Enhancing Multi-modal Regular Expression Synthesis via Large Language Models and Semantic Manipulations of Sub-expressions. Dependable Software Engineering. Theories, Tools, and Applications (122-141). Online publication date: 25-Nov-2024.
https://doi.org/10.1007/978-981-96-0602-3_7
Miltner A, Wang Z, Chaudhuri S, Dillig I. (2024). Relational Synthesis of Recursive Programs via Constraint Annotated Tree Automata. Computer Aided Verification (41-63). Online publication date: 26-Jul-2024.
https://doi.org/10.1007/978-3-031-65633-0_3
Nkongolo Wa Nkongolo M. (2023). News Classification and Categorization with Smart Function Sentiment Analysis. International Journal of Intelligent Systems 2023:1. Online publication date: 13-Nov-2023.
https://doi.org/10.1155/2023/1784394
