Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
research-article
Open access

Regex matching with counting-set automata

Published: 13 November 2020 Publication History

Abstract

We propose a solution to the problem of efficient matching regular expressions (regexes) with bounded repetition, such as (ab){1,100}, using deterministic automata. For this, we introduce novel counting-set automata (CsAs), automata with registers that can hold sets of bounded integers and can be manipulated by a limited portfolio of constant-time operations. We present an algorithm that compiles a large sub-class of regexes to deterministic CsAs. This includes (1) a novel Antimirov-style translation of regexes with counting to counting automata (CAs), nondeterministic automata with bounded counters, and (2) our main technical contribution, a determinization of CAs that outputs CsAs. The main advantage of this workflow is that the size of the produced CsAs does not depend on the repetition bounds used in the regex (while the size of the DFA is exponential to them). Our experimental results confirm that deterministic CsAs produced from practical regexes with repetition are indeed vastly smaller than the corresponding DFAs. More importantly, our prototype matcher based on CsA simulation handles practical regexes with repetition regardless of sizes of counter bounds. It easily copes with regexes with repetition where state-of-the-art matchers struggle.

Supplementary Material

Auxiliary Presentation Video (oopsla20main-p480-p-video.mp4)
We propose a solution to the problem of efficient matching regular expressions with bounded repetition using deterministic automata. For this, we introduce novel counting-set automata (CsAs), automata with registers that can hold sets of bounded integers and can be manipulated by a limited portfolio of constant-time operations. The main advantage of this workflow is that the size of the produced CsAs does not depend on the repetition bounds used in the regex. Our experimental results confirm that deterministic CsAs produced from practical regexes with repetition are indeed vastly smaller than the corresponding DFAs. More importantly, our prototype matcher based on CsA simulation handles practical regexes with repetition regardless of sizes of counter bounds. It easily copes with regexes with repetition where state-of-the-art matchers struggle.

References

[1]
Parosh Aziz Abdulla, Pavel Krčál, and Wang Yi. 2008. R-Automata. In CONCUR'08 (LNCS, Vol. 5201 ). Springer, 67-81. https://doi.org/10.1007/978-3-540-85361-9_9
[2]
Cyril Allauzen and Mehryar Mohri. 2006. A Uni ed Construction of the Glushkov, Follow, and Antimirov Automata. In Mathematical Foundations of Computer Science 2006. Springer Berlin Heidelberg, Berlin, Heidelberg, 110-121. https: //doi.org/10.1007/11821069_10
[3]
Valentin Antimirov. 1996. Partial derivatives of regular expressions and nite automaton constructions. Theoretical Computer Science 155, 2 ( 1996 ), 291-319. https://doi.org/10.1016/ 0304-3975 ( 95 ) 00182-4
[4]
Adam Baldwin. 2016. Regular Expression Denial of Service a ecting Express.js. http://web.archive.org/web/20170116160113/ https://medium.com/node-security/ regular-expression-denial-of-service-a ecting-express-js-9c397c164c43
[5]
Sébastien Bardin, Alain Finkel, Jérôme Leroux, and Laure Petrucci. 2008. FAST: acceleration from theory to practice. STTT 10, 5 ( 2008 ), 401-424. https://doi.org/10.1007/s10009-008-0064-3
[6]
Gerard Berry and Ravi Sethi. 1986. From regular expressions to deterministic automata. Theoretical Computer Science 48, 3 ( 1986 ), 117-126. https://doi.org/10.1016/ 0304-3975 ( 86 ) 90088-5
[7]
Jean Berstel and Jean-Éric Pin. 1996. Local languages and the Berry-Sethi algorithm. Theoret. Comput. Sci. 155, 2 ( 1996 ), 439-446. https://doi.org/10.1016/ 0304-3975 ( 95 ) 00104-2
[8]
Henrik Björklund, Wim Martens, and Thomas Timm. 2015. E cient Incremental Evaluation of Succinct Regular Expressions. In CIKM' 15 (ACM). https://doi.org/10.1145/2806416.2806434
[9]
Robert S. Boyer and J. Strother Moore. 1977. A Fast String Searching Algorithm. Commun. ACM 20, 10 (Oct. 1977 ), 762-772. https://doi.org/10.1145/359842.359859
[10]
Anne Brüggemann-Klein and Derick Wood. 1998. One-unambiguous regular languages. Information and Computation 140, 2 ( 1998 ), 229-253. https://doi.org/10.1006/inco. 1997.2695
[11]
Janusz A. Brzozowski. 1964. Derivatives of Regular Expressions. J. ACM 11, 4 ( 1964 ), 481-494. https://doi.org/10.1145/ 321239.321249
[12]
Pascal Caron, Jean-Marc Champarnaud, and Ludovic Mignot. 2011. Partial Derivatives of an Extended Regular Expression. In Language and Automata Theory and Applications. Springer Berlin Heidelberg, Berlin, Heidelberg, 179-191. https: //doi.org/10.1007/978-3-642-21254-3_13
[13]
Jean-Marc Champarnaud and Djelloul Ziadi. 2001. Computing the equation automaton of a regular expression in O (s 2) space and time. In Proceedings of CPM 2001 (LNCS, Vol. 2089 ). Springer, 157-168. https://doi.org/10.1007/3-540-48194-X_15
[14]
Haiming Chen and Ping Lu. 2015. Checking determinism of regular expressions with counting. Information and Computation 241 ( 2015 ), 302-320. https://doi.org/10.1016/j.ic. 2014. 12.001
[15]
Kwang-Ting Cheng and A. S. Krishnakumar. 1993. Automatic Functional Test Generation Using the Extended Finite State Machine Model. In Proceedings of the 30th Design Automation Conference. Dallas, Texas, USA, June 14-18, 1993. ACM Press, 86-91. https://doi.org/10.1145/157485.164585
[16]
Wikipedia contributors. 2019. Regular expression-Wikipedia. https://en.wikipedia.org/w/index.php?title=Regular_expression&%20oldid= 852858998
[17]
Russ Cox. 2010. Regular Expression Matching in the Wild. https://swtch.com/~rsc/regexp/regexp3.html.
[18]
Loris D'Antoni and Margus Veanes. 2020. Automata Modulo Theories. Commun. ACM ( 2020 ).
[19]
James C. Davis. 2019. Rethinking Regex Engines to Address ReDoS. In Proceedings of ESEC/FSE'19 (Tallinn, Estonia) (ESEC/FSE 2019). ACM, New York, NY, USA, 1256-1258. https://doi.org/10.1145/3338906.3342509
[20]
James C. Davis, Christy A. Coghlan, Francisco Servant, and Dongyoon Lee. 2018. The Impact of Regular Expression Denial of Service (ReDoS) in Practice: An Empirical Study at the Ecosystem Scale. In Proceedings of ESEC/FSE'18 ( Lake Buena Vista, FL, USA) ( ESEC/FSE 2018). ACM, New York, NY, USA, 246-256. https://doi.org/10.1145/3236024.3236027
[21]
James C. Davis, Louis G. Michael IV, Christy A. Coghlan, Francisco Servant, and Dongyoon Lee. 2019. Why Aren't Regular Expressions a Lingua Franca? An Empirical Study on the Re-use and Portability of Regular Expressions. In Proceedings of ESEC/FSE'19 (Tallinn, Estonia) (ESEC/FSE 2019). ACM, New York, NY, USA, 1256-1258. https://doi.org/10.1145/3338906. 3338909
[22]
Stack Exchange. 2016. Outage Postmortem. http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016
[23]
Sebastian Fischer, Frank Huch, and Thomas Wilke. 2010. A Play on Regular Expressions: Functional Pearl. SIGPLAN Not. 45, 9 ( 2010 ), 357-368. https://doi.org/10.1145/1863543.1863594
[24]
Wouter Gelade, Marc Gyssens, and Wim Martens. 2012. Regular Expressions with Counting: Weak versus Strong Determinism. SIAM J. Comput. 41, 1 ( 2012 ), 160-190. https://doi.org/10.1137/100814196 Extended version of paper in MFCS' 09.
[25]
Wouter Gelade, Wim Martens, and Frank Neven. 2007. Optimizing schema languages for XML: Numerical constraints and interleaving. In Proceedings of ICDT'07 (LNCS, Vol. 4353 ). Springer, 269-283. https://doi.org/10.1007/11965893_19
[26]
V. M. Glushkov. 1961. The abstract theory of automata. Russian Math. Surveys 16 ( 1961 ), 1-53. https://doi.org/10.1070/ RM1961v016n05ABEH004112
[27]
Google. [n.d.]. RE2. https://github.com/google/re2.
[28]
John Graham-Cumming. 2019. Details of the Cloud are outage on July 2, 2019. https://blog.cloud are. com/details-of-thecloud are-outage-on-july-2-2019/
[29]
Mike Haertel. [n.d.]. why GNU grep is fast. https://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html.
[30]
Dag Hovland. 2009. Regular Expressions with Numerical Constraints and Automata with Counters. In ICTAC (LNCS, Vol. 5684 ). Springer, 231-245. https://doi.org/10.1007/978-3-642-03466-4_15
[31]
Dag Hovland. 2012. The Membership Problem for Regular Expressions with Unordered Concatenation and Numerical Constraints. In Language and Automata Theory and Applications. Springer Berlin Heidelberg, Berlin, Heidelberg, 313-324. https://doi.org/10.1007/978-3-642-28332-1_27
[32]
Lucian Ilie and Sheng Yu. 2003. Follow automata. Information and Computation 186, 1 ( 2003 ), 146-162. https://doi.org/10. 1016/S0890-5401 ( 03 ) 00090-7
[33]
Pekka Kilpeläinen and Rauno Tuhkanen. 2003. Regular Expressions with Numerical Occurrence Indicators-preliminary results. In Proceedings of the Eighth Symposium on Programming Languages and Software Tools, SPLST'03, Kuopio, Finland, June 17-18, 2003. University of Kuopio, Department of Computer Science, 163-173.
[34]
Pekka Kilpeläinen and Rauno Tuhkanen. 2007. One-unambiguity of regular expressions with numeric occurrence indicators. Information and Computation 205, 6 ( 2007 ), 890-916. https://doi.org/10.1016/j.ic. 2006. 12.003
[35]
Sylvain Lombardy and Jacques Sakarovitch. 2005. Derivatives of rational expressions with multiplicity. Theoretical Computer Science 332, 1 ( 2005 ), 141-177. https://doi.org/10.1016/j.tcs. 2004. 10.016
[36]
Microsoft. 2020. https://docs.microsoft.com/en-us/dotnet/api/system.text.regularexpressions.regex.match
[37]
Scott Owens, John Reppy, and Aaron Turon. 2009. Regular-expression Derivatives Re-examined. J. Funct. Program. 19, 2 ( 2009 ), 173-190. https://doi.org/10.1017/S0956796808007090
[38]
Olli Saarikivi, Margus Veanes, Tiki Wan, and Eric Xu. 2019. Symbolic Regex Matcher. In TACAS'2019 (LNCS, Vol. 11427 ), Tomás Vojnar and Lijun Zhang (Eds.). Springer, 372-378. https://doi.org/10.1007/978-3-030-17462-0_24
[39]
Thomas R. Shiple, James H. Kukula, and Rajeev K. Ranjan. 1998. A Comparison of Presburger Engines for EFSM Reachability. In Computer Aided Veri cation, 10th International Conference, CAV ' 98, Vancouver, BC, Canada, June 28-July 2, 1998, Proceedings (Lecture Notes in Computer Science, Vol. 1427 ). Springer, 280-292. https://doi.org/10.1007/BFb0028752
[40]
Michael Sipser. 2006. Introduction to Theory of Computation. Vol. 2. Thomson Course Technology Boston. https://doi.org/10. 1145/230514.571645
[41]
Randy Smith, Cristian Estan, and Somesh Jha. 2008a. XFA: Faster Signature Matching with Extended Automata. In IEEE Symposium on Security and Privacy. IEEE. https://doi.org/10.1109/SP. 2008.14
[42]
Randy Smith, Cristian Estan, Somesh Jha, and Ida Siahaan. 2008b. Fast Signature Matching Using Extended Finite Automaton (XFA). In ICISS'08 (LNCS, Vol. 5352 ). Springer, 158-172. https://doi.org/10.1007/978-3-540-89862-7_15
[43]
Henry Spencer. 1994. Software Solutions in C. Academic Press Professional, Inc., San Diego, CA, USA, Chapter A Regularexpression Matcher, 35-71. http://dl.acm.org/citation.cfm?id= 156626. 184689
[44]
Michael Sperberg-McQueen. [n.d.]. Notes on nite state automata with counters. https://www.w3.org/XML/ 2004 /05/msmcfa.html Accessed: 2018-08-08.
[45]
Ken Thompson. 1968. Programming Techniques: Regular Expression Search Algorithm. Commun. ACM 11, 6 ( June 1968 ), 419-422. https://doi.org/10.1145/363347.363387
[46]
Margus Veanes, Peli de Halleux, and Nikolai Tillmann. 2010. Rex: Symbolic Regular Expression Explorer. In Third International Conference on Software Testing, Veri cation and Validation, ICST 2010, Paris, France, April 7-9, 2010. 498-507. https: //doi.org/10.1109/ICST. 2010.15
[47]
Liu Yang, Rezwana Karim, Vinod Ganapathy, and Randy Smith. 2010. Improving NFA-Based Signature Matching Using Ordered Binary Decision Diagrams. In Recent Advances in Intrusion Detection. Springer Berlin Heidelberg, Berlin, Heidelberg, 58-78. https://doi.org/10.1007/978-3-642-15512-3_4

Cited By

View all
  • (2024)Linear Matching of JavaScript Regular ExpressionsProceedings of the ACM on Programming Languages10.1145/36564318:PLDI(1336-1360)Online publication date: 20-Jun-2024
  • (2024)One Automaton to Rule Them All: Beyond Multiple Regular Expressions ExecutionProceedings of the 2024 IEEE/ACM International Symposium on Code Generation and Optimization10.1109/CGO57630.2024.10444810(193-206)Online publication date: 2-Mar-2024
  • (2024)Algebraic Reasoning Meets Automata in Solving Linear Integer ArithmeticComputer Aided Verification10.1007/978-3-031-65627-9_3(42-67)Online publication date: 24-Jul-2024
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image Proceedings of the ACM on Programming Languages
Proceedings of the ACM on Programming Languages  Volume 4, Issue OOPSLA
November 2020
3108 pages
EISSN:2475-1421
DOI:10.1145/3436718
Issue’s Table of Contents
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 November 2020
Published in PACMPL Volume 4, Issue OOPSLA

Permissions

Request permissions for this article.

Check for updates

Badges

Author Tags

  1. Antimirov's derivatives
  2. ReDos
  3. bounded repetition
  4. counting automata
  5. counting-set automata
  6. determinization
  7. regular expression matching

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)335
  • Downloads (Last 6 weeks)34
Reflects downloads up to 04 Oct 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Linear Matching of JavaScript Regular ExpressionsProceedings of the ACM on Programming Languages10.1145/36564318:PLDI(1336-1360)Online publication date: 20-Jun-2024
  • (2024)One Automaton to Rule Them All: Beyond Multiple Regular Expressions ExecutionProceedings of the 2024 IEEE/ACM International Symposium on Code Generation and Optimization10.1109/CGO57630.2024.10444810(193-206)Online publication date: 2-Mar-2024
  • (2024)Algebraic Reasoning Meets Automata in Solving Linear Integer ArithmeticComputer Aided Verification10.1007/978-3-031-65627-9_3(42-67)Online publication date: 24-Jul-2024
  • (2023)Derivative Based Nonbacktracking Real-World Regex Matching with Backtracking SemanticsProceedings of the ACM on Programming Languages10.1145/35912627:PLDI(1026-1049)Online publication date: 6-Jun-2023
  • (2023)Regular Expression Matching using Bit Vector AutomataProceedings of the ACM on Programming Languages10.1145/35860447:OOPSLA1(492-521)Online publication date: 6-Apr-2023
  • (2023)Improving Developers’ Understanding of Regex Denial of Service Tools through Anti-Patterns and Fix Strategies2023 IEEE Symposium on Security and Privacy (SP)10.1109/SP46215.2023.10179442(1238-1255)Online publication date: May-2023
  • (2023)Effective ReDoS Detection by Principled Vulnerability Modeling and Exploit Generation2023 IEEE Symposium on Security and Privacy (SP)10.1109/SP46215.2023.10179328(2427-2443)Online publication date: May-2023
  • (2023)String Constraints with Regex-Counting and String-Length Solved More EfficientlyDependable Software Engineering. Theories, Tools, and Applications10.1007/978-981-99-8664-4_1(1-20)Online publication date: 27-Nov-2023
  • (2023)Fast Matching of Regular Patterns with Synchronizing CountingFoundations of Software Science and Computation Structures10.1007/978-3-031-30829-1_19(392-412)Online publication date: 21-Apr-2023
  • (2023)Automata with Bounded Repetition in RE2Computer Aided Systems Theory – EUROCAST 202210.1007/978-3-031-25312-6_27(232-239)Online publication date: 10-Feb-2023
  • Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Get Access

Login options

Full Access

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media