research-article
Open access

Regular Expression Matching using Bit Vector Automata

Published: 06 April 2023

Abstract

Regular expressions (regexes) are ubiquitous in modern software. There is a variety of implementation techniques for regex matching, which can be roughly categorized as (1) relying on backtracking search, or (2) being based on finite-state automata. The implementations that use backtracking are often chosen due to their ability to support advanced pattern-matching constructs. Unfortunately, they are known to suffer from severe performance problems. For some regular expressions, the running time for matching can be exponential in the size of the input text. In order to provide stronger guarantees of matching efficiency, automata-based regex matching is the preferred choice. However, even these regex engines may exhibit severe performance degradation for some patterns. The main reason for this is that regexes used in practice are not exclusively built from the classical regular constructs, i.e., concatenation, nondeterministic choice and Kleene's star. They involve additional constructs that provide succinctness and convenience of expression. The most common such construct is bounded repetition (also called counting), which describes the repetition of the pattern a fixed number of times.
In this paper, we propose a new algorithm for the efficient matching of regular expressions that involve bounded repetition. The algorithm is based on a new model of automata, which we call nondeterministic bit vector automata (NBVA). This model is chosen to be expressively equivalent to nondeterministic counter automata with bounded counters, a very natural model for expressing patterns with bounded repetition. We show that there is a class of regular expressions with bounded repetition that can be matched in time that is independent of the repetition bounds. The algorithm is general enough to cover the vast majority of challenging bounded repetitions that arise in practice. We provide an implementation of our approach in a regex engine, which we call BVA-Scan. We compare BVA-Scan against state-of-the-art regex engines on several real datasets.
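
To make the idea of tracking bounded counters with a bit vector concrete, the following is a minimal sketch and not the NBVA construction or the BVA-Scan implementation from the paper. It assumes a single hypothetical pattern of the form `a.{n}` searched anywhere in the text, and the names `match_ends`, `trigger`, and `active` are illustrative only. Bit k of `active` is set when the trigger character was read exactly k characters ago, so all active counter values advance with one shift.

```python
from typing import List


def match_ends(text: str, n: int, trigger: str = "a") -> List[int]:
    """Return every index i where text[i - n] == trigger.

    This corresponds to the end positions of matches of the
    (hypothetical) pattern  trigger .{n}  searched anywhere in text.
    """
    mask = (1 << (n + 1)) - 1   # keep only counter values 0..n
    accept_bit = 1 << n         # counter value n means the repetition is complete
    active = 0                  # bit k set <=> trigger was seen k characters ago
    ends = []
    for i, ch in enumerate(text):
        # Every live counter increments by one: a single left shift.
        active = (active << 1) & mask
        # Reading the trigger starts a new counter at value 0.
        if ch == trigger:
            active |= 1
        # A counter that has reached n signals a match ending here.
        if active & accept_bit:
            ends.append(i)
    return ends


print(match_ends("xaybzac", 2))
```

In this sketch the per-character work is one shift, one mask, and at most one OR, regardless of how many counter values are live at once. When n + 1 fits in a machine word these are single operations; for larger bounds the vector spans several words, which is the cost that the paper's algorithm is designed to address.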


Published In

Proceedings of the ACM on Programming Languages, Volume 7, Issue OOPSLA1
April 2023
901 pages
EISSN: 2475-1421
DOI: 10.1145/3554309
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. automata theory
2. bounded repetition
3. counter automata
4. regex
