Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
Skip to main content

Deducing Matching Strings for Real-World Regular Expressions

  • Conference paper
  • First Online:
Dependable Software Engineering. Theories, Tools, and Applications (SETTA 2023)

Abstract

Real-world regular expressions (regexes for short) have a wide range of applications in software. However, the support for regexes in test generation is insufficient. For example, existing works lack support for some important features such as lookbehind, are not resilient to subtle semantic differences (such as partial/full matching), fall short of Unicode support, leading to loss of test coverage or missed bugs. To address these challenges, in this paper, we propose a novel semantic model for comprehensively modeling the extended features in regexes, with an awareness of different matching semantics (i.e. partial/full matching) and matching precedence (i.e. greedy/lazy matching). To the best of our knowledge, this is the first attempt to consider partial/full matching semantics in modeling and to support lookbehind. Leveraging this model we then develop PowerGen, a tool for deducing matching strings for regexes, which randomly generates matching strings from the input regex effectively. We evaluate PowerGen against nine related state-of-the-art tools. The evaluation results show the high effectiveness and efficiency of PowerGen.

Y. Yan and W. Su—These authors contributed equally.

Zixuan Chen is currently employed at Kuaishou Technology.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 59.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 79.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Similar content being viewed by others

Notes

  1. 1.

    To facilitate error identification, we simplify lengthy regexes by isolating the problematic fragment.

  2. 2.

    https://cdn.jsdelivr.net/npm/dataset2023/.

References

  1. Aho, A.V.: Algorithms for finding patterns in strings. In: Handbook of Theoretical Computer Science, Volume A: Algorithms and Complexity, pp. 255–300. Elsevier and MIT Press (1990)

    Google Scholar 

  2. Arcaini, P., Gargantini, A., Riccobene, E.: MUTREX: a mutation-based generator of fault detecting strings for regular expressions. In: ICST Workshops 2017, pp. 87–96 (2017)

    Google Scholar 

  3. Bartoli, A., Lorenzo, A.D., Medvet, E., Tarlao, F.: Inference of regular expressions for text extraction from examples. IEEE Trans. Knowl. Data Eng. 28(5), 1217–1230 (2016)

    Article  Google Scholar 

  4. Berglund, M., Bester, W., van der Merwe, B.: Formalising boost POSIX regular expression matching. In: Fischer, B., Uustalu, T. (eds.) ICTAC 2018. LNCS, vol. 11187, pp. 99–115. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-02508-3_6

    Chapter  Google Scholar 

  5. Berglund, M., van der Merwe, B.: Re-examining regular expressions with backreferences. Theor. Comput. Sci. 940, 66–80 (2023)

    Article  MathSciNet  Google Scholar 

  6. Berglund, M., van der Merwe, B., van Litsenborgh, S.: Regular expressions with lookahead. J. Univers. Comput. Sci. 27(4), 324–340 (2021)

    Article  Google Scholar 

  7. Brzozowski, J.A.: Derivatives of regular expressions. J. ACM 11(4), 481–494 (1964)

    Article  MathSciNet  Google Scholar 

  8. Câmpeanu, C., Salomaa, K., Yu, S.: A formal study of practical regular expressions. Int. J. Found. Comput. Sci. 14(6), 1007–1018 (2003)

    Article  MathSciNet  Google Scholar 

  9. Câmpeanu, C., Santean, N.: On the intersection of regex languages with regular languages. Theor. Comput. Sci. 410(24–25), 2336–2344 (2009)

    Article  MathSciNet  Google Scholar 

  10. Câmpeanu, C., Yu, S.: Pattern expressions and pattern automata. Inf. Process. Lett. 92(6), 267–274 (2004)

    Article  MathSciNet  Google Scholar 

  11. Caron, P., Champarnaud, J.-M., Mignot, L.: Partial derivatives of an extended regular expression. In: Dediu, A.-H., Inenaga, S., Martín-Vide, C. (eds.) LATA 2011. LNCS, vol. 6638, pp. 179–191. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-21254-3_13

    Chapter  Google Scholar 

  12. Chapman, C., Stolee, K.T.: Exploring regular expression usage and context in python. In: ISSTA 2016, pp. 282–293 (2016)

    Google Scholar 

  13. Chapman, C., Wang, P., Stolee, K.T.: Exploring regular expression comprehension. In: ASE 2017, pp. 405–416 (2017)

    Google Scholar 

  14. Chen, T., Flores-Lamas, A., Hague, M., Han, Z., Hu, D., Kan, S., Lin, A.W., Rümmer, P., Wu, Z.: Solving string constraints with regex-dependent functions through transducers with priorities and variables. POPL 6, 1–31 (2022)

    Google Scholar 

  15. Chida, N., Terauchi, T.: On lookaheads in regular expressions with backreferences. In: FSCD 2022. LIPIcs, vol. 228, pp. 15:1–15:18 (2022)

    Google Scholar 

  16. Chris, K.: Regex posix - HaskellWiki. https://wiki.haskell.org/Regex_Posix

  17. D’Antoni, L., Veanes, M.: Automata modulo theories. Commun. ACM 64, 86–95 (2021)

    Article  Google Scholar 

  18. Davis, J.C., Coghlan, C.A., Servant, F., Lee, D.: The impact of regular expression denial of service (ReDoS) in practice: an empirical study at the ecosystem scale. In: ESEC/FSE 2018, pp. 246–256 (2018)

    Google Scholar 

  19. Davis, J.C., IV, L.G.M., Coghlan, C.A., Servant, F., Lee, D.: Why aren’t regular expressions a lingua franca? An empirical study on the re-use and portability of regular expressions. In: ESEC/FSE 2019, pp. 443–454 (2019)

    Google Scholar 

  20. ECMA: ES2018. https://262.ecma-international.org/9.0

  21. Ellul, K., Krawetz, B., Shallit, J.O., Wang, M.W.: Regular expressions: new results and open problems. J. Autom. Lang. Comb. 10(4), 407–437 (2005)

    MathSciNet  Google Scholar 

  22. Fent: Randexp.js. https://github.com/fent/randexp.js

  23. Glushkov, V.M.: The abstract theory of automata. Russ. Math. Surv. 16, 1–53 (1961)

    Article  MathSciNet  Google Scholar 

  24. Hooimeijer, P., Veanes, M.: An evaluation of automata algorithms for string analysis. In: Jhala, R., Schmidt, D. (eds.) VMCAI 2011. LNCS, vol. 6538, pp. 248–262. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-18275-4_18

    Chapter  Google Scholar 

  25. Larson, E., Kirk, A.: Generating evil test strings for regular expressions. In: ICST 2016, pp. 309–319 (2016)

    Google Scholar 

  26. Li, N., Xie, T., Tillmann, N., de Halleux, J., Schulte, W.: Reggae: automated test generation for programs using complex regular expressions. In: ASE 2009, pp. 515–519 (2009)

    Google Scholar 

  27. Liu, X., Jiang, Y., Wu, D.: A lightweight framework for regular expression verification. In: HASE 2019, pp. 1–8 (2019)

    Google Scholar 

  28. Loring, B., Mitchell, D., Kinder, J.: ExpoSE: practical symbolic execution of standalone JavaScript. In: SPIN 2017, pp. 196–199 (2017)

    Google Scholar 

  29. Loring, B., Mitchell, D., Kinder, J.: Sound regular expression semantics for dynamic symbolic execution of javascript. In: PLDI 2019, pp. 425–438 (2019)

    Google Scholar 

  30. Luo, B., Feng, Y., Wang, Z., Huang, S., Yan, R., Zhao, D.: Marrying up regular expressions with neural networks: A case study for spoken language understanding. In: ACL 2018, pp. 2083–2093 (2018)

    Google Scholar 

  31. Michael, L.G., Donohue, J., Davis, J.C., Lee, D., Servant, F.: Regexes are hard: decision-making, difficulties, and risks in programming regular expressions. In: ASE 2019, pp. 415–426 (2019)

    Google Scholar 

  32. Miller, F.P., Vandome, A.F., McBrewster, J.: Apache maven (2010). https://repo1.maven.org/maven2/

  33. Miyazaki, T., Minamide, Y.: Derivatives of regular expressions with lookahead. J. Inf. Process. 27, 422–430 (2019)

    Google Scholar 

  34. de Moura, L., Bjørner, N.: Z3: an efficient SMT solver. In: Ramakrishnan, C.R., Rehof, J. (eds.) TACAS 2008. LNCS, vol. 4963, pp. 337–340. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-78800-3_24

    Chapter  Google Scholar 

  35. Møller, A.: dk.brics.automaton. https://www.brics.dk/automaton/

  36. npm Inc: npm. https://www.npmjs.com/

  37. O’Connor, C.: Crdoconnor/xeger. https://github.com/crdoconnor/xeger

  38. Okui, S., Suzuki, T.: Disambiguation in regular expression matching via position automata with augmented transitions. In: Domaratzki, M., Salomaa, K. (eds.) CIAA 2010. LNCS, vol. 6482, pp. 231–240. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-18098-9_25

    Chapter  Google Scholar 

  39. Python Software Foundation: Python package index - pypi. https://pypi.org/

  40. Rampersad, N., Shallit, J.: Detecting patterns in finite regular and context-free languages. Inf. Process. Lett. 110(3), 108–112 (2010)

    Article  MathSciNet  Google Scholar 

  41. Salomaa, K., Yu, S.: NFA to DFA transformation for finite languages over arbitrary alphabets. J. Autom. Lang. Comb. 2(3), 177–186 (1998)

    MathSciNet  Google Scholar 

  42. Saxena, P., Akhawe, D., Hanna, S., Mao, F., McCamant, S., Song, D.: A symbolic execution framework for JavaScript. In: S &P 2010, pp. 513–528 (2010)

    Google Scholar 

  43. Shen, Y., Jiang, Y., Xu, C., Yu, P., Ma, X., Lu, J.: ReScue: crafting regular expression DoS attacks. In: ASE 2018, pp. 225–235 (2018)

    Google Scholar 

  44. Spishak, E., Dietl, W., Ernst, M.D.: A type system for regular expressions. In: FTfJP 2012, pp. 20–26 (2012)

    Google Scholar 

  45. Stanford, C., Veanes, M., Bjørner, N.: Symbolic Boolean derivatives for efficiently solving extended regular expression constraints. In: PLDI 2021, pp. 620–635 (2021)

    Google Scholar 

  46. Stockmeyer, L.J.: The complexity of decision problems in automata theory and logic. Ph.D. thesis, Massachusetts Institute of Technology, USA (1974)

    Google Scholar 

  47. Su, W., Chen, H., Li, R., Chen, Z.: Modeling regex operators for solving regex crossword puzzles. In: Hermanns, H., et al. (eds.) SETTA 2023, LNCS, vol. 14464, pp. 206–225. Springer, Cham (2023)

    Google Scholar 

  48. Sulzmann, M., Lu, K.Z.M.: POSIX regular expression parsing with derivatives. In: Codish, M., Sumii, E. (eds.) FLOPS 2014. LNCS, vol. 8475, pp. 203–220. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-07151-0_13

    Chapter  Google Scholar 

  49. Tauber, A.: EXREX. https://github.com/asciimoo/exrex

  50. Trinh, M., Chu, D., Jaffar, J.: S3: a symbolic string solver for vulnerability detection in web applications. In: CCS 2014, pp. 1232–1243 (2014)

    Google Scholar 

  51. Unicode: Unicode 15.0.0. https://unicode.org/versions/Unicode15.0.0/

  52. Veanes, M., de Halleux, P., Tillmann, N.: Rex: symbolic regular expression explorer. In: ICST 2010, pp. 498–507 (2010)

    Google Scholar 

  53. Wang, P., Stolee, K.T.: How well are regular expressions tested in the wild? In: ESEC/FSE 2018, pp. 668–678 (2018)

    Google Scholar 

  54. Youssef, M.: Generex. https://github.com/mifmif/Generex

  55. Yu, S.: Regular languages. In: Handbook of Formal Languages, Vol. 1: Word, Language, Grammar, pp. 41–110 (1997)

    Google Scholar 

Download references

Acknowledgements

The authors would like to thank the anonymous reviewers for their helpful comments and suggestions. Work supported by the Natural Science Foundation of Beijing, China (Grant No. 4232038) and the National Natural Science Foundation of China (Grant No. 62372439).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Haiming Chen .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Yan, Y. et al. (2024). Deducing Matching Strings for Real-World Regular Expressions. In: Hermanns, H., Sun, J., Bu, L. (eds) Dependable Software Engineering. Theories, Tools, and Applications. SETTA 2023. Lecture Notes in Computer Science, vol 14464. Springer, Singapore. https://doi.org/10.1007/978-981-99-8664-4_19

Download citation

  • DOI: https://doi.org/10.1007/978-981-99-8664-4_19

  • Published:

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-8663-7

  • Online ISBN: 978-981-99-8664-4

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics