Regexe
Python supports regex via the module re. Python also uses the backslash (\) for escape sequences in strings (i.e., you need to write \\ for \, and \\d for \d), but it supports raw strings in the form of r'...', which do not interpret escape sequences - great for writing regex.
>>> import re
# Try substitute with count: re.subn(regexStr, replacementStr, inStr) -> (outStr, count)
>>> re.subn(r'[0-9]+', r'*', 'abc00123xyz456_0')
('abc*xyz*_*', 3)   # Return a tuple of output string and count
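To see where each match sits in the input, a minimal Python sketch along the following lines may help. It uses re.finditer() on the same input string; the variable names in_str and matches are only illustrative.

```python
import re

in_str = 'abc00123xyz456_0'

# re.finditer() yields one match object per non-overlapping match,
# similar in spirit to calling Matcher.find() in a loop in Java.
matches = [(m.group(), m.start(), m.end())
           for m in re.finditer(r'[0-9]+', in_str)]

for text, start, end in matches:
    # end is exclusive, i.e., the index just past the last matched character
    print(f'found "{text}" starting at index {start} and ending at index {end}')
```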
import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class TestRegexNumbers {
   public static void main(String[] args) {

      String inputStr = "abc00123xyz456_0";  // Input String for matching
      String regexStr = "[0-9]+";            // Regex to be matched

      // Step 1: Compile a regex via static method Pattern.compile(), default is case-sensitive
      Pattern pattern = Pattern.compile(regexStr);
      // Pattern.compile(regex, Pattern.CASE_INSENSITIVE);  // for case-insensitive matching

      // Step 2: Allocate a matching engine from the compiled regex pattern,
      //         and bind to the input string
      Matcher matcher = pattern.matcher(inputStr);

      // Step 3: Perform matching and Process the matching results
      // Try Matcher.find(), which finds the next match
      while (matcher.find()) {
         System.out.println("find() found substring \"" + matcher.group()
               + "\" starting at index " + matcher.start()
               + " and ending at index " + matcher.end());
      }

      // Try Matcher.matches(), which tries to match the ENTIRE input (^...$)
      if (matcher.matches()) {
         System.out.println("matches() found substring \"" + matcher.group()
               + "\" starting at index " + matcher.start()
               + " and ending at index " + matcher.end());
      } else {
         System.out.println("matches() found nothing");
      }

      // Try Matcher.lookingAt(), which tries to match from the START of the input (^...)
      if (matcher.lookingAt()) {
         System.out.println("lookingAt() found substring \"" + matcher.group()
               + "\" starting at index " + matcher.start()
               + " and ending at index " + matcher.end());
      } else {
         System.out.println("lookingAt() found nothing");
      }

      // Try Matcher.replaceFirst(), which replaces the first match
      String replacementStr = "**";
      String outputStr = matcher.replaceFirst(replacementStr);  // first match only
      System.out.println(outputStr);

      // Try Matcher.replaceAll(), which replaces all matches
      replacementStr = "++";
      outputStr = matcher.replaceAll(replacementStr);  // all matches
      System.out.println(outputStr);
   }
}
Perl makes extensive use of regular expressions with many built-in syntaxes and operators. In Perl (and JavaScript), a regex is delimited by a pair of forward slashes (default), in the form of /regex/. You can use built-in operators:
m/regex/modifier or /regex/modifier: Match against the regex. m is optional.
s/regex/replacement/modifier: Substitute matched substring(s) by the replacement.
In Perl, you can write the regex in a single-quoted, non-interpolating string '....' to stop Perl from interpreting the backslash (\).
#!/usr/bin/env perl
use strict;
use warnings;

my $inStr = 'abc00123xyz456_0';  # input string
my $regex = '[0-9]+';            # regex pattern string in non-interpolating string

# Try match /regex/modifiers (or m/regex/modifiers)
my @matches = ($inStr =~ /$regex/g);  # Match $inStr with regex with global modifier
                                      # Store all matches in an array
print "@matches\n";                   # Output: 00123 456 0

while ($inStr =~ /$regex/g) {
   # The built-in array variables @- and @+ keep the start and end positions
   # of the matches, where $-[0] and $+[0] is the full match, and
   # $-[n] and $+[n] for back references $1, $2, etc.
   print substr($inStr, $-[0], $+[0] - $-[0]), ', ';  # Output: 00123, 456, 0,
}
print "\n";

# Try substitute s/regex/replacement/modifiers
$inStr =~ s/$regex/**/g;  # with global modifier
print "$inStr\n";         # Output: abc**xyz**_**
In JavaScript (and Perl), a regex is delimited by a pair of forward slashes, in the form of /.../. There are two sets of methods, invoked via a RegExp object or a String object.
<!DOCTYPE html>
<!-- JSRegexNumbers.html -->
<html lang="en">
<head>
<meta charset="utf-8">
<title>JavaScript Example: Regex</title>
<script>
var inStr = "abc123xyz456_7_00";

// Use RegExp.test(inStr) to check if inStr contains the pattern
console.log(/[0-9]+/.test(inStr));  // true

// Use String.search(regex) to check if the string contains the pattern
// Returns the start position of the matched substring or -1 if there is no match
console.log(inStr.search(/[0-9]+/));  // 3

// Use String.match() or RegExp.exec() to find the matched substring,
// back references, and string index
console.log(inStr.match(/[0-9]+/));  // ["123", input:"abc123xyz456_7_00", index:3, length:"1"]
console.log(/[0-9]+/.exec(inStr));   // ["123", input:"abc123xyz456_7_00", index:3, length:"1"]

// With g (global) option
console.log(inStr.match(/[0-9]+/g));  // ["123", "456", "7", "00", length:4]

// RegExp.exec() with g flag can be issued repeatedly.
// Search resumes after the last-found position (maintained in property RegExp.lastIndex).
var pattern = /[0-9]+/g;
var result;
while (result = pattern.exec(inStr)) {
   console.log(result);
   console.log(pattern.lastIndex);
   // ["123"], 6
   // ["456"], 12
   // ["7"], 14
   // ["00"], 17
}

// String.replace(regex, replacement):
console.log(inStr.replace(/\d+/, "**"));   // abc**xyz456_7_00
console.log(inStr.replace(/\d+/g, "**"));  // abc**xyz**_**_**
</script>
</head>
<body>
  <h1>Hello,</h1>
</body>
</html>
Exercise: Interpret this regex, which provides another representation of an email address: ^[\w\-\.\+]+\@[a-zA-Z0-9\.\-]+\.[a-zA-Z0-9]{2,4}$.
$ python3
>>> import re
>>> re.findall(r'^(\S+)\s+(\S+)$', 'apple orange')
[('apple', 'orange')]   # A list of tuples if the pattern has more than one back reference
# Back references are kept in \1, \2, \3, etc.
>>> re.sub(r'^(\S+)\s+(\S+)$', r'\2 \1', 'apple orange')   # Prefix r for raw string, which does not interpret escapes
'orange apple'
>>> re.sub(r'^(\S+)\s+(\S+)$', '\\2 \\1', 'apple orange')  # Need to write \\ for \ in a regular string
'orange apple'
import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class TestRegexSwapWords {
   public static void main(String[] args) {
      String inputStr = "apple orange";
      String regexStr = "^(\\S+)\\s+(\\S+)$";  // Regex pattern to be matched
      String replacementStr = "$2 $1";         // Replacement pattern with back references

      // Step 1: Allocate a Pattern object to compile a regex
      Pattern pattern = Pattern.compile(regexStr);

      // Step 2: Allocate a Matcher object from the Pattern, and provide the input
      Matcher matcher = pattern.matcher(inputStr);

      // Step 3: Perform the matching and process the matching result
      String outputStr = matcher.replaceFirst(replacementStr);  // first match only
      System.out.println(outputStr);  // Output: orange apple
   }
}
A regex consists of a sequence of characters, metacharacters (such as ., \d, \D, \s, \S, \w, \W) and operators (such as +, *, ?, |, ^). They are constructed by combining many smaller sub-expressions.
Non-alphanumeric characters without special meaning in regex also match themselves. For example, = matches "="; @ matches "@".
Escape Sequences
The characters listed above have special meanings in regex. To match these characters literally, we need to prepend them with a backslash (\), known as an escape sequence. For example, \+ matches "+"; \[ matches "["; and \. matches ".".
Regex also recognizes common escape sequences such as \n for newline, \t for tab, \r for carriage-return, \nnn for an octal number of up to 3 digits, \xhh for a two-digit hex code, \uhhhh for a 4-digit Unicode code point, and \Uhhhhhhhh for an 8-digit Unicode code point (the exact set supported varies among languages).
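The escaping rules above can be tried out with a short Python sketch. The sample strings and variable names are only illustrative; re.escape() is the standard-library helper that escapes every special character in an arbitrary string.

```python
import re

# A backslash turns a special character into a literal one.
plus_hits = re.findall(r'\+', '1+2+3')   # literal "+"
dot_hits = re.findall(r'\.', 'a.b.c')    # literal "."

# Common escape sequences are understood by the regex engine itself,
# even inside a raw string.
tab_hit = re.search(r'\t', 'a\tb')       # \t matches a tab character
hex_hit = re.search(r'\x41', 'A')        # \x41 is the hex code of "A"

# re.escape() builds a pattern that matches the given text literally.
literal = re.escape('a.b[0]')
exact = re.fullmatch(literal, 'a.b[0]')  # matches
loose = re.fullmatch(literal, 'axb[0]')  # no match: "." is no longer a wildcard

print(plus_hits, dot_hits, bool(tab_hit), bool(hex_hit), bool(exact), bool(loose))
```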
Sub-Expressions
A regex is constructed by combining many smaller sub-expressions or atoms. For example, the regex Friday matches the string "Friday". The matching, by default, is case-sensitive, but can be set to case-insensitive via modifier.
Instead of listing all characters, you could use a range expression inside the bracket. A range expression consists of two characters separated by a hyphen (-). It matches any single character that sorts between the two characters, inclusive. For example, [a-d] is the same as [abcd]. You could include a caret (^) in front of the range to invert the matching. For example, [^a-d] is equivalent to [^abcd].
Most of the special regex characters lose their meaning inside bracket list, and can be used as they are; except ^, -, ] or \.
To include a ], place it first in the list, or use escape \].
To include a ^, place it anywhere but first, or use escape \^.
To include a - place it last, or use escape \-.
To include a \, use escape \\.
No escape is needed for the other characters such as ., +, *, ?, (, ), {, }, etc., inside the bracket list.
You can also include metacharacters (to be explained in the next section), such as \w, \W, \d, \D, \s, \S inside the bracket list.
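The bracket-list rules above can be checked with a small Python sketch. The test strings and variable names are made up for illustration, and the placement rules are shown as Python's re module treats them; other engines may differ in detail.

```python
import re

# Ranges and inverted ranges
in_range = re.findall(r'[a-d]', 'abcdef')       # same as [abcd]
out_of_range = re.findall(r'[^a-d]', 'abcdef')  # same as [^abcd]

# The string below is: a-b]c^d\e.f
text = 'a-b]c^d\\e.f'

# Placement: ] first, ^ anywhere but first, - last
by_position = re.findall(r'[]^-]', text)

# Escaping: \], \^, \- and \\
by_escape = re.findall(r'[\]\^\-\\]', text)

# Other special characters need no escape inside the bracket list
plain = re.findall(r'[.+*?(){}]', 'f(x)=a.b+c*d?{e}')

# Metacharacters such as \d and \s can be used inside the bracket list
digits_or_spaces = re.findall(r'[\d\s]', 'a1 b2')

print(in_range, out_of_range, by_position, by_escape, plain, digits_or_spaces)
```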
For example, [[:alnum:]] means [0-9A-Za-z]. (Note that the square brackets in these class names are part of the symbolic names, and must be included in addition to the square brackets delimiting the bracket list.)
Examples:
Take note that in many programming languages (C, Java, Python), the backslash (\) is also used for escape sequences in strings, e.g., "\n" for newline, "\t" for tab, and you also need to write "\\" for \. Consequently, to write the regex pattern \\ (which matches one \) in these languages, you need to write "\\\\" (two levels of escape!!!). Similarly, you need to write "\\d" for the regex metacharacter \d. This is cumbersome and error-prone!!!
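The two levels of escape can be made concrete with a short Python sketch that compares a regular string with a raw string. The sample path and variable names are hypothetical.

```python
import re

path = 'C:\\temp'        # the 7-character string C:\temp

regular = '\\\\'         # regular string: 4 backslashes typed, regex \\ results
raw = r'\\'              # raw string: the regex \\ is written as-is
same_pattern = (regular == raw)

# Both forms match the single backslash in the path
backslashes = re.findall(raw, path)

# Likewise, "\\d+" in a regular string equals r"\d+" in a raw string
same_digits = ('\\d+' == r'\d+')

print(len(path), same_pattern, backslashes, same_digits)
```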
For example: The regex xy{2,4} accepts "xyy", "xyyy" and "xyyyy".
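As a quick check of the xy{2,4} example, the following Python sketch tests a few candidate strings against the whole pattern with fullmatch(). The list of candidates is only illustrative.

```python
import re

pattern = re.compile(r'xy{2,4}')   # {2,4} applies to y only, not to xy

candidates = ['xy', 'xyy', 'xyyy', 'xyyyy', 'xyyyyy']

# fullmatch() requires the ENTIRE string to match the pattern
accepted = [s for s in candidates if pattern.fullmatch(s)]

print(accepted)
```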
2.9 Modifiers
You can apply modifiers to a regex to tailor its behavior, such as global, case-insensitive, multiline, etc. The ways to apply modifiers differ among languages.
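In Perl, you can attach modifiers after a regex, in the form of /.../modifiers. For example, /regex/g applies the global modifier and /regex/i requests case-insensitive matching.
In Java, you apply modifiers when compiling the regex Pattern. For example, Pattern.compile(regex, Pattern.CASE_INSENSITIVE) requests case-insensitive matching.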
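For comparison, Python passes modifiers as flags to the functions of module re, or embeds them in the pattern itself. The sketch below is a minimal illustration with a made-up input string.

```python
import re

text = 'Friday\nfriday\nFRIDAY'

# Default matching is case-sensitive
sensitive = re.findall(r'friday', text)

# Case-insensitive via a flag argument ...
by_flag = re.findall(r'friday', text, re.IGNORECASE)

# ... or via an inline modifier at the start of the pattern
inline = re.findall(r'(?i)friday', text)

# Flags can be combined with |; MULTILINE lets ^ and $ match at each line
per_line = re.findall(r'^f\w+$', text, re.IGNORECASE | re.MULTILINE)

print(sensitive, by_flag, inline, per_line)
```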
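Lazy Quantifiers *?, +?, ??, {m,n}?, {m,}?: You can put an extra ? after the repetition operators to curb their greediness (i.e., stop at the shortest match). For example, against "<b>x</b>", the greedy <.+> matches the whole string, while the lazy <.+?> matches only "<b>".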
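The difference between greedy and lazy matching can be seen in a short Python sketch. The HTML-like input is only a made-up sample.

```python
import re

html = '<b>bold</b> and <i>italic</i>'

# Greedy: .+ grabs as much as possible, so one match spans the whole string
greedy = re.findall(r'<.+>', html)

# Lazy: .+? stops at the first > that completes a match
lazy = re.findall(r'<.+?>', html)

print(greedy)
print(lazy)
```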
Backtracking : If a regex reaches a state where a match cannot be completed, it backtracks by unwinding one character from the greedy match. For example, if the regex z*zzz is matched against the string "zzzz", the z* first matches "zzzz"; unwinds to
match "zzz"; unwinds to match "zz"; and finally unwinds to match "z", such that the rest of the patterns can find a match.
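The effect of backtracking on the z*zzz example can be observed by wrapping z* in a capturing group, as in this minimal Python sketch. The group is added only to expose what z* ends up holding.

```python
import re

# (z*) is greedy and first grabs "zzzz", but must give back characters
# one at a time until the trailing zzz can match.
m = re.fullmatch(r'(z*)zzz', 'zzzz')
kept_by_star = m.group(1)

# With exactly three z's, z* has to give back everything.
m3 = re.fullmatch(r'(z*)zzz', 'zzz')
kept_by_star_3 = m3.group(1)

print(repr(kept_by_star), repr(kept_by_star_3))
```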
Possessive Quantifiers *+, ++, ?+, {m,n}+, {m,}+: You can put an extra + after the repetition operators to disable backtracking, even if this may result in a match failure. E.g., z++z will not match "zzzz". This feature might not be supported in some languages.
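Python's re module accepts the possessive syntax only from version 3.11 onwards. As a sketch that also runs on older versions, possessive behaviour can be emulated with a lookahead plus a back-reference, relying on the assumption that the engine does not backtrack into a lookahead once it has succeeded. This emulation is a swapped-in technique for illustration, not the possessive syntax itself.

```python
import re

text = 'zzzz'

# Ordinary greedy z+ backtracks by one character so the final z can match.
greedy = re.search(r'z+z', text)

# Emulated z++z: the lookahead captures the greedy run of z's, \1 then
# consumes exactly that run, and nothing is given back for the final z.
possessive_like = re.search(r'(?=(z+))\1z', text)

print(greedy.group() if greedy else None)
print(possessive_like)

# On Python 3.11+, the native form would be re.search(r'z++z', text).
```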
\b and \B: The \b matches the boundary of a word (i.e., start-of-word or end-of-word); and \B matches the inverse of \b, or non-word-boundary. For example, \bcat\b matches "cat" in "a cat." but not in "concat".
\< and \>: The \< and \> match the start-of-word and end-of-word, respectively (compared with \b, which can match both the start and end of a word).
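A brief Python sketch of the word-boundary anchors follows. Python's re module provides \b and \B but not \< and \>, so only the former are shown; the sample sentence is made up.

```python
import re

text = 'cat concat category cat.'

# \b on both sides: "cat" as a whole word only
whole_words = re.findall(r'\bcat\b', text)

# \b at the start only: words that begin with "cat"
starts_with = re.findall(r'\bcat\w*', text)

# \B at the start: "cat" that is NOT at the start of a word (as in "concat")
inside_word = re.findall(r'\Bcat\b', text)

print(whole_words, starts_with, inside_word)
```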
\A and \Z: The \A matches the start of the input. The \Z matches the end of the input.
They differ from ^ and $ when matching input with multiple lines in multiline mode: ^ matches at the start of the string and after each line break, while \A matches only at the start of the string; $ matches at the end of the string and before each line break, while \Z matches only at the end of the string (in some languages, such as Java and Perl, \Z also matches just before a final line break). For example,
$ python3
>>> import re
# Using ^ and $ in multiline mode
>>> p1 = re.compile(r'^.+$', re.MULTILINE)  # . for any character except newline
>>> p1.findall('testing\ntesting')
['testing', 'testing']
>>> p1.findall('testing\ntesting\n')
['testing', 'testing']
# ^ matches start-of-input or after each line break at start-of-line
# $ matches end-of-input or before line break at end-of-line
# newlines are NOT included in the matches
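To contrast ^ and $ with \A and \Z directly, a minimal Python sketch on a made-up three-line input could look like this. All four patterns are run in multiline mode so that only the anchors differ.

```python
import re

text = 'first\nsecond\nthird'

# In multiline mode, ^ matches at the start of every line ...
caret = re.findall(r'^\w+', text, re.MULTILINE)
# ... while \A still matches only at the start of the input.
start_only = re.findall(r'\A\w+', text, re.MULTILINE)

# Likewise, $ matches at the end of every line ...
dollar = re.findall(r'\w+$', text, re.MULTILINE)
# ... while \Z matches only at the end of the input.
end_only = re.findall(r'\w+\Z', text, re.MULTILINE)

print(caret, start_only, dollar, end_only)
```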
2.12 Capturing Matches via Parenthesized Back-References & Matched Variables $1, $2, ...
Parentheses ( ) serve two purposes in regex:
1. Firstly, parentheses ( ) can be used to group sub-expressions for overriding the precedence or applying a repetition operator. For example, (abc)+ (accepts abc, abcabc, abcabcabc, ...) is different from abc+ (accepts abc, abcc, abccc, ...).
2. Secondly, parentheses are used to provide the so called back-references. A back-reference contains the matched substring. For examples, the regex (\S+) creates one back-reference (\S+), which contains the first word (consecutive non-spaces) of the
input string; the regex (\S+)\s+(\S+) creates two back-references: (\S+) and another (\S+), containing the first two words, separated by one or more spaces \s+.
The back-references are stored in special variables $1, $2, … (or \1, \2, ... in Python), where $1 contains the substring matched by the first pair of parentheses, and so on. For example, (\S+)\s+(\S+) creates two back-references, which match the first two words. The matched words are stored in $1 and $2 (or \1 and \2), respectively.
Back-references are important for manipulating strings. For example, the following Perl expression swaps the first and second words separated by a space:
s/(\S+) (\S+)/$2 $1/; # Swap the first and second words separated by a single space
The (?=pattern) is known as positive lookahead. It performs the match, but does not capture the match, returning only the result: match or no match. It is also called assertion as it does not consume any characters in matching. For example, the following
complex regex is used to match email addresses by AngularJS:
^(?=.{1,254}$)(?=.{1,64}@)[-!#$%&'*+/0-9=?A-Z^_`a-z{|}~]+(\.[-!#$%&'*+/0-9=?A-Z^_`a-z{|}~]+)*@[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?(\.[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?)*$
The first positive lookahead pattern ^(?=.{1,254}$) sets the maximum length to 254 characters. The second positive lookahead (?=.{1,64}@) sets a maximum of 64 characters before the '@' sign for the username.
The (?!pattern) is known as negative lookahead, the inverse of (?=pattern). It matches if pattern is missing. For example, a(?=b) matches 'a' in 'abc' (not consuming 'b'), but not in 'acc'; whereas a(?!b) matches 'a' in 'acc', but not in 'abc'.
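The lookahead examples above can be verified with a short Python sketch. The length-check pattern at the end is a simplified, hypothetical variant of the lookahead idea, with a limit of 8 chosen only for illustration.

```python
import re

# Positive lookahead: 'a' must be followed by 'b', but 'b' is not consumed
pos_hit = re.search(r'a(?=b)', 'abc')
pos_miss = re.search(r'a(?=b)', 'acc')

# Negative lookahead: 'a' must NOT be followed by 'b'
neg_hit = re.search(r'a(?!b)', 'acc')
neg_miss = re.search(r'a(?!b)', 'abc')

# A lookahead used as a length check: at most 8 characters in total
short_ok = re.fullmatch(r'(?=.{1,8}$)\w+', 'short')
too_long = re.fullmatch(r'(?=.{1,8}$)\w+', 'muchtoolong')

print(pos_hit.span(), pos_miss, neg_hit.span(), neg_miss)
print(bool(short_ok), bool(too_long))
```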
[TODO]
[TODO]
Recall that you can use Parenthesized Back-References to capture the matches. To disable capturing, use ?: inside the parentheses in the form of (?:pattern). In other words, ?: disables the creation of a capturing group, so as not to create an
unnecessary capturing group.
Example: [TODO]
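As one possible illustration of the non-capturing group, the following Python sketch compares a capturing and a non-capturing version of the same pattern. The input string and the px/em units are made up for the example.

```python
import re

text = '12px 3em'

# Two capturing groups: findall() returns a tuple per match
both_captured = re.findall(r'(\d+)(px|em)', text)

# (?:...) still groups the alternation but does not capture it,
# so only the number is returned
number_only = re.findall(r'(\d+)(?:px|em)', text)

# A non-capturing group can still take a repetition operator
no_groups = re.search(r'(?:ab)+', 'ababab')

print(both_captured, number_only, no_groups.group(), no_groups.groups())
```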
Conditional (?(Cond)then|else)
[TODO]
2.14 Unicode
The metacharacters \w, \W (word and non-word character), \b, \B (word and non-word boundary) recognize Unicode characters.
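In Python 3, for instance, \w applied to a str pattern is Unicode-aware by default, and the re.ASCII flag restricts it to [a-zA-Z0-9_]. The sketch below uses a made-up sample string, written with \u escapes so that the source stays ASCII-only.

```python
import re

# The string is: naive (with i-diaeresis), cafe (with e-acute), and two kanji
text = 'na\u00efve caf\u00e9 \u6771\u4eac'

# Default: \w matches letters of any script
unicode_words = re.findall(r'\w+', text)

# re.ASCII: \w matches only [a-zA-Z0-9_], so accented letters split the words
ascii_words = re.findall(r'\w+', text, re.ASCII)

print(unicode_words)
print(ascii_words)
```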
[TODO]
Feedback, comments, corrections, and errata can be sent to Chua Hock-Chuan (ehchua@ntu.edu.sg)