Regex Operator
This guide summarizes the syntax elements most commonly used in regular expressions.
- . (Dot): In the default mode, matches any character except a newline. If the DOTALL flag is set, it matches any character, including a newline.
- ^ : Matches the start of the string, and in MULTILINE mode also matches immediately after each newline.
- $ : Matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline. E.g. foo matches both ‘foo’ and ‘foobar’, while the regular expression foo$ matches only ‘foo’. More interestingly, searching for foo.$ in ‘foo1\nfoo2\n’ matches ‘foo2’ normally, but ‘foo1’ in MULTILINE mode; searching for a single $ in ‘foo\n’ will find two (empty) matches: one just before the newline, and one at the end of the string.
- * : Causes the resulting regular expression to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match ‘a’, ‘ab’, or ‘a’ followed by any number of ‘b’s.
- + : Causes the resulting regular expression to match 1 or more repetitions of the preceding RE. ab+ will match ‘a’ followed by any non-zero number of ‘b’s; it will not match just ‘a’.
- ? : Causes the resulting regular expression to match 0 or 1 repetitions of the preceding RE. ab? will match either ‘a’ or ‘ab’.
- *?, +?, ?? : The *, +, and ? quantifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn’t desired; if the regular expression <.*> is matched against ‘<a> b <c>’, it will match the entire string and not just ‘<a>’. Adding ? after the quantifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using the regular expression <.*?> will match only ‘<a>’.
- {m} : Specifies that exactly m copies of the previous regular expression should be matched; fewer matches cause the entire regular expression not to match. For example, a{6} will match exactly six ‘a’ characters, but not five.
- {m,n} : Causes the resulting regular expression to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible. For example, a{3,5} will match from 3 to 5 ‘a’ characters. Omitting m specifies a lower bound of zero, and omitting n specifies an infinite upper bound. As an example, a{4,}b will match aaaab or a thousand ‘a’ characters followed by a b, but not aaab. The comma may not be omitted or the modifier would be confused with the previously described form.
- {m,n}? : Causes the resulting regular expression to match from m to n repetitions of the preceding RE, attempting to match as few repetitions as possible. This is the non-greedy version of the previous quantifier. For example, on the 6-character string ‘aaaaaa’, a{3,5} will match 5 ‘a’ characters, while a{3,5}? will match only 3 characters.
- ‘\’ : Either escapes special characters (permitting you to match characters like ‘*’, ‘?’, and so forth), or signals a special sequence; special sequences are discussed below.
- '|' : Matches either the expression before the '|' or the expression after it. For example, a|b matches 'a' or 'b'.
- []: Used to indicate a set of characters. In a set:
- Characters can be listed individually, e.g. [amk] will match ‘a’, ‘m’, or ‘k’.
- Ranges of characters can be indicated by giving two characters and separating them by a ‘-’; for example, [a-z] will match any lowercase ASCII letter, [0-5][0-9] will match all two-digit numbers from 00 to 59, and [0-9A-Fa-f] will match any hexadecimal digit. If - is escaped (e.g. [a\-z]) or if it’s placed as the first or last character (e.g. [a-]), it will match a literal ‘-’.
- Special characters lose their special meaning inside sets. For example, [(+*)] will match any of the literal characters ‘(‘, ‘+’, ‘*’, or ‘)’.
- Character classes such as \w or \S (defined below) are also accepted inside a set, although the characters they match depend on whether LOCALE or UNICODE mode is in force.
- Characters that are not within a range can be matched by complementing the set. If the first character of the set is ‘^’, all the characters that are not in the set will be matched. For example, [^5] will match any character except ‘5’, and [^^] will match any character except ‘^’. ^ has no special meaning if it’s not the first character in the set.
- To match a literal ‘]’ inside a set, precede it with a backslash, or place it at the beginning of the set. For example, [()[\]{}] and []()[{}] will both match any parenthesis, square bracket, or brace.
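
The behaviours described in the list above can be checked with a short script. This is a minimal sketch that assumes Python's built-in re module, whose MULTILINE flag matches the mode named above; the variable names and sample strings are illustrative only.

```python
import re

# Anchors: by default $ matches only at the end of the string (or just
# before a trailing newline); re.MULTILINE lets it match before every newline.
text = "foo1\nfoo2\n"
default_hit = re.search(r"foo.$", text).group()
multiline_hit = re.search(r"foo.$", text, re.MULTILINE).group()
print(default_hit, multiline_hit)  # foo2 foo1

# A bare $ yields two empty matches in "foo\n": before the newline and at the end.
dollar_positions = [m.start() for m in re.finditer(r"$", "foo\n")]
print(dollar_positions)  # [3, 4]

# Greedy vs. non-greedy: .* grabs as much as it can, .*? as little as it can.
tagged = "<a> b <c>"
greedy_hit = re.search(r"<.*>", tagged).group()
lazy_hit = re.search(r"<.*?>", tagged).group()
print(repr(greedy_hit), repr(lazy_hit))  # '<a> b <c>' '<a>'

# Counted repetition: {3,5} prefers five repetitions, {3,5}? prefers three.
six_a = "aaaaaa"
greedy_count = len(re.match(r"a{3,5}", six_a).group())
lazy_count = len(re.match(r"a{3,5}?", six_a).group())
print(greedy_count, lazy_count)  # 5 3

# {4,} needs at least four 'a' characters before the 'b'.
too_short = re.fullmatch(r"a{4,}b", "aaab")
long_enough = re.fullmatch(r"a{4,}b", "aaaab")
print(too_short is None, long_enough is not None)  # True True

# Sets: an escaped '-' is literal, a leading '^' complements the set,
# and ']' is literal when escaped or placed first.
hyphen_hits = re.findall(r"[a\-z]", "a-z m")
not_five = re.findall(r"[^5]", "1525")
brackets_escaped = re.findall(r"[()[\]{}]", "f(x)[0]{}")
brackets_leading = re.findall(r"[]()[{}]", "f(x)[0]{}")
print(hyphen_hits, not_five)  # ['a', '-', 'z'] ['1', '2']
print(brackets_escaped == brackets_leading)  # True
```

Raw strings (the r"..." prefix) are used throughout so that backslashes reach the regex engine unchanged.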