Backreferences

Backreferences provide a convenient way to find repeating groups of characters. They can be thought of as a shorthand instruction to match the same string again.

For instance, to find repeating adjacent characters such as the two Ls in the word "tall", you would use the regular expression (?<char>\w)\k<char>, which uses the metacharacter \w to find any single word character. The grouping construct (?<char> ) encloses the metacharacter to force the regular expression engine to remember a subexpression match (which in this case will be any single word character) and save it under the name "char". The backreference construct \k<char> causes the engine to compare the current character to the previously matched character stored under "char". The entire regular expression finds a match wherever a word character is the same as the character immediately before it.
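The doubled-character search can be sketched in Python as a rough illustration. Note the assumptions: Python's re module spells the named group (?P<char>...) and the named backreference (?P=char) rather than the .NET forms (?<char>...) and \k<char>, and the helper name find_doubled_chars is hypothetical.

```python
import re

# Python spelling of the .NET pattern (?<char>\w)\k<char>:
# capture one word character as "char", then require the same character again.
DOUBLED_CHAR = re.compile(r"(?P<char>\w)(?P=char)")


def find_doubled_chars(text):
    """Return each pair of identical adjacent word characters found in text."""
    return [m.group(0) for m in DOUBLED_CHAR.finditer(text)]


if __name__ == "__main__":
    print(find_doubled_chars("tall"))        # ['ll']
    print(find_doubled_chars("bookkeeper"))  # ['oo', 'kk', 'ee']
```

Because the backreference compares against whatever the group actually captured, the same pattern finds any doubled word character without listing the characters explicitly.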

To find repeating whole words, you can modify the grouping subexpression to search for a group of word characters preceded by a white-space character instead of a single character. Substitute the subexpression \w+, which matches one or more word characters, for the metacharacter \w, and put the metacharacter \s in front of it to match the white space that precedes the word. This yields the regular expression (?<char>\s\w+)\k<char>, which finds repeating whole words such as " the the" but also matches whenever the captured string merely begins the next word, as in the phrase " the theory."

To verify that the second match is on a word boundary, add the metacharacter \b after the repeat match. The resulting regular expression, (?<char>\s\w+)\k<char>\b, finds only repeating whole words that are preceded by white space.
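The difference the trailing \b makes can be checked with a short Python sketch. It assumes Python's (?P<word>...) and (?P=word) syntax in place of the .NET (?<char>...) and \k<char>; the names LOOSE, STRICT, and repeats are illustrative only.

```python
import re

# Without \b: the repeated text may be the start of a longer word.
LOOSE = re.compile(r"(?P<word>\s\w+)(?P=word)")
# With \b: the repeated text must end on a word boundary.
STRICT = re.compile(r"(?P<word>\s\w+)(?P=word)\b")


def repeats(pattern, text):
    """Return every substring of text matched by the given compiled pattern."""
    return [m.group(0) for m in pattern.finditer(text)]


if __name__ == "__main__":
    print(repeats(LOOSE, "in the theory"))    # [' the the']
    print(repeats(STRICT, "in the theory"))   # []
    print(repeats(STRICT, "in the the end"))  # [' the the']
```

In "in the theory" the loose pattern matches because " the" is followed by " the" at the start of " theory"; the strict pattern rejects that match because "e" followed by "o" is not a word boundary.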

Parsing Backreferences

The expressions \1 through \9 always refer to backreferences, not octal codes. Multidigit expressions \11 and up are considered backreferences if there is a backreference corresponding to that number; otherwise, they are interpreted as octal codes (unless the starting digits are 8 or 9, in which case they are treated as literal "8" and "9"). If a regular expression contains a backreference to an undefined group number, it is considered a parsing error. If the ambiguity is a problem, you can use the \k<n> notation, which is unambiguous and cannot be confused with octal character codes; similarly, hexadecimal codes such as \xdd are unambiguous and cannot be confused with backreferences.
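As a loose analogy in Python, the sketch below combines a named backreference with a hexadecimal escape, the two forms that cannot be confused with each other. It does not reproduce the .NET rules for numbered backreferences and octal codes described above, since Python's re module resolves those by its own rules; the function name is hypothetical.

```python
import re

# (?P=c) can only be a backreference to the group named "c";
# \x41 can only be the character with hexadecimal code 41 ("A").
UNAMBIGUOUS = re.compile(r"(?P<c>\w)(?P=c)\x41")


def is_doubled_then_a(text):
    """True if text is one word character, the same character again, then 'A'."""
    return UNAMBIGUOUS.fullmatch(text) is not None


if __name__ == "__main__":
    print(is_doubled_then_a("ttA"))  # True
    print(is_doubled_then_a("tAA"))  # False
```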

Backreference behavior is slightly different when the ECMAScript option flag is enabled. For more information, see ECMAScript vs. Canonical Matching Behavior.

Matching Backreferences

A backreference refers to the most recent definition of a group (the definition most immediately to the left, when matching left to right). Specifically, when a group makes multiple captures, a backreference refers to the most recent capture. For example, (?<1>a)(?<1>\1b)* matches aababb, with the capturing pattern (a)(ab)(abb). Looping quantifiers do not clear group definitions.
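The "most recent capture" rule can be illustrated with a small Python sketch. Python's re module does not let two groups share a number, so the .NET pattern (?<1>a)(?<1>\1b)* cannot be written there; as a substitute, this example uses a single group inside a loop, which also makes multiple captures. The pattern and function name are assumptions made for illustration.

```python
import re

# Group 1 captures once per loop iteration ("a" or "b" before each comma).
# The backreference \1 must then match whatever the LAST iteration captured.
LAST_CAPTURE = re.compile(r"(?:(a|b),)+\1")


def ends_with_last_item(text):
    """True if the text after the final comma repeats the item just before it."""
    return LAST_CAPTURE.fullmatch(text) is not None


if __name__ == "__main__":
    print(ends_with_last_item("a,b,b"))  # True: last capture is "b"
    print(ends_with_last_item("a,b,a"))  # False: first capture "a" is not used
```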

If a group has not captured any substring, a backreference to that group is undefined and never matches. For example, the expression \1() never matches anything, but the expression ()\1 matches the empty string.
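A comparable behavior can be observed in Python, with one caveat: Python's re module rejects a pattern such as \1() when it is compiled, because the reference precedes the group. The sketch below therefore uses an optional group, (a)?, to produce a group that exists but has captured nothing; the names are hypothetical.

```python
import re

# (a)? may match zero times, leaving group 1 without a capture.
UNSET_GROUP = re.compile(r"(a)?\1")
# () always captures, even though what it captures is the empty string.
EMPTY_GROUP = re.compile(r"()\1")


def matches_whole(pattern, text):
    """True if the compiled pattern matches the entire text."""
    return pattern.fullmatch(text) is not None


if __name__ == "__main__":
    print(matches_whole(UNSET_GROUP, ""))    # False: group 1 never captured
    print(matches_whole(UNSET_GROUP, "aa"))  # True: group 1 captured "a"
    print(matches_whole(EMPTY_GROUP, ""))    # True: group 1 captured ""
```

The contrast between the first and third calls mirrors the distinction in the text: a group that captured the empty string can be backreferenced, while a group that captured nothing cannot.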

See Also

.NET Framework Regular Expressions