Anchors in Regular Expressions
Anchors, or atomic zero-width assertions, specify a position in the string where a match must occur. When you use an anchor in your search expression, the regular expression engine does not advance through the string or consume characters; it looks for a match at the specified position only. For example, ^ specifies that the match must start at the beginning of a line or string. Therefore, the regular expression ^http: matches "http:" only when it occurs at the beginning of a line. The following table lists the anchors supported by regular expressions in the .NET Framework.
| Anchor | Description |
|---|---|
| ^ | The match must occur at the beginning of the string or line. For more information, see Start of String or Line. |
| $ | The match must occur at the end of the string or line, or before \n at the end of the string or line. For more information, see End of String or Line. |
| \A | The match must occur at the beginning of the string only (no multiline support). For more information, see Start of String Only. |
| \Z | The match must occur at the end of the string, or before \n at the end of the string. For more information, see End of String or Before Ending Newline. |
| \z | The match must occur at the end of the string only. For more information, see End of String Only. |
| \G | The match must start at the position where the previous match ended. For more information, see Contiguous Matches. |
| \b | The match must occur on a word boundary. For more information, see Word Boundary. |
| \B | The match must not occur on a word boundary. For more information, see Non-Word Boundary. |
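The zero-width behavior described above can be sketched in a few lines. This example uses Python's standard re module as a stand-in for the .NET Regex class (an assumption of this sketch, as is the made-up sample text); for this simple case Python's ^ behaves the same way.

```python
import re

# Hypothetical sample text containing "http:" twice.
text = "http://example.com and also http://example.org"

# Without an anchor, the pattern is found wherever it occurs.
unanchored = [m.start() for m in re.finditer(r"http:", text)]

# With ^, only an occurrence at the start of the string qualifies.
anchored = [m.start() for m in re.finditer(r"^http:", text)]

# An anchor on its own matches a position, not characters,
# so the resulting match is empty.
caret_only = re.search(r"^", text)

print(unanchored)         # two start offsets
print(anchored)           # a single start offset: 0
print(caret_only.span())  # (0, 0)
```

The empty span returned for ^ alone is what "zero-width" means in practice: the engine checks the position and consumes nothing.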
Start of String or Line
The ^ anchor specifies that the following pattern must begin at the first character position of the string. If you use ^ with the RegexOptions::Multiline option (see Regular Expression Options), the match must occur at the beginning of each line.
The following example uses the ^ anchor in a regular expression that extracts information about the years during which some professional baseball teams existed. The example calls two overloads of the Regex::Matches() method:
The call to the Matches(String, String) overload finds only the first substring in the input string that matches the regular expression pattern.
The call to the Matches(String, String, RegexOptions) overload with the options parameter set to RegexOptions::Multiline finds all five substrings.
The regular expression pattern ^((\w+(\s*)){2,}),\s(\w+\s\w+),(\s\d{4}(-(\d{4}|present))*,*)+ is defined as shown in the following table.
| Pattern | Description |
|---|---|
| ^ | Begin the match at the beginning of the input string (or the beginning of the line if the method is called with the RegexOptions::Multiline option). |
| ((\w+(\s*)){2,}) | Match one or more word characters followed by zero or more white-space characters, at least two times. This is the first capturing group. This expression also defines a second and third capturing group: the second consists of the captured word together with any white space that follows it, and the third consists of the captured white space. |
| ,\s | Match a comma followed by a white-space character. |
| (\w+\s\w+) | Match one or more word characters followed by a white-space character, followed by one or more word characters. This is the fourth capturing group. |
| , | Match a comma. |
| \s\d{4} | Match a white-space character followed by four decimal digits. |
| (-(\d{4}\|present))* | Match zero or more occurrences of a hyphen followed by four decimal digits or the string "present". This is the sixth capturing group. It also includes a seventh capturing group. |
| ,* | Match zero or more commas. |
| (\s\d{4}(-(\d{4}\|present))*,*)+ | Match one or more occurrences of the following: a white-space character, four decimal digits, zero or more occurrences of a hyphen followed by four decimal digits or the string "present", and zero or more commas. This is the fifth capturing group. |
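The effect of the multiline option on ^ can be sketched as follows. The snippet uses Python's re module rather than the .NET Regex class, with re.MULTILINE standing in for RegexOptions::Multiline; both substitutions are assumptions of this sketch, and the five-line input is hypothetical sample data. The helper name team_names is also made up for illustration.

```python
import re

# Hypothetical sample input: five records separated by line feeds.
records = "\n".join([
    "Brooklyn Dodgers, National League, 1911, 1912, 1932-1957",
    "Chicago Cubs, National League, 1903-present",
    "Detroit Tigers, American League, 1901-present",
    "New York Giants, National League, 1885-1957",
    "Washington Senators, American League, 1901-1960",
])

# The pattern discussed in the text, unchanged.
PATTERN = r"^((\w+(\s*)){2,}),\s(\w+\s\w+),(\s\d{4}(-(\d{4}|present))*,*)+"

def team_names(text, flags=0):
    """Return capturing group 1 (the team name) of every match."""
    return [m.group(1) for m in re.finditer(PATTERN, text, flags)]

# Default mode: ^ matches only at the start of the whole string.
default_names = team_names(records)

# Multiline mode: ^ also matches immediately after each line feed.
multiline_names = team_names(records, re.MULTILINE)

print(len(default_names), default_names)
print(len(multiline_names), multiline_names)
```

In default mode only the first record is found; with the multiline flag every record yields a match, because each line start becomes a valid position for ^.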
End of String or Line
The $ anchor specifies that the preceding pattern must occur at the end of the input string, or before \n at the end of the input string.
If you use $ with the RegexOptions::Multiline option, the match can also occur at the end of a line. Note that $ is satisfied immediately before \n but not before \r\n (the combination of carriage return and newline characters, or CR/LF). To handle the CR/LF character combination, include \r?$ in the regular expression pattern.
The following example adds the $ anchor to the regular expression pattern used in the example in the Start of String or Line section. When used with the original input string, which includes five lines of text, the Regex::Matches(String, String) method is unable to find a match, because the end of the first line does not match the $ pattern. When the original input string is split into a string array, the Regex::Matches(String, String) method succeeds in matching each of the five lines. When the Regex::Matches(String, String, RegexOptions) method is called with the options parameter set to RegexOptions::Multiline, no matches are found because the regular expression pattern does not account for the carriage return character (U+000D) that precedes each line feed. However, when the regular expression pattern is modified by replacing $ with \r?$, calling the Regex::Matches(String, String, RegexOptions) method with the options parameter set to RegexOptions::Multiline again finds five matches.
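The interaction between $ and line endings can be illustrated with a much smaller pattern. This is a sketch in Python's re module, used here as a stand-in for the .NET Regex class (an assumption; Python's $ likewise treats only \n as a line ending). The sample strings are hypothetical.

```python
import re

unix_text = "alpha\nbeta\ngamma"         # lines separated by LF
windows_text = "alpha\r\nbeta\r\ngamma"  # lines separated by CR/LF

# Default mode: $ matches at the end of the string only.
last_only = re.findall(r"\w+$", unix_text)

# Default mode also accepts the position before one final line feed.
before_final_lf = re.findall(r"\w+$", "alpha\nbeta\n")

# Multiline mode: $ also matches before every line feed.
every_line = re.findall(r"\w+$", unix_text, re.MULTILINE)

# With CR/LF input, the carriage return sits between the word and the
# line feed, so $ is not satisfied at the end of the first two lines.
crlf_plain = re.findall(r"\w+$", windows_text, re.MULTILINE)

# Allowing an optional carriage return before $ restores the matches.
# The group keeps the \r out of the returned text.
crlf_fixed = re.findall(r"(\w+)\r?$", windows_text, re.MULTILINE)

print(last_only, before_final_lf, every_line, crlf_plain, crlf_fixed)
```

Capturing the word in a group, as in the last call, is one way to keep the carriage return out of the result while still letting \r?$ absorb it.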
Start of String Only
The \A anchor specifies that a match must occur at the beginning of the input string. It is identical to the ^ anchor, except that \A ignores the RegexOptions::Multiline option. Therefore, it can only match the start of the first line in a multiline input string.
The following example is similar to the examples for the ^ and $ anchors. It uses the \A anchor in a regular expression that extracts information about the years during which some professional baseball teams existed. The input string includes five lines. The call to the Regex::Matches(String, String, RegexOptions) method finds only the first substring in the input string that matches the regular expression pattern. As the example shows, the Multiline option has no effect.
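A short sketch makes the contrast with ^ concrete. It uses Python's re module in place of the .NET Regex class and re.MULTILINE in place of RegexOptions::Multiline (both assumptions of this example); the input text is made up.

```python
import re

text = "first line\nsecond line\nthird line"

# ^ honors the multiline flag: one match per line start.
caret_matches = re.findall(r"^\w+", text, re.MULTILINE)

# \A ignores the flag: only the very start of the string qualifies.
start_only_matches = re.findall(r"\A\w+", text, re.MULTILINE)

print(caret_matches)
print(start_only_matches)
```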
End of String or Before Ending Newline
The \Z anchor specifies that a match must occur at the end of the input string, or before \n at the end of the input string. It is identical to the $ anchor, except that \Z ignores the RegexOptions::Multiline option. Therefore, in a multiline string, it can only match at the end of the last line, or immediately before a \n that ends the last line.
Note that \Z is satisfied immediately before a final \n but not before a final \r\n (the CR/LF character combination). To handle CR/LF, include \r?\Z in the regular expression pattern.
The following example uses the \Z anchor in a regular expression that is similar to the example in the Start of String or Line section, which extracts information about the years during which some professional baseball teams existed. The subexpression \r?\Z in the regular expression ^((\w+(\s*)){2,}),\s(\w+\s\w+),(\s\d{4}(-(\d{4}|present))*,*)+\r?\Z matches the end of a string, and also matches a string that ends with \n or \r\n. As a result, each element in the array matches the regular expression pattern.
End of String Only
The \z anchor specifies that a match must occur at the end of the input string. Like the \Z language element, \z ignores the RegexOptions::Multiline option. Unlike the \Z language element, \z is not satisfied before a \n character at the end of a string. Therefore, it matches only at the very end of the input string.
The following example uses the \z anchor in a regular expression that is otherwise identical to the example in the previous section, which extracts information about the years during which some professional baseball teams existed. The example tries to match each of five elements in a string array with the regular expression pattern ^((\w+(\s*)){2,}),\s(\w+\s\w+),(\s\d{4}(-(\d{4}|present))*,*)+\r?\z. Two of the strings end with carriage return and line feed characters, one ends with a line feed character, and two end with neither a carriage return nor a line feed character. As the output shows, only the strings without a carriage return or line feed character match the pattern.
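The difference between the two end-of-string anchors can be sketched in Python, with one important caveat: Python's re module spells these anchors differently from .NET. In this sketch, Python's \Z (very end of the string only) plays the role of .NET's \z, and the lookahead (?=\n?\Z) emulates .NET's \Z. That mapping, the constant names, and the sample strings are all assumptions of this example, not .NET syntax.

```python
import re

# Mapping used in this sketch (Python on the right):
#   .NET \z  ->  \Z          very end of the string only
#   .NET \Z  ->  (?=\n?\Z)   end of string, or before one final line feed
END_ONLY = r"\Z"
END_OR_BEFORE_FINAL_LF = r"(?=\n?\Z)"

# Hypothetical inputs: no line ending, LF ending, CR/LF ending.
samples = ["1960", "1960\n", "1960\r\n"]

def found(prefix, anchor):
    """Report, per sample, whether prefix + anchor matches somewhere."""
    pattern = re.compile(prefix + anchor)
    return [pattern.search(s) is not None for s in samples]

# Emulated .NET \Z without \r?: the CR/LF sample fails.
z_upper_plain = found(r"\d{4}", END_OR_BEFORE_FINAL_LF)

# Emulated .NET \r?\Z: all three samples match.
z_upper_cr = found(r"\d{4}\r?", END_OR_BEFORE_FINAL_LF)

# Emulated .NET \r?\z: only the sample with no line ending matches,
# because a trailing line feed is never skipped.
z_lower_cr = found(r"\d{4}\r?", END_ONLY)

print(z_upper_plain, z_upper_cr, z_lower_cr)
```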
Contiguous Matches
The \G anchor specifies that a match must occur at the point where the previous match ended. When you use this anchor with the Regex::Matches or Match::NextMatch method, it ensures that all matches are contiguous.
The following example uses a regular expression to extract the names of rodent species from a comma-delimited string.
The regular expression \G(\w+\s?\w*),? is interpreted as shown in the following table.
| Pattern | Description |
|---|---|
| \G | Begin where the last match ended. |
| \w+ | Match one or more word characters. |
| \s? | Match zero or one space. |
| \w* | Match zero or more word characters. |
| (\w+\s?\w*) | Match one or more word characters followed by zero or one space, followed by zero or more word characters. This is the first capturing group. |
| ,? | Match zero or one occurrence of a literal comma character. |
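Python's re module has no \G anchor, so the following sketch emulates the contiguity requirement instead of using \G directly: each match is attempted with Pattern.match(string, pos), which succeeds only if the match begins exactly at pos, and pos is then advanced to the end of that match. The function name contiguous_items and the sample strings are hypothetical.

```python
import re

# The pattern from the text without the leading \G; contiguity is
# enforced by the loop below rather than by an anchor.
ITEM = re.compile(r"(\w+\s?\w*),?")

def contiguous_items(text):
    """Collect items as long as each match starts where the last ended."""
    items, pos = [], 0
    while pos < len(text):
        m = ITEM.match(text, pos)  # must match exactly at pos
        if m is None:
            break                  # gap found: stop, as \G would
        items.append(m.group(1))
        pos = m.end()
    return items

clean = "capybara,squirrel,guinea pig,prairie dog,rat"
broken = "capybara,squirrel; guinea pig,rat"

all_items = contiguous_items(clean)
until_gap = contiguous_items(broken)

# For comparison: an unanchored search skips over the "; " gap.
unanchored = ITEM.findall(broken)

print(all_items)
print(until_gap)
print(unanchored)
```

The comparison at the end shows why contiguity matters: without it, a malformed separator is silently skipped and the items after it are still returned.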
Word Boundary
The \b anchor specifies that the match must occur on a boundary between a word character (the \w language element) and a non-word character (the \W language element). Word characters consist of alphanumeric characters and underscores; a non-word character is any character that is not alphanumeric or an underscore. (For more information, see Character Classes.) The match may also occur on a word boundary at the beginning or end of the string.
The \b anchor is frequently used to ensure that a subexpression matches an entire word instead of just the beginning or end of a word. The regular expression \bare\w*\b in the following example illustrates this usage. It matches any word that begins with the substring "are". The output from the example also illustrates that \b matches both the beginning and the end of the input string.
The regular expression pattern is interpreted as shown in the following table.
| Pattern | Description |
|---|---|
| \b | Begin the match at a word boundary. |
| are | Match the substring "are". |
| \w* | Match zero or more word characters. |
| \b | End the match at a word boundary. |
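The pattern can be tried out with a short sketch. It uses Python's re module as a stand-in for the .NET Regex class (an assumption of this example; \b behaves the same way for this input), and the input string is hypothetical sample text.

```python
import re

text = "area bare arena mare"

# With \b on both sides, "are" must start a word.
whole_words = re.findall(r"\bare\w*\b", text)

# Without the boundaries, "are" inside "bare" and "mare" also matches.
anywhere = re.findall(r"are\w*", text)

print(whole_words)
print(anywhere)
```

The first match starts at the beginning of the string, which shows that the start of the input counts as a word boundary when the first character is a word character.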
Non-Word Boundary
The \B anchor specifies that the match must not occur on a word boundary. It is the opposite of the \b anchor.
The following example uses the \B anchor to locate occurrences of the substring "qu" in a word. The regular expression pattern \Bqu\w+ matches a substring that begins with a "qu" that does not start a word and that continues to the end of the word.
The regular expression pattern is interpreted as shown in the following table.
| Pattern | Description |
|---|---|
| \B | Do not begin the match at a word boundary. |
| qu | Match the substring "qu". |
| \w+ | Match one or more word characters. |
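A minimal sketch of this pattern, again using Python's re module in place of the .NET Regex class (an assumption of this example; \B behaves the same way here) and a hypothetical input string:

```python
import re

text = "equity queen equip acquaint quiet"

# \B requires that "qu" is preceded by another word character,
# so words that start with "qu" are skipped.
inner_qu = re.findall(r"\Bqu\w+", text)

# For contrast, \b finds only the words that start with "qu".
leading_qu = re.findall(r"\bqu\w+", text)

print(inner_qu)
print(leading_qu)
```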