One of the most important features of regular expressions is the ability to store part of a matched pattern for later reuse. As you may recall, placing parentheses around a regular expression pattern or part of a pattern causes that part of the expression to be stored into a temporary buffer. You can override the capture by using the non-capturing metacharacters ?:, ?=, or ?!.
Each captured submatch is stored as it is encountered from left to right in a regular expression pattern. The buffer numbers begin at one and continue up to a maximum of 99 captured subexpressions. Each buffer can be accessed using \n, where n is one or two decimal digits identifying a specific buffer.
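The numbering rules above can be sketched as follows. This is a minimal illustration, not part of the original text: the function names captureGroups and hasDoubledLetter and the sample strings are hypothetical.

```javascript
// Capturing groups are numbered left to right, starting at 1.
// (?:...) groups without capturing, so it does not consume a buffer number.
function captureGroups(text) {
  const match = /(\d{4})(?:-)(\d{2})/.exec(text);
  if (match === null) {
    return null;
  }
  return {
    whole: match[0],              // the entire matched text
    first: match[1],              // buffer 1: the four digits
    second: match[2],             // buffer 2: the two digits
    groupCount: match.length - 1  // number of capturing groups in the pattern
  };
}

// \1 inside the pattern must match the same text that buffer 1 captured.
function hasDoubledLetter(word) {
  return /([a-z])\1/.test(word);
}

console.log(captureGroups("2024-05"));
console.log(hasDoubledLetter("letter"), hasDoubledLetter("word"));
```

Because the hyphen sits inside a non-capturing group, the two-digit part lands in buffer 2 rather than buffer 3.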
One of the simplest, most useful applications of back references provides the ability to locate the occurrence of two identical, adjacent words in text. Take the following sentence:
Is is the cost of of gasoline going up up?
The above sentence clearly has several duplicated words. It would be nice to devise a way to fix that sentence without looking for duplicates of every single word. The following regular expression uses a single subexpression to do that:
/\b([a-z]+) \1\b/gim
The captured expression, as specified by [a-z]+, includes one or more alphabetic characters. The second part of the regular expression is the reference to the previously captured submatch, that is, the second occurrence of the word just matched by the parenthetical expression. \1 specifies the first submatch. The word boundary metacharacters ensure that only whole words are detected. Otherwise, a phrase such as "is issued" or "this is" would be incorrectly identified by this expression.
The global flag (g) following the regular expression indicates that the expression is applied to as many matches as it can find in the input string. The case insensitivity flag (i) makes the match ignore letter case, so "Is is" counts as a duplicate. The multiline flag (m) makes ^ and $ match at the start and end of each line; because this pattern uses neither anchor, the flag does not change the result here.
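To see what the word boundaries and the i flag contribute, the duplicate-word pattern /\b([a-z]+) \1\b/ can be compared with variants that omit them. This is a sketch only; the helper findRepeats and the variable names are hypothetical.

```javascript
// The duplicate-word pattern, plus two variants for comparison.
const withBoundaries = /\b([a-z]+) \1\b/gi;
const withoutBoundaries = /([a-z]+) \1/gi;      // no \b: partial words can match
const caseSensitive = /\b([a-z]+) \1\b/g;       // no i flag: "Is is" is missed

// With the g flag, String.prototype.match returns every match, or null if none.
function findRepeats(pattern, text) {
  return text.match(pattern) || [];
}

const sentence = "Is is the cost of of gasoline going up up?";

console.log(findRepeats(withBoundaries, sentence));
console.log(findRepeats(caseSensitive, sentence));
console.log(findRepeats(withBoundaries, "This is issued"));
console.log(findRepeats(withoutBoundaries, "This is issued"));
```

Without the boundaries, the "is" at the end of "This" followed by the word "is" is reported as a duplicate, which is the false positive the text warns about.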
Using the above regular expression, the following code can use the submatch information to replace an occurrence of two consecutive identical words in a string of text with a single occurrence of the same word:
var ss = "Is is the cost of of gasoline going up up?.\n";
var re = /\b([a-z]+) \1\b/gim;     // Create regular expression pattern.
var rv = ss.replace(re, "$1");     // Replace two occurrences with one.
The use of the $1 within the replace method refers to the first saved submatch. If you had more than one submatch, you would refer to them consecutively by using $2, $3, and so on.
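When a pattern has more than one submatch, $1 and $2 can appear in any order in the replacement string. The following sketch swaps two captured words; the function name swapNames and the "Last, First" input format are assumptions made for illustration.

```javascript
// Rewrite "Last, First" as "First Last".
// $1 refers to the first captured word, $2 to the second.
function swapNames(text) {
  return text.replace(/(\w+), (\w+)/, "$2 $1");
}

console.log(swapNames("Doe, John"));
```

The replacement string is built from the saved submatches, so the comma, which lies outside both sets of parentheses, is dropped.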
Back references can also break down a Uniform Resource Identifier (URI) into its component parts. Assume that you want to break down the following URI to the protocol (ftp, http, and so on), the domain address, and the page/path:
http://msdn.microsoft.com:80/scripting/default.htm
The following regular expression provides that functionality:
/(\w+):\/\/([^/:]+)(:\d*)?([^# ]+)/
The first parenthetical subexpression captures the protocol part of the Web address. That subexpression matches any word that precedes a colon and two forward slashes. The second parenthetical subexpression captures the domain address part of the address. That subexpression matches any sequence of characters that does not include / or : characters. The third parenthetical subexpression captures a port number if one is specified. That subexpression matches zero or more digits following a colon. Finally, the fourth parenthetical subexpression captures the path and/or page information specified by the Web address. That subexpression matches one or more characters other than # or the space character.
Applying the regular expression to the above URI, the submatches contain the following:
- RegExp.$1 contains "http"
- RegExp.$2 contains "msdn.microsoft.com"
- RegExp.$3 contains ":80"
- RegExp.$4 contains "/scripting/default.htm"
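The same breakdown can be sketched with the array returned by exec instead of the RegExp.$n properties. The pattern below is reconstructed from the description of the four subexpressions above, so treat its exact form as an assumption; the function name splitUri is hypothetical.

```javascript
// Split a URI into protocol, domain, optional port, and path.
// Assumption: the pattern is rebuilt from the prose description of the
// four parenthetical subexpressions, not copied from an authoritative source.
function splitUri(uri) {
  const pattern = /(\w+):\/\/([^/:]+)(:\d*)?([^# ]+)/;
  const m = pattern.exec(uri);
  if (m === null) {
    return null;
  }
  return {
    protocol: m[1], // word before "://"
    domain: m[2],   // characters up to the next "/" or ":"
    port: m[3],     // ":" plus digits, or undefined when no port is given
    path: m[4]      // remaining characters other than "#" or a space
  };
}

console.log(splitUri("http://msdn.microsoft.com:80/scripting/default.htm"));
```

Returning the submatches from the match array keeps the result tied to one call, whereas the RegExp.$n properties always reflect whichever match ran most recently.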