Regular Expressions as a Language

Updated: May 2010

The regular expression language is designed and optimized to manipulate text. The language comprises two basic character types: literal (normal) text characters and metacharacters. The set of metacharacters gives regular expressions their processing power.

You are probably familiar with the ? and * metacharacters used with the DOS file system, where ? represents any single character and * represents any group of characters. The DOS command COPY *.DOC A: tells the file system to copy every file with a .DOC file name extension to the disk in drive A. The metacharacter * stands in for any file name in front of the file name extension .DOC. Regular expressions extend this basic idea many times over, providing a large set of metacharacters that make it possible to describe very complex text-matching expressions with relatively few characters.
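To make the comparison concrete, the following sketch translates a DOS-style wildcard into a roughly equivalent regular expression. It is written in JavaScript for illustration only. The helper name wildcardToRegExp is hypothetical, and the translation is simplified: it ignores DOS-specific rules such as 8.3 file names.

```javascript
// Hypothetical helper (not part of any library): builds a regular expression
// that behaves approximately like a DOS wildcard such as *.DOC.
function wildcardToRegExp(wildcard) {
  const source = wildcard
    .replace(/[.+^${}()|[\]\\]/g, "\\$&") // escape regex metacharacters, leave * and ? alone
    .replace(/\*/g, ".*")                  // * stands for any run of characters
    .replace(/\?/g, ".");                  // ? stands for exactly one character
  // ^ and $ anchor the pattern to the whole name; "i" ignores case,
  // on the assumption that DOS file names are not case-sensitive.
  return new RegExp(`^${source}$`, "i");
}

const docFiles = wildcardToRegExp("*.DOC");
console.log(docFiles.source);
console.log(docFiles.test("REPORT.DOC"));
console.log(docFiles.test("REPORT.TXT"));
```

The wildcard needs only two metacharacters, whereas the generated expression already uses several more (^, $, ., * and the \. escape), which hints at how much more the regular expression language can describe.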

For example, the regular expression \s2000, when applied to a body of text, matches every occurrence of the string "2000" that is preceded by a white-space character, such as a space or a tab. The white-space character is part of each match.
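As a minimal sketch, the \s2000 pattern can be tried against a short sample string. The example uses JavaScript rather than the .NET Framework classes, and the sample text is invented for illustration. The .NET Framework engine may differ in which characters it treats as white space.

```javascript
// The g flag asks for every occurrence, not just the first one.
const yearPattern = /\s2000/g;

// Invented sample text: "2000" appears after a space, directly after letters,
// after a tab, and inside the larger number 12000.
const sample = "In 2000, sales rose.\tBy2000 they fell; FY\t2000 closed at 12000.";

// matchAll yields one match object per occurrence; index is where the match starts.
const found = [...sample.matchAll(yearPattern)].map(m => ({
  index: m.index,
  value: JSON.stringify(m[0]), // JSON.stringify shows a tab as \t so it is visible
}));

console.log(found.length); // 2
console.log(found);
```

Only the occurrences that follow a space or a tab are reported. "By2000" and "12000" are skipped because the character before "2000" is not white space.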


   Note: If you are using C++, C#, or JScript, special character escapes, such as \s, must be preceded by an additional backslash (for example, "\\s2000") to signal that the backslash in the character escape is a literal character and is not part of an escape sequence supported by these programming languages. If the escape sequence is not valid in a particular programming language (as \s is not), the result depends on the language: C# reports a compiler error, whereas JScript silently drops the backslash. If the escape sequence is valid, the character that it represents in the programming language, rather than the regular expression character escape, is included in the regular expression. For example, the string literal "\t2000" produces a string that contains an embedded tab character followed by "2000". You do not have to add the backslash if you are using Visual Basic 2005, because it does not support escape sequences. If you are using C#, you can use verbatim string literals, which are prefixed with @ and disable escaping (for example, @"\s2000").
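The effect of the extra backslash can be sketched in JavaScript, the modern counterpart of JScript. This is an illustration under that assumption, not .NET Framework code. String.raw is used here as a rough JavaScript analogue of the C# @ prefix, and the variable names are invented.

```javascript
// Doubling the backslash puts a literal \ into the string, so the regular
// expression engine receives the two characters \ and s.
const doubled = "\\s2000";          // 6 characters: \ s 2 0 0 0

// String.raw disables escape processing, much like @"\s2000" in C#.
const verbatim = String.raw`\s2000`;

// \t is a valid language escape, so the string holds a real tab character.
const tabbed = "\t2000";            // 5 characters: tab 2 0 0 0

console.log(doubled === verbatim);                  // true
console.log(new RegExp(doubled).test("year 2000")); // true: \s matches the space
console.log(new RegExp(tabbed).test("year 2000"));  // false: pattern needs a literal tab
console.log(new RegExp(tabbed).test("year\t2000")); // true
```

The last two lines show why the distinction matters: a pattern built from "\t2000" matches only a tab, not white space in general.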

Regular expressions can also perform more complex searches. For example, the regular expression (?<char>\w)\k<char>, which uses a named group and a backreference, searches for adjacent pairs of identical characters. When applied to the string "I'll have a small coffee", it finds matches in the words "I'll", "small", and "coffee". (For details on this regular expression, see Backreferences.)
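The backreference example can be reproduced with a short sketch. JavaScript accepts the same named-group and \k<name> syntax for this pattern, so it is used here in place of the .NET Framework classes. The variable names are illustrative only.

```javascript
// (?<char>\w) captures one word character in a group named "char";
// \k<char> then requires the same character to appear again immediately.
const pairPattern = /(?<char>\w)\k<char>/g;

const sentence = "I'll have a small coffee";

// Collect the full text of every match (each one is a doubled character).
const pairs = [...sentence.matchAll(pairPattern)].map(m => m[0]);

console.log(pairs); // [ 'll', 'll', 'ff', 'ee' ]
```

The word "coffee" contributes two matches ("ff" and "ee"), so the three words yield four matches in total.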

The following sections detail the set of metacharacters that define the .NET Framework regular expression language and show how to use the regular expression classes to implement regular expressions in your applications.




Change history, May 2010: Revised the note on escape sequences in response to customer feedback.
