Netting C++

Introducing Regular Expressions

Stanley B. Lippman

Code download available at: NettingC++2006_11.exe (159 KB)

One of the most significant benefits of moving from ISO-C++ to C++/CLI is the enormous Microsoft® .NET Framework Class Library (FCL) that immediately becomes available. Most readers are aware of the more glamorous domains supported by the FCL, such as Web services, networking, threads, ASP.NET, and so on. In this column, I introduce the support for regular expressions in the .NET Framework.

Because ISO-C++ is without standard support for regular expressions, my TQL application (introduced in previous Netting C++ columns) uses the standard string class for all pattern matching. This occurs, for example, with the need to normalize word occurrences within a text, such as recognizing "tried", "tries", and "trying" as instances of "try". With the move to C++/CLI, I'd like to replace the string handling with the more concise and powerful .NET regular expression support. That's what I hope to accomplish in the next column, after making my way through a brief regular expression tutorial here.

Before I look at how the FCL supports regular expressions, it would be a good idea to get some feel for what they are and how you can use them.

A regular expression is a pattern of characters and symbols representing a character sequence of some arbitrary length. For example, let's say you need to find all lines of text that begin with a certain sequence of characters. The line must begin with the numeral 5. The number can be of any length, but must be followed by a dash (5- would be the minimal matching string, but 51- and 510- and so on would also match). Following the dash must be one of the letters a, b, or c—for example, 51-a, 511100101-b, and so on. This must be followed by one or more letters or characters and must end with the sequence 2001. (I know, it seems bizarre, but these are just the kinds of sequences that a regular expression denotes.)

In order to express the rules of this sequence using a regular expression, you need a set of symbols to do the following:

First, indicate that you want to begin the search at the beginning of the line. You do this with the caret (^). So, for example, ^5 means that you want the line to begin with a literal value of 5.

Second, indicate that you want to match on a particular flavor of character. So, for example, \d means that you want to match a single digit from 0 through 9, and \D means that you want to match a single non-digit character. (Technically, \d and \D also work with international digits, as pointed out in Raymond Chen's blog post at blogs.msdn.com/86555.aspx.) \s means that you want to match a single white space character, while \S means that you want to match a single character that is not white space. \w matches any word character (a letter, a digit, or the underscore), and \W matches any character that is not a word character.

Third, indicate that you want to match on any character, regardless of its type. For example, the period (.) matches any non-newline character.

Fourth, indicate that you want to match on multiple (or no) instances of a character type. The plus operator (+) means that you want to match one or more characters of the same type. \d+, for example, matches "2", "22", "1217", and so on. The following regular expression matches any line that begins with 5 followed by one or more additional digits, followed by one or more non-digit characters followed by the literal 2001:

^5\d+\D+2001

What if you were not sure if the 5 were going to be followed by one or more digits? You'd like the flexibility to indicate that you will accept any number of digits, or no digits at all. You can do that by using the asterisk (*). For example, let's turn the previous example into one in which the 5 can be followed by zero or more digits followed by zero or more non-digit characters, but ending with the literal 2001:

^5\d*\D*2001

Fifth, indicate that you want to match on a fixed number of characters. You do this by following the character type with a count enclosed in braces. For example, the following regular expression requires three digits followed by a hyphen followed by four digits (like a telephone number), such as 375-4128:

\d{3}-\d{4}

Finally, indicate that you want to match against one of a set of different characters. You do this by placing a set of alternative characters within parentheses, separated by an OR-bar (|). For example, the expression (a|e|i|o|u) means that you want to match against one of the five English vowels. Adding the plus operator, as in (a|e|i|o|u)+, means that you want to match against one or more consecutive occurrences of the five English vowels. Using the asterisk instead, as in (a|e|i|o|u)*, means that you want to allow for no occurrences as well.
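The symbols above can be tried out with a rough sketch in ISO C++. Note the assumptions: it uses std::regex, which entered the standard with C++11 (after this column was written), rather than the .NET Regex class, and it relies on the default ECMAScript grammar sharing the symbols introduced so far. The helper lineMatches and the constant kSequence are hypothetical names chosen for illustration. The assembled pattern follows the rules of the running example but does not anchor the end of the line, since no symbol for that has been introduced here.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper (not from the column): true when `pattern` matches
// somewhere in `line`. A caret in the pattern pins the match to the start.
bool lineMatches(const std::string& line, const std::string& pattern) {
    const std::regex re(pattern);  // default grammar: ECMAScript
    return std::regex_search(line, re);
}

// The running example: a leading 5, optional further digits, a dash,
// one of a/b/c, one or more characters, then the literal 2001.
const char* const kSequence = R"(^5\d*-(a|b|c).+2001)";

void checkSymbolExamples() {
    assert(lineMatches("51-aXY2001", kSequence));
    assert(!lineMatches("41-aXY2001", kSequence));  // does not begin with 5
    assert(!lineMatches("5-b2001", kSequence));     // nothing between b and 2001
    assert(lineMatches("527ar2001", R"(^5\d+\D+2001)"));
    assert(!lineMatches("5abc2001", R"(^5\d+\D+2001)"));  // \d+ needs a digit
    assert(lineMatches("5abc2001", R"(^5\d*\D*2001)"));   // \d* accepts none
    assert(lineMatches("375-4128", R"(\d{3}-\d{4})"));
}
```

Raw string literals keep the backslashes readable; in an ordinary literal each one would have to be doubled.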

Regular expressions take some getting used to. In the beginning, they seem quite complicated because they offer such compact notation. Figure 1 shows the run of a small regular expression tester-outer program (note that my Console input is whatever follows a (Y/N/?) question or the ==> and **> prompts).

Figure 1 Regular Expression Tester

Would you like to enter a string to match against? (Y/N/?) y
Please enter a string, or ‘quit’ to exit.
==> 5abc2001
Would you like to change regular expressions? (Y/N/?) y
Please enter regular expression:
**> ^5\d*(a|d|e)\w+2001
original string: 5abc2001
attempt to match: ^5\d*(a|d|e)\w+2001
The characters 5abc2001 match beginning at position 0

Would you like to enter a string to match against? (Y/N/?) y
Please enter a string, or ‘quit’ to exit.
==> 527ar2001
Would you like to change regular expressions? (Y/N/?) n
original string: 527ar2001
attempt to match: ^5\d*(a|d|e)\w+2001
The characters 527ar2001 match beginning at position 0

Of course, there can be multiple matches as well. For example:

original string: r24d2
attempt to match: \d+
The characters 24 match beginning at position 1
The characters 2 match beginning at position 4

So let's try our hand at programming regular expressions. First, let me show you the code that does the actual matching, and then I'll explain what's going on. In Figure 2, System::Text::RegularExpressions is the namespace within which the regular expression support is contained.

Figure 2 Building Regular Expressions

using namespace System;
using namespace System::Text::RegularExpressions;

void doTheMatch( String^ inputString, String^ filter )
{
    Console::WriteLine( "original string: {0}", inputString );
    Console::WriteLine( "attempt to match: {0}", filter );

    Regex^ regex = gcnew Regex( filter );
    Match^ match = regex->Match( inputString );

    if ( ! match->Success )
    {
        Console::WriteLine( "Sorry, no match of {0} in {1}",
                            filter, inputString );
        return;
    }

    // report the first match and each subsequent one
    for ( ; match->Success; match = match->NextMatch() )
    {
        Console::WriteLine( "The characters {0} match beginning at position {1}",
                            match->ToString(), match->Index );
    }
}

The Regex class represents the regular expression. You pass its constructor the string representation of the expression. Once the Regex object is constructed, its associated regular expression is immutable—that is, you cannot change it. So each regular expression within your program requires its own Regex object.

The Match method performs the actual matching algorithm of the regular expression against its string argument. It returns a Match class object that holds the results of the pattern matching. The Match object is also immutable.

To discover whether the match succeeded, query the Success property of the Match class. Each match is spoken of as a capture. The Index property returns the position in the original string where the first character of the captured substring was found. Length returns the length of the captured substring. The ToString method returns the captured substring.
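For comparison, the same idea of walking successive matches and reading each one's text and position can be sketched in ISO C++. This is an illustration under stated assumptions, not the column's code: it uses std::regex and std::sregex_iterator (standardized in C++11) in place of the .NET Regex and Match classes, and findAll is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: collects every non-overlapping match of `pattern`
// in `input` as (matched text, starting position) pairs.
std::vector<std::pair<std::string, std::size_t>>
findAll(const std::string& input, const std::string& pattern) {
    const std::regex re(pattern);
    std::vector<std::pair<std::string, std::size_t>> found;
    // A default-constructed sregex_iterator is the end sentinel, much as a
    // Match whose Success is false marks the end of the matches.
    for (std::sregex_iterator it(input.begin(), input.end(), re), end;
         it != end; ++it) {
        found.emplace_back(it->str(),
                           static_cast<std::size_t>(it->position()));
    }
    return found;
}

void checkFindAll() {
    // Same input and pattern as the multiple-match run shown earlier.
    const auto hits = findAll("r24d2", R"(\d+)");
    assert(hits.size() == 2);
    assert(hits[0].first == "24" && hits[0].second == 1);
    assert(hits[1].first == "2" && hits[1].second == 4);
}
```

Here it->str() corresponds roughly to ToString, it->position() to Index, and it->length() to Length.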

Here is a typical for-loop to collect and process the collection of matching patterns:

for ( Match^ match = regex->Match( inputString );
      match->Success;
      match = match->NextMatch() )
{
    ...
}

The Match object holds the results of the first capture. If the regular expression captures multiple substrings, use NextMatch to access the second and each subsequent capture. Before you actually manipulate the next object, you need to test that it represents a success. A sentinel Match object for which Success evaluates to false marks the end of the captured substrings. Consider the following three lines:

5040 bez( 99, -3.194, 43.8, 85 )
4930.7823 bez( 10.7, 19.59, -20, -20.48 )
-5123 bez( -3.5, 2.46, 89, 0.02 )

These represent samples of lines that you need to match. First, you have to come up with a regular expression that can match each of these lines.

As you can see, each line begins with a number. The number can be either positive or negative and can be either an integer or a floating point value. The number is followed by a space, then the literal substring "bez". Four comma-separated numbers, enclosed within parentheses, follow that. These numbers can be negative or positive and they can be either integers or floating point values. Before you look at my solution, try your hand at coming up with a regular expression that captures each of these lines in its entirety.

Once you have the regular expression, you're still not finished. The next problem is, how do you gain access to the individual parts of the line? That is, the regular expression captures the entire string and you now need to pick it apart in order to access the five numeric fields.

The regular expression syntax supports a grouping mechanism in which you assign index numbers to particular subfields of the match. You can subsequently use these numbers to access the subfields. For example, the following identifies a group associated with the index 1 using the special ?<1> syntax:

(?<1>(-?\d+\.\d+)|(-?\d+))

Can you read this? It represents an alternate pair of regular expressions. The first one matches a floating point number that may or may not be negative (the ? character is used to specify zero or one matches):

-?\d+\.\d+

The second matches an integer value that also may or may not be negative:

-?\d+

The entire regular expression, with its five identified groups, is shown in Figure 3. For clarity, I've broken it up and identified each subfield.

Figure 3 The Assembled Regular Expression

String^ filter =
    // the number before the bez literal
    "(?<1>(-?\\d+\\.\\d+)|(-?\\d+))"
    // arbitrary white space, bez literal, open paren,
    // and any white space after the paren
    "\\s*bez\\(\\s*"
    // the four internal numeric values, each literal comma
    // followed by optional white space
    "(?<2>(-?\\d+\\.\\d+)|(-?\\d+)),\\s*"
    "(?<3>(-?\\d+\\.\\d+)|(-?\\d+)),\\s*"
    "(?<4>(-?\\d+\\.\\d+)|(-?\\d+)),\\s*"
    "(?<5>(-?\\d+\\.\\d+)|(-?\\d+))"
    ;

Now attempt the match on the line of text:

Regex^ regex = gcnew Regex( filter );
Match^ match = regex->Match( line );

If the match is successful, you then need to grab each of the five numeric subfields and translate them into values of type float:

float loc        = Convert::ToSingle( match->Groups[1]->ToString() );
float m_xoffset1 = Convert::ToSingle( match->Groups[2]->ToString() );
float m_yoffset1 = Convert::ToSingle( match->Groups[3]->ToString() );
float m_xoffset2 = Convert::ToSingle( match->Groups[4]->ToString() );
float m_yoffset2 = Convert::ToSingle( match->Groups[5]->ToString() );

The Group class represents a capturing group within the returned Match class object. You access each Group object through the Groups property of the Match object, indexing it with the group's associated number. The ToString method returns the captured substring. In this case, you pass each string to the Convert::ToSingle conversion method to transform the value into type float.
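The capture-then-convert step can also be sketched in ISO C++. Several things here are assumptions rather than the column's method: std::regex stands in for the .NET Regex class, groups are numbered by position because the ECMAScript grammar does not accept the ?<1> naming syntax, optional white space is allowed around the parentheses and commas so that the sample lines match, and parseBezLine is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: extracts the five numeric fields of a line such as
// "5040 bez( 99, -3.194, 43.8, 85 )". Returns false when nothing matches.
bool parseBezLine(const std::string& line, std::vector<float>& fields) {
    // One captured numeric field: a signed float or a signed integer.
    const std::string num = R"((-?\d+\.\d+|-?\d+))";
    const std::string sep = R"(\s*,\s*)";
    const std::regex re(num + R"(\s*bez\(\s*)" + num + sep + num + sep +
                        num + sep + num + R"(\s*\))");
    std::smatch m;
    if (!std::regex_search(line, m, re)) return false;
    fields.clear();
    for (std::size_t i = 1; i < m.size(); ++i)  // m[0] is the whole match
        fields.push_back(std::stof(m[i].str()));
    return true;
}

void checkParseBezLine() {
    std::vector<float> f;
    assert(parseBezLine("5040 bez( 99, -3.194, 43.8, 85 )", f));
    assert(f.size() == 5);
    assert(f[0] == 5040.0f && f[1] == 99.0f && f[4] == 85.0f);
    assert(std::fabs(f[2] + 3.194f) < 1e-4f);
    assert(!parseBezLine("5040 bez( 99, -3.194 )", f));  // too few fields
}
```

In this sketch m[i].str() plays the part of the Group object's ToString method, and std::stof the part of the conversion to float.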

A useful Regex class method is Split. Just like the String class Split method, it returns a string array. Unlike the String method, it separates the input string based on a regular expression rather than a set of characters:

String^ textLine = "Danny%Lippman%%Point Guard%Shooting Guard%%floater";
String^ splitMe = "%+";

Regex^ regex = gcnew Regex( splitMe );
for each ( String^ capture in regex->Split( textLine ))
    Console::WriteLine( "capture: {0}", capture );

In this example, the textLine is being split at each point where one or more percent characters (%) appear. When executed, this generates the following output:

capture: Danny
capture: Lippman
capture: Point Guard
capture: Shooting Guard
capture: floater
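A comparable split can be sketched in ISO C++, assuming std::regex and std::sregex_token_iterator (both from C++11) in place of the .NET Split method; splitOn is a hypothetical helper name. Passing -1 as the submatch index asks the iterator for the text between matches rather than the matches themselves.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: splits `text` wherever `pattern` matches and returns
// the pieces that lie between the matches.
std::vector<std::string> splitOn(const std::string& text,
                                 const std::string& pattern) {
    const std::regex re(pattern);
    std::vector<std::string> parts;
    // Submatch index -1 selects the unmatched stretches between separators.
    for (std::sregex_token_iterator it(text.begin(), text.end(), re, -1), end;
         it != end; ++it) {
        parts.push_back(it->str());
    }
    return parts;
}

void checkSplitOn() {
    const std::vector<std::string> expected = {
        "Danny", "Lippman", "Point Guard", "Shooting Guard", "floater"};
    assert(splitOn("Danny%Lippman%%Point Guard%Shooting Guard%%floater",
                   "%+") == expected);
}
```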

Another useful Regex class method is Replace, which allows you to replace captured substrings with alternative text. Figure 4 shows a simple example.

When compiled and executed, this generates the following output—I've reformatted it slightly to display better:

original text: XP.109 is currently in alpha. XP.109 represents a staggering leap forward
regular expression : XP.\d+
replacement text: ToonShooter is currently in alpha. ToonShooter represents a staggering leap forward
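The substitution behind this output can be approximated in ISO C++ as a minimal sketch. It assumes std::regex_replace (C++11) as a stand-in for the .NET Replace method and is not the column's own listing; replaceAll is a hypothetical helper name. The pattern and texts are taken from the output above.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: replaces every match of `pattern` in `text` with
// `replacement`. A '$' in the replacement would be read as a format
// specifier by std::regex_replace, so this sketch assumes it has none.
std::string replaceAll(const std::string& text, const std::string& pattern,
                       const std::string& replacement) {
    return std::regex_replace(text, std::regex(pattern), replacement);
}

void checkReplaceAll() {
    const std::string original =
        "XP.109 is currently in alpha. "
        "XP.109 represents a staggering leap forward";
    const std::string expected =
        "ToonShooter is currently in alpha. "
        "ToonShooter represents a staggering leap forward";
    assert(replaceAll(original, R"(XP.\d+)", "ToonShooter") == expected);
}
```

Note that the period in XP.\d+ is unescaped, so it matches any character and not only a literal dot; writing XP\.\d+ would be stricter.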

The programs used to generate this output are available for download. The only thing left to accomplish is to replace all the uses of the String class within TQL with the .NET Framework regular expression support. In the next column, I'll do just that. Until then, may all your programs run optimally. Cheers.

Send your questions and comments for Stanley to purecpp@microsoft.com.

Stanley B. Lippman began working on C++ with its inventor, Bjarne Stroustrup, in 1984 at Bell Laboratories. Later, Stan worked in feature animation both at Disney and DreamWorks and served as a Software Technical Director on Fantasia 2000. He has since served as Distinguished Consultant with JPL, and an Architect with the Visual C++ team at Microsoft.