Specification: .NET Framework, Character Class Subtraction in RegEx

 

Kit George

May 2004

Applies to:
   Microsoft® Visual Studio® .NET

Summary: Character class subtraction has been identified as a key feature for supporting RegEx, and meeting compliance goals and requirements of the XSD specification. This specification outlines the introduction of this feature. (5 printed pages)

Contents

Introduction
Functional Design
Compatibility

Introduction

The XSD specification defines guidelines for a number of regular expression constructs, including character class subtraction. The specification can be found at the following website, under section F.1, item 16:

http://www.w3.org/TR/xmlschema-2/

The XSD specification describes a capability that the Framework's regular expressions do not support today: subtracting a group of characters from a previously identified character group. This is referred to as character class subtraction. The goal of this specification is to fulfill that requirement.

Scenarios

Scenarios include any situation where you wish to utilize Character class subtraction. There is no specific new capability exposed by this feature, simply a significantly easier way to do what you could do before.

Reference Material

The XSD specification for this topic can be found under section F.1, item 16 of this document:

http://www.w3.org/TR/xmlschema-2/

Requirements

It is a requirement for this scenario that the XSD specification be met. Note that if the XSD specification contradicts the Perl 5.6 implementation in any way, the Perl 5.6 implementation takes precedence, since the existing regular expression implementation is specifically designed to meet the Perl 5.6 standard.

Functional Design

As they stand today, the regular expression libraries do not support the concept of character class subtraction, as specified in the XSD specification.

Note   There are no public application programming interface (API) additions as part of this work. The proposed changes affect only the existing Regex class, and any static members in the regular expression namespace that accept a pattern. Specifically, any API that accepts a pattern parameter (which is how this feature is exposed) may be affected.

Consider the following character expression:

[a-e]

This will match any character in the specified range: a, b, c, d, or e. Character class subtraction allows you to specify this same range, followed by a set of characters to exclude from the match, such as:

[a-e-[bd]]

This would match a, c, and e, but not b or d, because they were specified as exclusions.
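
As a minimal sketch of this behavior, the following program runs both patterns over the string "abcde" and collects the characters that match. It assumes a .NET runtime in which character class subtraction is available; the class and method names (BasicSubtraction, MatchedChars) are illustrative only and are not part of this specification.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class BasicSubtraction
{
    // Returns every substring of input matched by pattern,
    // concatenated in order of appearance.
    public static string MatchedChars(string pattern, string input)
    {
        return string.Concat(
            Regex.Matches(input, pattern).Cast<Match>().Select(m => m.Value));
    }

    public static void Main()
    {
        // Plain range: every character a through e matches.
        Console.WriteLine(MatchedChars("[a-e]", "abcde"));       // abcde

        // Same range with b and d subtracted.
        Console.WriteLine(MatchedChars("[a-e-[bd]]", "abcde"));  // ace
    }
}
```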

The following defines what we need to support.

Standard Subtraction

When defined within a character class (defined by a region between an open and close square bracket), character class subtraction will be supported. A subtracted character class is a character class directly preceded by a minus sign. If specified, a subtracted character class must follow an existing character class specification. For example:

[a-e-[bd]] Valid. The subtracted character class (-[bd]) follows a standard character class.
[a-e-[mn]] Valid. Even though the subtracted characters are not in the range of the standard character class, this can simply be interpreted as an empty subtraction.
[a-e-[dn]] Valid. This is a variant of the previous example. In this situation, some of the subtractions are accurate, and others are not. Any characters in the range will be subtracted, and any not in the range will be ignored.
[a-e-[ dn]] Valid. The space in this context is simply treated as a valid subtraction character.
[-[e-f]] Valid, not a subtraction. Although this is a valid expression, it is not a subtraction. The elements are not being subtracted from any specific set, and standard regular expression rules will apply. The expression is 'any -, [, or e through f character, followed by a ]'.
[-e-f] Valid, not a subtraction. Although this is a valid expression, it is not a subtraction. Instead, this is an expression looking for the minus sign, or any letter e through f.
[-[e-f]a-z] Valid, not a subtraction. As above, this is not subtraction, but it is a valid expression.
[a-e[-e-f]] Valid, not a subtraction. As above, this is not subtraction, but it is a valid expression.
[a-e - m-s] Valid, not a subtraction. As above, this is not subtraction, but it is a valid expression. In this situation, the expression is 'any a through e character, any space, minus sign, or m through s character'.
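
A few of the rows above can be checked with a short program. This is a sketch under the assumption that the runtime already implements subtraction as described; the helper name AllMatches and the input strings are invented for illustration, and only a subset of the rows is exercised.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class StandardSubtractionRows
{
    // Returns all matches of pattern in input, separated by '|'.
    public static string AllMatches(string pattern, string input)
    {
        return string.Join("|",
            Regex.Matches(input, pattern).Cast<Match>().Select(m => m.Value));
    }

    public static void Main()
    {
        // Subtracted characters outside the base range: nothing is removed.
        Console.WriteLine(AllMatches("[a-e-[mn]]", "abcdemn"));   // a|b|c|d|e

        // Mixed case: d is removed, n is ignored.
        Console.WriteLine(AllMatches("[a-e-[dn]]", "abcden"));    // a|b|c|e

        // Leading minus sign: a literal '-', plus the range e-f.
        Console.WriteLine(AllMatches("[-e-f]", "-def"));          // -|e|f

        // Not a subtraction: a class of '-', '[', e, f, then a literal ']'.
        Console.WriteLine(AllMatches("[-[e-f]]", "e f] -] []"));  // f]|-]|[]
    }
}
```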

Embedded Subtraction

Embedded subtraction will be supported. Embedded subtraction simply allows a second subtraction to be applied to a subtracted class. Consider the following:

[a-m-[c-k-[f-g]]]

This class is simply asking for the range a through m, except for any character in the range c through k, with the exception of f or g. In this specific example, the following characters would successfully match: a, b, f, g, l, or m.

Embedded subtractions follow the same rules as standard subtractions. Any number of embedded subtractions may be used.
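
The embedded example can be sketched as follows, again assuming a runtime that implements the behavior specified here. The class and method names are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class EmbeddedSubtraction
{
    // Returns the characters of input that pattern matches, in order.
    public static string MatchedChars(string pattern, string input)
    {
        return string.Concat(
            Regex.Matches(input, pattern).Cast<Match>().Select(m => m.Value));
    }

    public static void Main()
    {
        // a through m, minus (c through k, minus f and g).
        // f and g survive because the inner subtraction removes them
        // from the set that is being subtracted.
        Console.WriteLine(
            MatchedChars("[a-m-[c-k-[f-g]]]", "abcdefghijklm"));  // abfglm
    }
}
```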

Embedded subtractions are always performed as a complete operation, before extending outwards to the wider subtraction. Therefore, any information other than the characters remaining as part of the subtraction is thrown away. Consider this example:

[a-m-[b-l-[d-i-[a-d]]]]

Working from the innermost character class, and moving outwards:

  • We subtract the range a-d from d-i, which simply leaves the range e-i
  • We subtract e-i from b-l, which leaves the ranges b-d, and j-l
  • We subtract the ranges b-d and j-l, from a-m. This leaves a, e-i, and m

To step through, resolving the innermost subtraction at each stage:

[a-m-[b-l-[d-i-[a-d]]]] =>
[a-m-[b-l-[e-i]]] =>
[a-m-[b-dj-l]] =>
[ae-im] =>
[aefghim]  
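
The inside-out evaluation can also be mirrored with ordinary set arithmetic. The sketch below uses LINQ's Except as a stand-in for one subtraction step and compares the result with what the pattern itself matches. It illustrates the semantics only and is not how the regular expression engine is implemented; all names are hypothetical, and the regex half assumes a runtime with subtraction support.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class InsideOutEvaluation
{
    // Builds the inclusive character range first..last.
    static IEnumerable<char> Range(char first, char last)
    {
        return Enumerable.Range(first, last - first + 1).Select(i => (char)i);
    }

    // Evaluates [a-m-[b-l-[d-i-[a-d]]]] one subtraction at a time,
    // starting with the innermost class.
    public static string BySetArithmetic()
    {
        var inner  = Range('d', 'i').Except(Range('a', 'd')); // e-i
        var middle = Range('b', 'l').Except(inner);           // b-d, j-l
        var outer  = Range('a', 'm').Except(middle);          // a, e-i, m
        return new string(outer.ToArray());
    }

    // Collects what the pattern itself matches in a through m.
    public static string ByRegex()
    {
        var matches = Regex.Matches("abcdefghijklm", "[a-m-[b-l-[d-i-[a-d]]]]");
        return string.Concat(matches.Cast<Match>().Select(m => m.Value));
    }

    public static void Main()
    {
        Console.WriteLine(BySetArithmetic()); // aefghim
        Console.WriteLine(ByRegex());         // aefghim
    }
}
```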

Compatibility

This change will affect the behavior of certain existing search patterns. It is believed that these search patterns will NOT be common in any usage of regular expressions, and therefore it is acceptable to change this behavior.

Impact on Current Behavior

Introducing this new behavior means that some expressions that are valid today will be interpreted differently. Consider this expression:

[a-e-[bd]]

Today, this will be interpreted as 'any a through e character, minus sign, open square bracket, b, or d, immediately followed by a close square bracket.'

Note   The closing square bracket is not a part of the character class itself, since the class is defined to the first closing bracket found.

For example, the above expression would match "b]", since this is a b, immediately followed by a close square bracket.

With character class subtraction, this behavior changes significantly. The above expression will never match the string "b]", because the expression will instead be interpreted as a subtraction.

We do need to be acutely aware of this change, and document it well for customers. The workaround, if they need to get the old behavior, is to simply modify their expression slightly, so it won't be interpreted as a subtraction. In this specific situation, either one of these example modifications would be sufficient:

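[a-e[-bd]] Move the minus symbol inside the brackets. Note that the sequence [-b is then read as a range from [ through b, so the minus sign itself is no longer matched.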
[a-e-b[d]] Move the b before the second open bracket, so it won't be a subtraction anymore.
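
As a rough check of the workaround, the sketch below compares the subtraction form with the second rewrite on the string "b]". It assumes a runtime where subtraction is already in effect, exercises only the second rewrite, and uses hypothetical class and method names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LiteralWorkaround
{
    // True when pattern matches anywhere in input.
    public static bool Matches(string pattern, string input)
    {
        return Regex.IsMatch(input, pattern);
    }

    public static void Main()
    {
        // Interpreted as a subtraction: only a, c, or e can match.
        Console.WriteLine(Matches("[a-e-[bd]]", "b]"));  // False

        // Rewritten so that no "-[" sequence follows the range:
        // a class of a-e, '-', b, '[', d, followed by a literal ']'.
        Console.WriteLine(Matches("[a-e-b[d]]", "b]"));  // True
        Console.WriteLine(Matches("[a-e-b[d]]", "-]"));  // True
    }
}
```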

Examples of altered behavior include:

  1. Existing behavior which would get a match, which simply changes the behavior:
    Regex r = new Regex("[a-e-[bd]]");
    Match m = r.Match("this is some string a-[bd]");
    

    Before: The expression is actually asking for 'any a through e character, dash, open square bracket, b, or d, followed immediately by a close square bracket.' The current implementation reads the character class as "[a-e-[bd]", which is why the condition 'followed immediately by a close square bracket' holds.

       In the example string used, this would return a successful Match of "d]".

    After: The expression becomes a subtraction asking for 'any a through e character, except for b or d,' which in turn means, 'any a, c, or e character.'

    In the example string used, this would return a successful Match of "e", since that is the first a, c, or e found.

  2. Existing behavior which would get a match, no longer gets a Match:
    Regex r = new Regex("[a-e-[bd]]");
    Match m = r.Match("this is so mush dud]");
    

    Before: The requested expression is exactly as above. As it turns out, the result of the Match is also the same.

       In the example string used, this would return a successful Match of "d]".

    After: The requested subtraction is as above. You will notice that the match string has changed, however, and it now has none of a, c, or e. In this situation, there IS NO MATCH. The behavior of the API has changed completely.

    In the example string used, no successful Match would be returned.
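
Both readings can be compared side by side in a short sketch. Because a runtime with subtraction no longer offers the old interpretation, the 'Before' reading is spelled out here with an explicitly escaped pattern; that escaped pattern is this sketch's assumption for illustration and is not defined by this specification. The second input is chosen so that it contains no a, c, or e. All names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BeforeAndAfter
{
    // The 'Before' reading written out explicitly: a class of a-e, '-',
    // '[', b, d, followed by a literal ']'. (Illustrative assumption.)
    public const string Before = @"[a-e\-\[bd]\]";

    // The 'After' reading: a through e, except b and d.
    public const string After = "[a-e-[bd]]";

    // Returns the first match of pattern in input, or null if none.
    public static string FirstMatch(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        string first = "this is some string a-[bd]";
        string second = "this is so mush dud]";

        Console.WriteLine(FirstMatch(Before, first));           // d]
        Console.WriteLine(FirstMatch(After, first));            // e
        Console.WriteLine(FirstMatch(Before, second));          // d]
        Console.WriteLine(FirstMatch(After, second) == null);   // True
    }
}
```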

The example matching expressions used in these scenarios are exceptionally rare, and it is believed to be an acceptable change to alter this functional behavior. The change is therefore being introduced inline to the existing pattern.

© 2014 Microsoft