token

05/26/2009

[This is prerelease documentation and is subject to change in future releases. Blank topics are included as placeholders.]

Token rules are the first rules applied to the input character stream. They produce a sequence of named tokens that is subsequently processed by interleave and syntax rules.

A token rule must have a name and one or more productions. Each production consists of one or more terms.

The rule can optionally specify rule parameters and the final keyword modifier. Attributes can also be applied to the rule.

Each term can also specify attributes.

[[Optional Attributes]]  [[optional  final]]  token  RuleName [[Optional Parameters]] = Productions ;

Productions : 
    Production  [[OR]] 
    Productions | Production

Production : Terms

Terms :
    Term [[OR]] 
    Terms Term

Term :
    [[Optional Attributes]] TextPattern

TextPattern :
    TextLiteral [[OR]]
    CharacterRange [[OR]]
    Character + Kleene Operator [[OR]]
    Token Rule Reference [[OR]]
    In-line rules [[OR]]
    any

Rule Name

RuleName is any valid “M” identifier.

Productions

A rule contains one or more productions, separated by the "or" (|) operator. Each production consists of one or more terms.

Terms

A term can consist of one of the following:

A reference to a token rule.
A text literal.
A range of characters.
An in-line rule, which is a rule with a range operator applied.
Characters with Kleene operators applied.
The keyword any, which is a wildcard that matches any single character.

These terms can be combined into expressions using the difference, intersection, and inverse set operators.
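
As a minimal sketch, the term kinds listed above might appear in token rules like the following. The rule names are hypothetical, and the rules are assumed to sit inside a language block. The parenthesized group used for the in-line rule is an assumption about the syntax, not something this topic states.

```
// Hypothetical rules, for illustration only.
token HexPrefix = "0x";                        // text literal
token Digit = "0".."9";                        // range of characters
token Letter = "a".."z" | "A".."Z";            // two productions, each a range
token Digits = Digit+;                         // token rule reference with a Kleene operator
token Identifier = Letter (Letter | Digit)*;   // in-line rule (assumed parenthesized form) with a Kleene operator
token AnyChar = any;                           // wildcard: any single character
```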

Remarks

The difference between token and syntax rules is when they are processed. Token rules are processed first against the raw character stream to produce a sequence of named tokens. The “M” processor then processes the language’s syntax rules against the token stream to determine whether the input is valid and optionally to produce structured data as output.

A token rule contains one or more productions, and each production consists of one or more terms. A term can reference another token rule, which allows a hierarchical tree structure to be specified. Tokens tend to occur in the leaves and lower nodes of a tree and can consist of text literals, character ranges, or token rule references.

Token rules are applied to the input before interleave or syntax rules. Interleave rules filter out certain characters (such as line feeds and spaces) before the syntax rules examine the input stream. Because token rules are applied before interleave rules, a token rule can override interleave behavior if necessary.

Because token rules are applied first, they cannot reference any other kind of rule.

If an input text value conforms to more than one production in a token rule and one of the productions is marked with the final keyword modifier, then that production is used.
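
A sketch of how this might look, assuming the final modifier is written before the token keyword as in the syntax summary at the top of this topic. The rule names Letter, Word, and PublicKeyword are hypothetical, and the modifier is shown on the rule as a whole.

```
// Hypothetical rules, for illustration only.
token Letter = "a".."z" | "A".."Z";
token Word = Letter+;

// The input text "public" conforms to both Word and PublicKeyword.
// Marking PublicKeyword as final is intended to make it the match that is used.
final token PublicKeyword = "public";
```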

Unlike syntax rules, token rules can be negated, intersected, and subtracted by using the inverse, intersection, and difference operators.
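
The sketch below illustrates the three set operations. This topic does not show the operator symbols, so the use of - for difference, & for intersection, and ^ for inverse is an assumption that should be checked against the language specification. The rule names are hypothetical.

```
// Hypothetical rules; the operator symbols are assumed, not taken from this topic.
token Letter = "a".."z" | "A".."Z";
token LowerVowel = "a" | "e" | "i" | "o" | "u";
token HexDigit = "0".."9" | "a".."f" | "A".."F";

token NotLowerVowel = Letter - LowerVowel;   // difference: letters other than a, e, i, o, u
token HexLetter = Letter & HexDigit;         // intersection: a..f and A..F
token NotQuote = ^("\u0022");                // inverse: any character except the double quote
```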

Example

The following example recognizes lines of text that contain a type name and an access type, such as:

Name=System.String Access=public 
Name=System.Integer32 Access=private

In the grammar for this example, many of the syntax rules have terms that are rule references. The Types rule, which refers to itself, is a common pattern for expressing one or more occurrences of something.

Each of the field names is a token, as is the restricted set of Access values. Other token rules define the allowable characters in a type name and the characters used by the interleave rule.

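module Types 
{
   language Parser
   {
        // Syntax rules: applied to the token stream.
        syntax Main = Types;
        // One or more Type entries.
        syntax Types = Type
                     | Types Type;
        syntax Type = Name Access;
        syntax Name = NameLit NameValue;
        syntax NameValue = chs;
        syntax Access = AccessLit AccessValue;
        
        // Interleave rule: whitespace filtered out before the syntax rules run.
        interleave Whitespace = Space | LF | CR;
        
        // Token rules: field names and the restricted set of access values.
        token NameLit = "Name=";
        token AccessLit = "Access=";
        token AccessValue = "public" | "private" | "internal" | "protected";
        
        // Characters allowed in a type name.
        token Char =  "A".."Z" 
                    | "a".."z" 
                    | "0".."9" 
                    | ".";
        token chs = Char+;
        // Characters used by the interleave rule.
        token Space = "\u0020";
        token LF = "\u000A";
        token CR = "\u000D";

   } 
}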
