Textual Domain Specific Languages for Developers - Part 2

Article
09/23/2010

[This content is no longer valid. For the latest information on "M", "Quadrant", SQL Server Modeling Services, and the Repository, see the Model Citizen blog.]

Shawn Wildermuth
Agilitrain

Updated February 2010 for PDC 2009 CTP

I will make the outlandish assumption that by this time you are convinced that building a textual domain specific language (DSL) is a useful tool in your development toolbox. But how do you go about creating a language? It is important that you avoid feeling overwhelmed. Unfortunately too many developers come to the world of DSLs assuming that they are going to build something as large and complex as a programming language like C#, VB or Ruby. The fact is that domain specific languages should be small and easy. In this article you will build a complete language that can be parsed and can be authored by a fairly non-technical person.

Designing Your First Textual Domain Specific Language

As you read in the first part of this series, a textual DSL should not only be readable by non-technical people; those same people should also be able to author it. Returning to the fictional Litware Inc., assume that you need to make it easy for daily reports to be run and e-mailed to a set of users. You could define this in XML or in a database, but that would not make it easy for the non-technical folks to define which reports are to be run and whom to send them to. This is a job for a textual DSL.

Mocking up the DSL

You can start by creating an exemplar of the language and then create a grammar to consume it. Start simple, with a small, English-like language that bundles up reports and sends them to a set of recipients. You may want to have a set of report URLs and people to send them to. Your first attempt at the language might be something as simple as this:

ReportBundle

"https://reportserver/reports/dailystatus.aspx"

"https://reportserver/reports/checklist.aspx"

SentTo "management@litware.org"

End ReportBundle

This exemplar is very simple but it does not give us the flexibility we need as it only sends reports to a single e-mail address. Instead of a loose list of URLs, let’s encapsulate them in their own section called Reports and do the same for Recipients:

ReportBundle

Reports

"https://reportserver/reports/dailystatus.aspx"

"https://reportserver/reports/checklist.aspx"

End Reports

Recipients

"management@litware.org"

End Recipients

End ReportBundle

This version of the language is closer but it seems to require a single file for every set of reports. Refactoring the exemplar a last time will help us create groups of reports:

ReportBundle

Group "Status Reports"

Reports

"https://reportserver/reports/dailystatus.aspx"

"https://reportserver/reports/checklist.aspx"

End Reports

Recipients

"management@litware.org"

End Recipients

End Group

Group "Problem Reports"

Reports

"https://reportserver/reports/problemreports.aspx"

"https://reportserver/reports/systemstatus.aspx"

End Reports

Recipients

"itstaff@litware.org"

"devmanagers@litware.org"

End Recipients

End Group

End ReportBundle

Now that you have an exemplar you can save this file (as dailyreports.bundle). This may not be the final form of your DSL, but it’s a good place to start building a grammar. This exemplar allows groups of reports to be sent to a group of recipients, and it is still fairly simple for users to add, edit, or remove entries, or even to create a new group of reports. The next step is a tool to build your grammar.

Using Intellipad to Create MGrammar

An MGrammar is simply the rules for your new language. The best environment for building a grammar is to use Oslo’s Intellipad tool. You can find Intellipad in the Oslo SDK in the Intellipad folder (typically C:\Program Files\Microsoft Oslo\1.0\bin\ipad.exe). When you launch Intellipad it looks like a slicker version of Notepad, but there is a lot of magic under the covers as we will see.

The first step in creating the grammar for our DSL is to create an MGrammar file that defines the rules. With Intellipad open, first create a module for our language called Litware.Data.Reporting as shown in Figure 1.

Figure 1: Starting our Grammar

Notice that our module starts with the keyword “module” followed by a set of curly braces, inside which the rest of our grammar rules will live. So far we are not getting any syntax coloring in Intellipad because it does not know what kind of file we are creating. We can change this by saving the file as ReportBundle.mg. The .mg extension tells Intellipad that we are creating an MGrammar file and changes the mode (in the upper right-hand part of the UI) to “DSL Grammar Mode” as shown in Figure 2.

Figure 2: MGrammar Mode

Now that we’re in DSL Grammar mode we can start building our grammar properly. Intellipad’s DSL Grammar mode allows for interactive development of your grammar. To enter this interactive mode, use the DSL menu and pick “Open File as Input and Show Output…” to open the file as an exemplar and the output of your grammar.

Using this menu option will prompt you to select an exemplar file for your grammar (e.g. dailyreports.bundle). Once you pick the exemplar it will change Intellipad to a four pane preview of your language as shown in Figure 3.

Figure 3: Intellipad's MGrammar Mode

The left-hand pane (#1) is the exemplar of your language. The middle pane (#2) is the MGrammar file you are editing. The right pane (#3) is a preview of the syntax tree that the MGrammar is going to make. In Figure 3 the preview is empty because we do not yet have a grammar to parse our exemplar. Lastly, the bottom pane (#4) is where any parsing errors will show up (with matching underlines in the first and second panes). This will be the Interactive Development Environment (IDE) for creating your MGrammar file. Now you are ready to create your grammar.

Creating a Grammar

When you think of grammar you might think back to the days in primary education where you learned about sentence structure and the rules of language. Using this as a basis you could imagine that creating a grammar is a complex task. In fact MGrammar is built to simplify this task. Unlike a spoken or written language, DSLs are meant to be simple in their construction. That simplicity means that the grammars should be simple too.

In simple terms, a grammar is the set of rules in a parser. Parsing text with a grammar produces something called a syntax tree. It is easy to overcomplicate what a syntax tree really is: think of it as a hierarchy of data that represents the intent of the author of the text that was parsed. In this case we want to end up with a syntax tree that has a bundle with groups of reports and recipients. This tree can be used in a variety of ways, including storing it (e.g. in the Repository or a simple database) or using it at runtime by iterating over the syntax tree.
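To make the idea of a syntax tree as "a hierarchy of data" concrete, here is a minimal Python sketch. Python is used purely for illustration and is not part of the Oslo toolset; the dictionary shape and the `deliveries` helper are assumptions made for this example, not the format MGrammar produces.

```python
# Hypothetical in-memory form of a parsed bundle: plain dicts and lists.
bundle = {
    "ReportBundle": [
        {
            "GroupName": "Status Reports",
            "Reports": [
                "https://reportserver/reports/dailystatus.aspx",
                "https://reportserver/reports/checklist.aspx",
            ],
            "Recipients": ["management@litware.org"],
        },
    ]
}


def deliveries(tree):
    """Yield one (recipient, report_url) pair for every report in every group."""
    for group in tree["ReportBundle"]:
        for recipient in group["Recipients"]:
            for url in group["Reports"]:
                yield recipient, url


# Iterating the tree at runtime: decide what to send to whom.
for recipient, url in deliveries(bundle):
    print(f"send {url} to {recipient}")
```

Whatever the concrete representation, the consuming code only walks groups, reports and recipients; it never needs to see the original text.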

There are several types of language elements that you will use to create your grammar, but the two most common are syntaxes and tokens. As a broad analogy you can think of tokens as being words and syntaxes as sentences.

Previously you created a module declaration as the starting point for your MGrammar file. A module is a logical container for languages (in which you will create the grammar for that language). The module defines a namespace for languages. Languages must be contained inside a module declaration so usually MGrammar files start with a module definition. Next you create a language declaration. The language declaration simply takes a name and a scope defined by curly braces (like the module did above). For example if you named your new language ReportBundle, your MGrammar file would look like so:

module Litware.Data.Reporting

{

language ReportBundle

{

}

}

The full name of a language is the module name followed by the language name, which makes this language’s full name Litware.Data.Reporting.ReportBundle. This works much like namespaces in the Common Language Runtime.

Now is where the fun begins: the actual grammar. All grammars you create will be composed of a set of rules that define what a language looks like. Each of these rules is named and is called a syntax:

language ReportBundle

{

syntax Main = "ReportBundle";

}

The syntax is defined by a name (e.g. Main) then rules are assigned to the syntax. In this case we have defined that the syntax is looking for a literal called ReportBundle. Each language defines a primary rule that is the starting point for the grammar rules. This rule is called the Main syntax. For the input to a language to be valid, it must pass the Main syntax. It’s the starting point for your grammar, though you will likely divide up the actual rules into smaller rules to compose the entire parser.

With just this Main syntax defined, you should notice that the first part of our file is now being parsed. The parser reports an error starting after ReportBundle, because we haven’t yet defined what follows it, as shown in Figure 4.

Figure 4: Parsing Errors

Notice that it is parsing the beginning of our Main syntax but then complaining because it found more data than our rule described. We can temporarily fix our rule by making it:

syntax Main = "ReportBundle" any* "End ReportBundle";

This change says that our file begins with ReportBundle, ends with End ReportBundle, and may contain anything in the middle. The “any” keyword matches any single character. You can use the standard Kleene operators (?, *, and +) to specify repetition: zero or one, zero or more, and one or more, respectively. In this case, any* indicates that any run of characters (zero or more) between ReportBundle and End ReportBundle is legal. At this point you should be able to see the tree on the right side of Intellipad attempt to build a parse tree for you as shown in Figure 5.

Figure 5: Initial Parsed Tree
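If the Kleene operators are unfamiliar, the following Python sketch shows the same three operators in regular expressions, where they carry the same meaning. This is only an analogy: a regex engine and the MGrammar parser work differently internally, and the `matches` helper is a name invented for this example.

```python
import re


def matches(pattern, text):
    """True if the whole text matches; '.' with DOTALL stands in for MGrammar's 'any'."""
    return re.fullmatch(pattern, text, flags=re.DOTALL) is not None


# '*' allows zero characters between the two literals.
print(matches(r"ReportBundle.*End ReportBundle", "ReportBundleEnd ReportBundle"))   # True
# '+' requires at least one character between them.
print(matches(r"ReportBundle.+End ReportBundle", "ReportBundleEnd ReportBundle"))   # False
# '?' allows zero or one character (here, a single space).
print(matches(r"ReportBundle.?End ReportBundle", "ReportBundle End ReportBundle"))  # True
```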

In creating a syntax you can also break a rule into multiple lines to make it more readable. For example:

syntax Main = "ReportBundle"

any*

"End ReportBundle";

The any* rule has matched everything inside the ReportBundle and is treating every character separately. It is time for your next rule. Inside each ReportBundle is a Group, so create the group rule in the same way but call the rule GroupRule like so:

syntax GroupRule = "Group"

any*

"End Group";

Then you can use the new GroupRule rule instead of the any* in your Main syntax like so:

language ReportBundle

{

syntax Main = "ReportBundle"

GroupRule*

"End ReportBundle";

syntax GroupRule = "Group"

any*

"End Group";

}

This is defining a rule that is being used in other rules. Notice that you can use the same Kleene operators on GroupRule. But Intellipad won’t parse this correctly yet. The reason is that it expects Group directly after the ReportBundle. You don’t want to have to specify whitespace so you want to have a way of specifying what is not to be considered when you parse the file. You can do this with the interleave statement:

language ReportBundle

{

syntax Main = "ReportBundle"

GroupRule*

"End ReportBundle";

syntax GroupRule = "Group"

any*

"End Group";

interleave whitespace = (" " | "\r" | "\n" | "\t")+;

}

The interleave rule matches one or more spaces, carriage returns, line feeds or tabs. The parentheses group the alternatives and the pipes (|) act as a logical OR. With the interleave in place you should see that ReportBundle and Group now give some structure to your parsed language, while the whitespace between the other elements is ignored:

Main[

"ReportBundle",

[

GroupRule[

"Group",

[

...
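Conceptually, an interleave rule tells the parser what it may silently skip between the elements it cares about. The Python sketch below imitates that behavior with a tiny hand-written scanner. It is an illustration only, not how the MGrammar runtime is implemented, and the `tokenize` function and its keyword list are assumptions for this example.

```python
import re

# Same characters as the interleave rule: space, carriage return, line feed, tab.
WHITESPACE = re.compile(r"[ \r\n\t]+")
# The literals used by the grammar so far (longer alternatives listed first).
KEYWORD = re.compile(r"End ReportBundle|End Group|ReportBundle|Group")


def tokenize(text):
    """Return the keywords found in text, skipping interleaved whitespace."""
    tokens, pos = [], 0
    while pos < len(text):
        skipped = WHITESPACE.match(text, pos)
        if skipped:                      # interleave: consume and ignore
            pos = skipped.end()
            continue
        keyword = KEYWORD.match(text, pos)
        if keyword is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(keyword.group())
        pos = keyword.end()
    return tokens


sample = "ReportBundle\r\n\tGroup\r\n\tEnd Group\r\nEnd ReportBundle\r\n"
print(tokenize(sample))
# ['ReportBundle', 'Group', 'End Group', 'End ReportBundle']
```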

If you look at the exemplar of the DSL you’ll see the next part of the language to parse is a quoted identifier that names the Group. To handle this quoted identifier, we are going to create a token. Remember, a token is simply a ‘word’ in our grammar (much like a syntax is a ‘sentence’). This token will be reused in a couple of different places in our language:

token QuotedIdentifier = '"' !('\r' | '\n' | '"')+ '"';

This token looks for anything that starts and ends with a double quote character and contains anything inside the quotes except for carriage returns, line feeds and embedded quotes. With the token created you can use it in your GroupRule to define the identifier for your Group:

language ReportBundle

{

syntax Main = "ReportBundle"

GroupRule*

"End ReportBundle";

syntax GroupRule = "Group" QuotedIdentifier

"End Group";

token QuotedIdentifier = '"' !('\r' | '\n' | '"')+ '"';

interleave whitespace = (" " | "\r" | "\n" | "\t")+;

}
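To get a feel for what the QuotedIdentifier token accepts and rejects, here is a rough regular-expression counterpart in Python. It is an approximation for experimentation only; the regex and the sample strings are assumptions for this example, and MGrammar itself is not involved.

```python
import re

# Rough counterpart of the token: an opening quote, one or more characters that
# are not a carriage return, line feed or quote, then a closing quote.
QUOTED_IDENTIFIER = re.compile(r'"[^\r\n"]+"')

samples = [
    '"Status Reports"',   # accepted
    '""',                 # rejected: '+' requires at least one character
    '"two\nlines"',       # rejected: line feed inside the quotes
    '"a"b"',              # rejected: embedded quote
]
for sample in samples:
    print(repr(sample), QUOTED_IDENTIFIER.fullmatch(sample) is not None)
```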

Next create a new syntax for the Reports section called ReportsRule that takes a list of quoted identifiers:

syntax ReportsRule = "Reports"

QuotedIdentifier+

"End Reports";

Like the GroupRule, this syntax specifies what the section starts and ends with. But instead of using a single QuotedIdentifier, it specifies one or more QuotedIdentifier statements that represent the URLs to the reports. If your language were going to limit what could be placed in a report URL you could make this a token too, but for this example just allowing anything is probably fine. Now that you have a ReportsRule, you can use it inside the GroupRule to specify a single Reports section:

syntax GroupRule = "Group" QuotedIdentifier

ReportsRule

"End Group";

You can repeat this for the Recipients as well:

syntax RecipientsRule = "Recipients"

QuotedIdentifier+

"End Recipients";

And use it after the ReportsRule in the GroupRule:

syntax GroupRule = "Group" QuotedIdentifier

ReportsRule

RecipientsRule

"End Group";

At this point you have a DSL that is successfully parsing. So far your new language looks like this:

module Litware.Data.Reporting

{

language ReportBundle

{

syntax Main = "ReportBundle"

GroupRule*

"End ReportBundle";

syntax GroupRule = "Group" QuotedIdentifier

ReportsRule

RecipientsRule

"End Group";

syntax ReportsRule = "Reports"

QuotedIdentifier+

"End Reports";

syntax RecipientsRule = "Recipients"

QuotedIdentifier+

"End Recipients";

token QuotedIdentifier = '"' !('\r' | '\n' | '"')+ '"';

interleave whitespace = (" " | "\r" | "\n" | "\t")+;

}

}
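For readers who want to see the grammar's structure as ordinary code, the following Python sketch is a hand-written recursive-descent parser with one method per syntax rule. It is an illustrative analogue under stated assumptions, not what the Oslo tools generate: the `Parser` class and its method names are invented here, tabs are treated as whitespace, and the result uses plain nested lists without the rule-name labels (Main, GroupRule, and so on) that Intellipad shows.

```python
import re

# One token: optional whitespace (the interleave), then a keyword literal or a
# QuotedIdentifier. Longer "End ..." literals are listed before shorter ones.
TOKEN = re.compile(
    r'[ \r\n\t]*(End ReportBundle|End Group|End Reports|End Recipients'
    r'|ReportBundle|Group|Reports|Recipients|"[^\r\n"]+")'
)


class Parser:
    def __init__(self, text):
        self.text, self.pos = text, 0

    def peek(self):
        """Return the next token without consuming it, or None."""
        match = TOKEN.match(self.text, self.pos)
        return match.group(1) if match else None

    def take(self, expected=None):
        """Consume the next token, optionally requiring a specific literal."""
        match = TOKEN.match(self.text, self.pos)
        if match is None or (expected is not None and match.group(1) != expected):
            raise SyntaxError(f"expected {expected or 'a token'} at offset {self.pos}")
        self.pos = match.end()
        return match.group(1)

    def quoted(self):                      # QuotedIdentifier
        token = self.take()
        if not token.startswith('"'):
            raise SyntaxError(f"expected a quoted identifier, got {token}")
        return token

    def quoted_list(self):                 # QuotedIdentifier+
        items = [self.quoted()]
        while (self.peek() or "").startswith('"'):
            items.append(self.quoted())
        return items

    def section(self, name):               # ReportsRule and RecipientsRule
        return [self.take(name), self.quoted_list(), self.take("End " + name)]

    def group(self):                       # GroupRule
        return [self.take("Group"), self.quoted(), self.section("Reports"),
                self.section("Recipients"), self.take("End Group")]

    def main(self):                        # Main
        start, groups = self.take("ReportBundle"), []
        while self.peek() == "Group":      # GroupRule*
            groups.append(self.group())
        tree = [start, groups, self.take("End ReportBundle")]
        if self.text[self.pos:].strip():
            raise SyntaxError("unexpected text after End ReportBundle")
        return tree


source = '''ReportBundle
  Group "Status Reports"
    Reports
      "https://reportserver/reports/dailystatus.aspx"
    End Reports
    Recipients
      "management@litware.org"
    End Recipients
  End Group
End ReportBundle'''

tree = Parser(source).main()
print(tree[1][0][1])   # "Status Reports" (the quotes are still part of the value)
```

Writing this by hand takes several times as many lines as the grammar, which is the practical argument for declaring the rules and letting the tooling build the parser.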

This grammar satisfies the requirement to parse the language but it would be better if we could refactor the language to provide a better experience.

Improving Your Grammar

The goal of any DSL is to be easy to author and read. Since our DSL is meant for non-technical people to author and edit, we should add some elements to the grammar to make that easier. One way to make your language easier to use is with MGrammar attributes. MGrammar supports a number of attributes. The first one you might consider is the CaseInsensitive attribute:

@{CaseInsensitive{}}

language ReportBundle

{

...

}

This attribute tells the parser to perform matches in the language without regard to case. Depending on your language, this may be useful in making it easier to author your language.
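The effect of case-insensitive matching can be illustrated with an ordinary regular expression. The Python sketch below is only an analogy for what the attribute asks the parser to do; the `is_keyword` helper is a name made up for this example.

```python
import re

KEYWORD = "End ReportBundle"


def is_keyword(text, case_insensitive):
    """Compare text against the keyword literal, optionally ignoring case."""
    flags = re.IGNORECASE if case_insensitive else 0
    return re.fullmatch(re.escape(KEYWORD), text, flags) is not None


print(is_keyword("end reportbundle", case_insensitive=False))  # False
print(is_keyword("end reportbundle", case_insensitive=True))   # True
```

For authors who are not programmers, accepting "end reportbundle" and "End ReportBundle" alike removes a common source of confusing parse errors.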

Next you might decide that having the ability to create comments will help your DSL authors. You can do this by creating a new token for comments:

token Comment = "//" !("\r" | "\n")+;

This token specifies that a comment starts with “//” and includes everything up to the next carriage return or line feed. Now you have to use the token. In this case you can just add it to the interleave statement since the comment should simply be ignored:

interleave whitespace = (" " | "\r" | "\n" | "\t")+ | Comment;

Notice that if you add a comment to the bundle file it’s ignored in the parsed tree. This certainly helps your DSL authors, but you should take it one step further.
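One detail worth noticing is that the report URLs themselves contain "//". The Python sketch below shows, by analogy, why this need not be a problem when the scanner is already inside a quoted token: comments are only skipped between tokens. The `quoted_values` function and both regexes are assumptions for this illustration and say nothing definitive about how MGrammar resolves such cases internally.

```python
import re

# Interleave: runs of whitespace or a // comment that extends to the end of the line.
SKIP = re.compile(r"(?:[ \r\n\t]+|//[^\r\n]+)+")
QUOTED = re.compile(r'"[^\r\n"]+"')


def quoted_values(text):
    """Collect quoted identifiers, ignoring whitespace and comments between them."""
    values, pos = [], 0
    while pos < len(text):
        skipped = SKIP.match(text, pos)
        if skipped:
            pos = skipped.end()
            continue
        token = QUOTED.match(text, pos)
        if token is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        values.append(token.group())
        pos = token.end()
    return values


sample = (
    "// nightly status\n"
    '"https://reportserver/reports/dailystatus.aspx" // sent every morning\n'
    '"management@litware.org"'
)
print(quoted_values(sample))
# ['"https://reportserver/reports/dailystatus.aspx"', '"management@litware.org"']
```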

MGrammar supports another attribute which can be used to create hints about different parts of your language. This is primarily used to help provide syntax coloring for text editors, similar to Visual Studio’s code coloring. In MGrammar you can use the Classification attribute to tell the editor how to color that part of the language. For example you can add the Classification attribute to the Comment token:

@{Classification["Comment"]}

token Comment = "//" !("\r" | "\n")+;

Because Intellipad looks for these attributes at runtime, any comments it finds will be colored like a comment as shown in Figure 6:

Figure 6: Classification Attribute in Action

You can use the Classification attribute on language elements as well to color them in Intellipad. The Classification attribute is applied to tokens (like you saw on the Comment token) so you need to refactor some of the elements of your grammar to use tokens instead of inline text. For example you can create two new tokens for the ReportBundle and End ReportBundle language elements:

token ReportBundleToken = "ReportBundle";

token EndReportBundleToken = "End ReportBundle";

If you replace the text in the Main syntax with these tokens, you will see that the language still parses:

syntax Main = ReportBundleToken

GroupRule*

EndReportBundleToken;

Now you can add the Classification attribute specifying the type of classification (e.g. Keyword) to the new tokens:

@{Classification["Keyword"]}

token ReportBundleToken = "ReportBundle";

@{Classification["Keyword"]}

token EndReportBundleToken = "End ReportBundle";

This causes Intellipad to color the tokens as keywords:

Figure 7: Classified as Keywords

In Intellipad’s Settings folder is a file called ClassificationTypes.xcml. This file defines the classification types that Intellipad uses, including:

Keyword
Identifier
Whitespace
Comment
Operator
Delimiter
Literal
String
Number
Unknown

You have probably noticed that the coloring in your copy of Intellipad does not look exactly like the screenshots. Intellipad lets you modify how syntax coloring is applied through a file called ClassificationFormats.xcml, also located in the Settings folder of Intellipad. The source code for this article includes an example of this file.

With these classification types known, you can refactor the language to add some more interesting syntax coloring by creating new tokens and adding classifications. For example:

module Litware.Data.Reporting

{

@{CaseInsensitive}

language ReportBundle

{

syntax Main = ReportBundleToken

GroupRule*

EndReportBundleToken;

syntax GroupRule = GroupToken QuotedIdentifier

ReportsRule

RecipientsRule

EndGroupToken;

syntax ReportsRule = ReportsToken

QuotedIdentifier+

EndReportsToken;

syntax RecipientsRule = RecipientsToken

QuotedIdentifier+

EndRecipientsToken;

@{Classification["String"]}

token QuotedIdentifier = '"' !('\r' | '\n' | '"')+ '"';

@{Classification["Keyword"]}

token ReportBundleToken = "ReportBundle";

@{Classification["Keyword"]}

token EndReportBundleToken = "End ReportBundle";

@{Classification["Keyword"]}

token GroupToken = "Group";

@{Classification["Keyword"]}

token EndGroupToken = "End Group";

@{Classification["Keyword"]}

token ReportsToken = "Reports";

@{Classification["Keyword"]}

token EndReportsToken = "End Reports";

@{Classification["Keyword"]}

token RecipientsToken = "Recipients";

@{Classification["Keyword"]}

token EndRecipientsToken = "End Recipients";

@{Classification["Comment"]}

token Comment = "//" !("\r" | "\n")+;

interleave whitespace = (" " | "\r" | "\n" | "\t")+ | Comment;

}

}

With all your classifications in place the language now is displayed in full color in Intellipad as shown in Figure 8:

Figure 8: Syntax Coloring
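What an editor does with classifications can be pictured as pairing each recognized token with a classification name and letting a separate formatting table decide the color. The Python sketch below is a toy version of that idea. The `classify` function, its rule table and the fallback to "Unknown" are assumptions for illustration and do not describe Intellipad's actual implementation.

```python
import re

# Each entry pairs a classification name with the pattern of the tokens that carry it.
RULES = [
    ("Comment", re.compile(r"//[^\r\n]+")),
    ("String", re.compile(r'"[^\r\n"]+"')),
    ("Keyword", re.compile(
        r"End (?:ReportBundle|Group|Reports|Recipients)"
        r"|ReportBundle|Group|Reports|Recipients", re.IGNORECASE)),
    ("Whitespace", re.compile(r"[ \r\n\t]+")),
]


def classify(text):
    """Split text into (classification, text) pairs, trying RULES in order."""
    spans, pos = [], 0
    while pos < len(text):
        for name, pattern in RULES:
            match = pattern.match(text, pos)
            if match:
                spans.append((name, match.group()))
                pos = match.end()
                break
        else:                               # no rule matched this character
            spans.append(("Unknown", text[pos]))
            pos += 1
    return spans


for name, piece in classify('Group "Status Reports" // daily'):
    print(f"{name:<10} {piece!r}")
```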

You have now improved your language to help its authors. Next, you can improve the syntax tree to help the developers.

Productions

At this point your language is parsing and producing a syntax tree. The form of the syntax tree is based on default rules about how an MGrammar becomes a syntax tree. For our grammar, MGrammar produces the following syntax tree:

Main[

"ReportBundle",

[

GroupRule[

"Group",

"\"Status Reports\"",

ReportsRule[

"Reports",

[

"\"https://reportserver/reports/dailystatus.aspx\"",

"\"https://reportserver/reports/checklist.aspx\""

],

"End Reports"

],

RecipientsRule[

"Recipients",

[

"\"management@litware.org\""

],

"End Recipients"

],

"End Group"

],

GroupRule[

"Group",

"\"Problem Reports\"",

ReportsRule[

"Reports",

[

"\"https://reportserver/reports/problemreports.aspx\"",

"\"https://reportserver/reports/systemstatus.aspx\""

],

"End Reports"

],

RecipientsRule[

"Recipients",

[

"\"itstaff@litware.org\"",

"\"devmanagers@litware.org\""

],

"End Recipients"

],

"End Group"

]

],

"End ReportBundle"

]

This default syntax tree is not MGraph. It looks like MGraph, but it is only MGraph-like: it represents the structure of the parsed language. The mgx.exe tool (included in the Oslo SDK) can take the MGrammar and a source file and produce an MGraph file, but that should not be confused with the syntax tree. As I mentioned earlier, you could store the syntax tree in a data store (like the Repository), and creating an MGraph file would make this process much easier. The third part of this series delves deeper into how to use the toolset (including mgx.exe) to work with syntax trees both at runtime and stored in data stores. For example, the tree you can see in the tree view becomes a more verbose format when converted to M:

module dailyreports {

Main {

[0] = "ReportBundle",

[1] {

{

[0] {

GroupRule {

[0] = "Group",

[1] = "\"Status Reports\"",

[2] {

ReportsRule {

[0] = "Reports",

[1] {

{

[0] = "\"https://reportserver/reports/dailystatus.aspx\"",

[1] = "\"https://reportserver/reports/checklist.aspx\""

}

[2] = "End Reports"

}

[3] {

RecipientsRule {

[0] = "Recipients",

[1] {

{

[0] = "\"management@litware.org\""

}

[2] = "End Recipients"

}

[4] = "End Group"

}

[1] {

GroupRule {

[0] = "Group",

[1] = "\"Problem Reports\"",

[2] {

ReportsRule {

[0] = "Reports",

[1] {

{

[0] = "\"https://reportserver/reports/problemreports.aspx\"",

[1] = "\"https://reportserver/reports/systemstatus.aspx\""

}

[2] = "End Reports"

}

[3] {

RecipientsRule {

[0] = "Recipients",

[1] {

{

[0] = "\"itstaff@litware.org\"",

[1] = "\"devmanagers@litware.org\""

}

[2] = "End Recipients"

}

[4] = "End Group"

}

[2] = "End ReportBundle"

}

In the syntax tree there are two kinds of collection delimiters: curly braces ({ }) and brackets ([ ]). Brackets indicate an ordered collection and curly braces indicate an unordered collection. This is why the M output carries explicit indexes (e.g. [0], [1]): they record the position of each item in an ordered collection. This is unlikely to be the format we actually want to work with.

You should also notice that there is a lot of extraneous information. The syntax/token name (e.g. Main) is listed for every element, as is the textual version of the syntax/token. The QuotedIdentifier values also still carry their surrounding quotes, which would be nice to remove. What you will really want to do in most cases is change the form of the tree. Changing the form is done using productions.

Productions are simply a way to tell the MGrammar that you want to specify how the tree is produced. In productions you add identifiers to the language elements to be used in an output pattern. For example, we can change the Main syntax as shown:

syntax Main = ReportBundleToken

g:GroupRule*

EndReportBundleToken

=> ReportBundle { g };

The first change is to add an identifier to the GroupRule by adding g:. This gives an identifier that can be used in the production. Next, the production operation (=>) is added and is immediately followed by the pattern we want to use instead of the default tree structure. This example changes the structure to be just ReportBundle with the contents of the group within a set of curly braces. This changes the top of the syntax tree to simply:

ReportBundle{

[

GroupRule[

...

],

GroupRule[

...

]

]

}

The doubly nested collection is probably not useful, so you can modify the production to place the GroupRules directly inside the ReportBundle by using the valuesof operator. The valuesof operator returns the items in a collection so that they are placed directly in the enclosing collection rather than nested one level down. Changing the production to use the valuesof operator looks like this:

syntax Main = ReportBundleToken

g:GroupRule*

EndReportBundleToken

=> ReportBundle { valuesof(g) };

This results in the elimination of the extra set of brackets:

ReportBundle{

GroupRule[

...

],

GroupRule[

...

]

}
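The difference valuesof makes is the same as the difference between nesting a list and splicing its items into the parent. The Python sketch below shows that distinction with plain lists. The `values_of` helper is a stand-in written for this example and is not the M operator itself.

```python
def values_of(collection):
    """Return the items of a collection so they can be spliced into a parent."""
    return list(collection)


# What the label g is bound to: a collection holding two parsed groups.
groups = [["GroupRule", "Status Reports"], ["GroupRule", "Problem Reports"]]

nested = ("ReportBundle", [groups])                  # like ReportBundle { g }
flattened = ("ReportBundle", [*values_of(groups)])   # like ReportBundle { valuesof(g) }

print(len(nested[1]), len(flattened[1]))  # 1 2
```

In the nested form a consumer has to step through an extra collection before reaching the groups; in the flattened form the groups are the direct children.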

You can specify a production per language element (usually a syntax or a token). For example, if you create a production for the GroupRule syntax, that production determines how each GroupRule appears wherever the Main production includes it. Adding a production to the GroupRule looks like this:

syntax GroupRule = GroupToken name:QuotedIdentifier

rpt:ReportsRule

rec:RecipientsRule

EndGroupToken

=> Group {

GroupName { name },

rpt,

rec

};

In this example the production simplifies the syntax tree by creating a Group and, inside it, a GroupName section for the name specified after the Group token. It then simply places the Reports and Recipients after the group name. This production is used by the ReportBundle production, so we get a syntax tree like so:

ReportBundle{

Group{

GroupName{

"\"Status Reports\""

ReportsRule[

"Reports",

[

"\"https://reportserver/reports/dailystatus.aspx\"",

"\"https://reportserver/reports/checklist.aspx\""

"End Reports"

RecipientsRule[

"Recipients",

[

"\"management@litware.org\""

"End Recipients"

]

...

}

Another problem in the syntax tree is that our QuotedIdentifiers still have the embedded quotes in the strings. We can solve this by creating a production for that token like so:

token QuotedIdentifier = '"' i:(^('\r' | '\n' | '"')+) '"' => i;

You can see here that we have surrounded the inner part of the quoted identifier with parentheses and then added a label. The production uses just the labeled part of the token, which strips off the extra quotes and leaves our syntax tree like so:

ReportBundle{

Group{

GroupName{

"Status Reports"

ReportsRule[

"Reports",

[

"https://reportserver/reports/dailystatus.aspx",

"https://reportserver/reports/checklist.aspx"

"End Reports"

RecipientsRule[

"Recipients",

[

"management@litware.org"

"End Recipients"

]
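Labeling the inner part of a token and emitting only that part is closely analogous to a capture group in a regular expression. The Python sketch below shows the same idea. The `strip_quotes` function is a hypothetical helper for this illustration.

```python
import re

# The parentheses play the role of the label i: they mark the part to keep.
QUOTED_IDENTIFIER = re.compile(r'"([^\r\n"]+)"')


def strip_quotes(token_text):
    """Return the text between the quotes, or raise if this is not a quoted identifier."""
    match = QUOTED_IDENTIFIER.fullmatch(token_text)
    if match is None:
        raise ValueError(f"not a quoted identifier: {token_text!r}")
    return match.group(1)


print(strip_quotes('"Status Reports"'))  # Status Reports
```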

The ReportsRule and RecipientsRule output is still not pretty enough, so you can add productions using the valuesof operator like so:

syntax ReportsRule = ReportsToken

r:QuotedIdentifier+

EndReportsToken

=> Reports { valuesof(r) };

syntax RecipientsRule = RecipientsToken

r:QuotedIdentifier+

EndRecipientsToken

=> Recipients { valuesof(r) };

This leaves us with a much more concise syntax tree:

ReportBundle{

Group{

GroupName{

"Status Reports"

},

Reports{

"https://reportserver/reports/dailystatus.aspx",

"https://reportserver/reports/checklist.aspx"

},

Recipients{

"management@litware.org"

}

},

Group{

GroupName{

"Problem Reports"

},

Reports{

"https://reportserver/reports/problemreports.aspx",

"https://reportserver/reports/systemstatus.aspx"

},

Recipients{

"itstaff@litware.org",

"devmanagers@litware.org"

}

}

}
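Taken together, the productions amount to a reshaping step from the default tree to the concise one. The Python sketch below performs the equivalent reshaping on plain nested lists. It is an illustration under assumptions: the list layout of `default_tree` and the `produce` function are invented here and are not the data structures the M runtime exposes.

```python
# A default-style tree for one group, written as nested lists for illustration.
default_tree = [
    "ReportBundle",
    [[
        "Group",
        '"Status Reports"',
        ["Reports", ['"https://reportserver/reports/dailystatus.aspx"'], "End Reports"],
        ["Recipients", ['"management@litware.org"'], "End Recipients"],
        "End Group",
    ]],
    "End ReportBundle",
]


def produce(tree):
    """Drop keyword literals, strip quotes, and name the remaining parts."""
    _, groups, _ = tree
    return {
        "ReportBundle": [
            {
                "Group": {
                    "GroupName": name.strip('"'),
                    "Reports": [url.strip('"') for url in reports[1]],
                    "Recipients": [address.strip('"') for address in recipients[1]],
                }
            }
            for _, name, reports, recipients, _ in groups
        ]
    }


concise = produce(default_tree)
print(concise["ReportBundle"][0]["Group"]["GroupName"])  # Status Reports
```

The consuming code now reads names such as Reports and Recipients instead of positional indexes, which is the practical benefit the productions provide.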

Where we are…

Looking back at the original exemplar of your DSL, you should be able to see that building a small, narrowly scoped language does not need to be difficult. It’s also straightforward to add value with features like syntax highlighting and case insensitivity. Finally, customizing the resulting syntax tree with productions helps you create the data that you will ultimately use in a running system (as we will see in Part 3).

Now that you have a working textual DSL, you are ready to use the language. In the next part of this series you will consume the syntax tree at runtime.

Resources

Microsoft’s Data DevCenter
My Blog
