Shawn Wildermuth
MCW Technologies
First Published: March 2009
Updated June 2009 for May CTP
I will make the outlandish assumption that by this time you
are convinced that building a textual domain specific language (DSL) is a
useful tool in your development toolbox. But how do you go about creating a
language? It is important that you avoid feeling overwhelmed. Unfortunately too
many developers come to the world of DSLs assuming that they are going to build
something as large and complex as a programming language like C#, VB or Ruby.
The fact is that domain specific languages should be small and easy. In this
article you will build a complete language that can be parsed and can be
authored by a fairly non-technical person.
Designing Your First Textual Domain Specific Language
As you read in the first part of this
series, a textual DSL should be not only readable by non-technical people
but those same people should also be able to author the textual DSL. Returning
to the fictional Litware Inc., assume that you need to make it easy for daily
reports to be run and e-mailed to a set of users. You could define this in XML
or in a database, but that would not make it easy for the non-technical
folks to define what reports are to be run and whom to send them to. This is a
job for a textual DSL.
Mocking up the DSL
You can start by creating an exemplar of the language and then
create a grammar to consume it. You should start simple with a small,
English-like language that bundles up reports and sends them to a set of recipients. You
may want to have a set of report URLs and people to send them to. Your first
attempt at the language might be something as simple as this:
ReportBundle
"http://reportserver/reports/dailystatus.aspx"
"http://reportserver/reports/checklist.aspx"
SentTo "management@litware.org"
End ReportBundle
This exemplar is very simple but it does not give us the
flexibility we need as it only sends reports to a single e-mail address. Instead
of a loose list of URLs, let’s encapsulate them in their own section called Reports and do the same for Recipients:
ReportBundle
Reports
"http://reportserver/reports/dailystatus.aspx"
"http://reportserver/reports/checklist.aspx"
End Reports
Recipients
"management@litware.org"
End Recipients
End ReportBundle
This version of the language is closer but it seems to
require a single file for every set of reports. Refactoring the exemplar a last
time will help us create groups of reports:
ReportBundle
Group "Status Reports"
Reports
"http://reportserver/reports/dailystatus.aspx"
"http://reportserver/reports/checklist.aspx"
End Reports
Recipients
"management@litware.org"
End Recipients
End Group
Group "Problem Reports"
Reports
"http://reportserver/reports/problemreports.aspx"
"http://reportserver/reports/systemstatus.aspx"
End Reports
Recipients
"itstaff@litware.org"
"devmanagers@litware.org"
End Recipients
End Group
End ReportBundle
Now that you have an exemplar you can save this file (as dailyreports.bundle). This
may not be the final form of your DSL’s exemplar but it’s a good place to start building
a grammar. This exemplar allows groups of reports to be sent to a group of
recipients, and it is still fairly simple for users to add, edit, or remove reports or even
create a new group of reports. With the exemplar in hand, the next step is a tool to build
your grammar.
Using Intellipad to Create MGrammar
An MGrammar is simply the set of rules for your new language. The
best environment for building a grammar is Oslo’s Intellipad tool. You
can find Intellipad in the Oslo SDK in the Intellipad folder (typically C:\Program Files\Microsoft Oslo\1.0\bin\ipad.exe).
When you launch Intellipad it looks like a slicker version of Notepad, but
there is a lot of magic under the covers as we will see.
The first step in creating the grammar for our DSL is to
create an MGrammar file that defines the rules. With Intellipad open, first create a
module for our language called Litware.Data.Reporting
as shown in Figure 1.
Figure 1: Starting our Grammar
Notice that our module starts with the keyword “module”
followed by a set of curly braces, inside which the rest of our grammar rules
will live. So far we are not getting any syntax coloring in Intellipad and
that’s because it does not know what kind of file we are creating. We can
change this by saving the file as ReportBundle.mg.
The .mg tells Intellipad that we are creating an MGrammar file and changes the
mode (in the upper right hand part of the UI) to “DSL Grammar Mode” as shown in
Figure 2.
Figure 2: MGrammar Mode
Now that we’re in DSL Grammar mode we can start building
our grammar properly. Intellipad’s DSL Grammar mode allows for interactive
development of your grammar. To enter this interactive mode, hit Ctrl-Shift-T
to open an exemplar for your grammar.
Using this key combination will prompt you to select an
exemplar file for your grammar (e.g. dailyreports.bundle).
Once you pick the exemplar it will change Intellipad to a four pane preview of
your language as shown in Figure 3.
Figure 3: Intellipad's MGrammar Mode
The left-hand pane (#1) is the exemplar of your language. The
middle pane (#2) is the MGrammar file you are editing. The right pane (#3) is a
preview of the syntax tree that the MGrammar is going to produce. In Figure 3 the
preview is empty because we do not yet have a grammar to parse our exemplar.
Lastly, the bottom pane (#4) is where any parsing errors will show up (with
matching underlines in the first and second panes). This will be the
Interactive Development Environment (IDE) for creating your MGrammar file. Now
you are ready to create your grammar.
Creating a Grammar
When you think of grammar you might think back to the days in
primary education where you learned about sentence structure and the rules of
language. Using this as a basis you could imagine that creating a grammar is a
complex task. In fact MGrammar is built to simplify this task. Unlike a spoken
or written language, DSLs are meant to be simple in their construction. That
simplicity means that the grammars should be simple too.
In simple terms, a grammar is the set of rules that a parser follows.
Parsing text using a grammar is meant to end up with something called a syntax
tree. It is easy to overcomplicate what a syntax tree really is. Think of
the syntax tree as a hierarchy of data that represents the intent of
the author of the text that was parsed. In this case we want to end up with a
syntax tree that has a bundle with groups of reports and recipients. This tree
can be used in a variety of ways including storing it (e.g. in the Repository
or a simple database) or using it at runtime by iterating the syntax tree.
There are several types of language elements that you will
use to create your grammar, but the two most common are syntaxes and tokens. As
a broad analogy you can think of tokens
as being words and syntaxes
as sentences.
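As a rough illustration of the words-versus-sentences analogy, here is a small Python sketch (not MGrammar; the names tokenize and is_recipients_section are hypothetical and exist only for this example). The first function splits text into tokens, and the second checks that those tokens arrive in the order a syntax rule would expect:

```python
import re

# Illustration only: MGrammar does this work for you.
# A "token" is a word: either a quoted string or a run of non-space characters.
TOKEN = re.compile(r'"[^"\r\n]+"|\S+')


def tokenize(text):
    """Split the input into word-like tokens, ignoring whitespace."""
    return TOKEN.findall(text)


def is_recipients_section(tokens):
    """A "syntax" is a sentence: Recipients, quoted strings, End Recipients."""
    if len(tokens) < 4:
        return False
    body = tokens[1:-2]
    return (tokens[0] == "Recipients"
            and tokens[-2:] == ["End", "Recipients"]
            and all(len(t) > 2 and t.startswith('"') and t.endswith('"')
                    for t in body))


words = tokenize('Recipients\n  "management@litware.org"\nEnd Recipients')
print(words)
print(is_recipients_section(words))
```

The tokens carry no meaning on their own; it is the rule that arranges them into something the rest of a system can act on.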
Previously you created a module declaration as the starting
point for your MGrammar file. A module is a logical container for languages (in
which you will create the grammar for that language). The module defines a
namespace for languages. Languages must be contained inside a module
declaration so usually MGrammar files start with a module definition. Next you
create a language declaration. The language declaration simply takes a name and
a scope defined by curly braces (like the module did above). For example if you
named your new language ReportBundle,
your MGrammar file would look like so:
module Litware.Data.Reporting
{
language ReportBundle
{
}
}
The name of this language is actually the name of the module
then the language. That would make this language’s full name Litware.Data.Reporting.ReportBundle.
This works semantically similar to namespaces in the Common Language Runtime.
Now is where the fun begins: the actual grammar. All grammars
you create will be composed of a set of rules that define what a language looks
like. Each of these rules is named and is called a syntax:
language ReportBundle
{
syntax Main = "ReportBundle";
}
The syntax is defined by a name (e.g. Main) then rules are
assigned to the syntax. In this case we have defined that the syntax is looking
for a literal called ReportBundle.
Each language defines a primary rule that is the starting point for the grammar
rules. This rule is called the Main syntax. For the input to a language to be
valid, it must pass the Main syntax. It’s the starting point for your grammar,
though you will likely divide up the actual rules into smaller rules to compose
the entire parser.
With just this Main syntax defined you should notice that the
first part of our file is now being parsed and it is reporting an error
starting after ReportBundle
as we haven’t defined what happens after ReportBundle
as shown in Figure 4.
Figure 4: Parsing Errors
Notice that it is parsing the beginning of our Main syntax
but then complaining because it found more data than our rule described. We can
temporarily fix our rule by making it:
syntax Main = "ReportBundle" any* "End ReportBundle";
This change says that our file begins with ReportBundle, ends with End ReportBundle, and
may contain anything in the middle. The “any” keyword matches any single
character. You can use the standard Kleene operators (e.g., ?, *, and +) to
specify repetition. In this case, the any*
indicates that any characters (zero or more) that appear between ReportBundle and End ReportBundle are legal. At
this point you should be able to see the right side of Intellipad
attempt to build a parse tree for you as shown in Figure 5.
Figure 5: Initial Parsed Tree
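If you know regular expressions, the Kleene operators behave in much the same way there. The Python sketch below is only an analogy (MGrammar is not a regular expression engine, and its matching rules differ in detail): a dot with the DOTALL flag stands in for any, and ?, * and + mean optional, zero or more, and one or more. The names MAIN and matches_main are made up for this example.

```python
import re

# Analogy only: "." with re.DOTALL stands in for MGrammar's "any".
MAIN = re.compile(r"ReportBundle.*End ReportBundle", re.DOTALL)


def matches_main(text):
    """True when the whole text starts with ReportBundle and ends with End ReportBundle."""
    return MAIN.fullmatch(text) is not None


print(matches_main("ReportBundle\n  anything at all\nEnd ReportBundle"))
print(matches_main("ReportBundleEnd ReportBundle"))  # * also accepts zero characters
print(matches_main("ReportBundle\n  no closing keyword"))
```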
In creating a syntax you can also break a rule into multiple
lines to make it more readable. For example:
syntax Main = "ReportBundle"
any*
"End ReportBundle";
The any*
rule has matched everything inside the ReportBundle
and is treating every character separately. It is time for your next rule.
Inside each ReportBundle
is a Group, so create the group rule in the same way but call the rule GroupRule like so:
syntax GroupRule = "Group"
any*
"End Group";
Then you can use the new GroupRule
rule instead of the any*
in your Main syntax
like so:
language ReportBundle
{
syntax Main = "ReportBundle"
GroupRule*
"End ReportBundle";
syntax GroupRule = "Group"
any*
"End Group";
}
This is defining a rule that is being used in other rules.
Notice that you can use the same Kleene operators on GroupRule. But Intellipad won’t parse this
correctly yet. The reason is that it expects Group
directly after the ReportBundle.
You don’t want to have to specify whitespace so you want to have a way of
specifying what is not to be considered when you parse the file. You can do
this with the interleave
statement:
language ReportBundle
{
syntax Main = "ReportBundle"
GroupRule*
"End ReportBundle";
syntax GroupRule = "Group"
any*
"End Group";
interleave whitespace = (" " | "\r" | "\n" | "\t")+;
}
The interleave
looks for one or more spaces, carriage returns, line feeds or tabs. The parentheses
wrap the group and the pipes (|) are used as a logical OR. With interleave in place you
should see that ReportBundle
and Group are now
creating some structure to your parsed language while ignoring the whitespace
between the other elements:
Main[
"ReportBundle",
[
GroupRule[
"Group",
[
...
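To make the idea of interleaving concrete, here is a minimal Python sketch of what a hand-written matcher would have to do without it. It is an analogy under stated assumptions rather than how MGrammar is implemented, and match_sequence is a hypothetical helper: before each expected literal, it skips anything the interleave rule describes.

```python
import re

# Rough counterpart of: interleave whitespace = (" " | "\r" | "\n" | "\t")+;
WHITESPACE = re.compile(r"[ \r\n\t]+")


def match_sequence(text, literals):
    """Match the literals in order, skipping interleaved whitespace around each one."""
    pos = 0
    for literal in literals:
        gap = WHITESPACE.match(text, pos)
        if gap:
            pos = gap.end()
        if not text.startswith(literal, pos):
            return False
        pos += len(literal)
    # Trailing whitespace is ignored as well.
    gap = WHITESPACE.match(text, pos)
    if gap:
        pos = gap.end()
    return pos == len(text)


sample = "ReportBundle\n  Group\n  End Group\nEnd ReportBundle"
print(match_sequence(sample, ["ReportBundle", "Group", "End Group", "End ReportBundle"]))
```

Declaring the ignorable text once keeps every other rule free of whitespace handling, which is the point of the interleave statement.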
If you look at the exemplar of the DSL you’ll see the next
part of the language to parse is a quoted identifier that names the Group. To handle this quoted
identifier, we are going to create a token.
Remember, a token is simply a ‘word’ in our grammar (much like a syntax is a
‘sentence’). This token will be reused in a couple of different places in our
language:
token QuotedIdentifier = '"' ^('\r' | '\n' | '"')+ '"';
This token looks for anything that starts and ends with a
double quote character and contains anything inside the quotes except for
carriage returns, line feeds and embedded quotes. With the token created you
can use it in your GroupRule
to define the identifier for your Group:
language ReportBundle
{
syntax Main = "ReportBundle"
GroupRule*
"End ReportBundle";
syntax GroupRule = "Group" QuotedIdentifier
"End Group";
token QuotedIdentifier = '"' ^('\r' | '\n' | '"')+ '"';
interleave whitespace = (" " | "\r" | "\n" | "\t")+;
}
Next create a new syntax for the Reports section called ReportsRule that takes a list of quoted identifiers:
syntax ReportsRule = "Reports"
QuotedIdentifier+
"End Reports";
Like the GroupRule,
this syntax specifies what the section starts and ends with. But instead of
using a single QuotedIdentifier,
it specifies one or more QuotedIdentifier tokens
that represent the URLs to the reports. If your language was going to limit
what could be placed in a report URL you could make this a token too, but for
this example just allowing anything is probably fine. Now that you have a ReportsRule, you can use it
inside the GroupRule to
specify a single Reports
section:
syntax GroupRule = "Group" QuotedIdentifier
ReportsRule
"End Group";
You can repeat this for the Recipients
as well:
syntax RecipientsRule = "Recipients"
QuotedIdentifier+
"End Recipients";
And use it after the ReportsRule
in the GroupRule:
syntax GroupRule = "Group" QuotedIdentifier
ReportsRule
RecipientsRule
"End Group";
At this point you have a DSL that is successfully parsing. So
far your new language looks like this:
module Litware.Data.Reporting
{
language ReportBundle
{
syntax Main = "ReportBundle"
GroupRule*
"End ReportBundle";
syntax GroupRule = "Group" QuotedIdentifier
ReportsRule
RecipientsRule
"End Group";
syntax ReportsRule = "Reports"
QuotedIdentifier+
"End Reports";
syntax RecipientsRule = "Recipients"
QuotedIdentifier+
"End Recipients";
token QuotedIdentifier = '"' ^('\r' | '\n' | '"')+ '"';
interleave whitespace = (" " | "\r" | "\n" | "\t")+;
}
}
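To see what these rules amount to, it can help to compare them with a hand-written parser. The following Python sketch is hypothetical (the Parser class and its method names are invented for this example, and the nested lists it returns only approximate the default tree that Intellipad displays). Each method corresponds to one syntax rule, and the SKIP pattern plays the role of the interleave statement, with tabs also skipped:

```python
import re

SKIP = re.compile(r"[ \r\n\t]*")      # stands in for the interleave rule
QUOTED = re.compile(r'"[^\r\n"]+"')   # stands in for the QuotedIdentifier token


class Parser:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def _skip(self):
        self.pos = SKIP.match(self.text, self.pos).end()

    def _peek(self, literal):
        self._skip()
        return self.text.startswith(literal, self.pos)

    def _expect(self, literal):
        if not self._peek(literal):
            raise SyntaxError(f"expected {literal!r} at offset {self.pos}")
        self.pos += len(literal)
        return literal

    def _quoted(self):
        self._skip()
        match = QUOTED.match(self.text, self.pos)
        if match is None:
            raise SyntaxError(f"expected a quoted identifier at offset {self.pos}")
        self.pos = match.end()
        return match.group()

    def _quoted_list(self):
        # QuotedIdentifier+ : one is required, then as many as follow.
        items = [self._quoted()]
        while self._peek('"'):
            items.append(self._quoted())
        return items

    def section(self, name):
        # Shared shape of ReportsRule and RecipientsRule.
        return [self._expect(name), self._quoted_list(), self._expect("End " + name)]

    def group(self):
        return [self._expect("Group"), self._quoted(),
                self.section("Reports"), self.section("Recipients"),
                self._expect("End Group")]

    def main(self):
        start = self._expect("ReportBundle")
        groups = []
        while self._peek("Group"):    # GroupRule*
            groups.append(self.group())
        end = self._expect("End ReportBundle")
        self._skip()
        if self.pos != len(self.text):
            raise SyntaxError(f"unexpected text at offset {self.pos}")
        return [start, groups, end]


SAMPLE = '''ReportBundle
  Group "Status Reports"
    Reports
      "http://reportserver/reports/dailystatus.aspx"
    End Reports
    Recipients
      "management@litware.org"
    End Recipients
  End Group
End ReportBundle'''

tree = Parser(SAMPLE).main()
print(tree[1][0][1])
```

The grammar expresses the same structure in a handful of declarative lines, which is why it is so much easier to refine interactively.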
This grammar satisfies the requirement to parse the language
but it would be better if we could refactor the language to provide a better
experience.
Improving Your Grammar
The goal of any DSL is to be easy to author and read. Since
our DSL is meant for non-technical people to author and/or edit we should add
some elements to the grammar to make it easier. One of the ways you can make
your language easier to use is with MGrammar attributes. MGrammar supports a number of
attributes. The first one you might consider is the CaseInsensitive attribute:
@{CaseInsensitive{}}
language ReportBundle
{
...
}
This attribute tells the parser to perform matches in the
language without regard to case. Depending on your language, this may be useful
in making it easier to author your language.
Next you might decide that having the ability to create
comments will help your DSL authors. You can do this by creating a new token
for comments:
token Comment = "//" ^("\r" | "\n")+;
This token specifies that a comment starts with “//”
and that everything up to the next carriage return or line feed is considered part of the
Comment. Now you have
to use the token. In this case you can just add it to the interleave statement since
the comment should just be ignored:
interleave whitespace = (" " | "\r" | "\n" | "\t")+ | Comment;
Notice that if you add a comment to the bundle file it’s ignored
in the parsed tree. This certainly helps your DSL authors, but you should take
it one step further.
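One detail worth noticing is that the report URLs themselves contain “//”. The Python sketch below is a hypothetical scanner written for illustration (SCANNER and tokens are invented names, and MGrammar's real matching rules may differ). It suggests why a token-based approach handles this naturally: a quoted string is consumed as a whole token, so the comment rule only applies between tokens. Tabs are also skipped here.

```python
import re

# Each alternative is tried at the current position; "skip" covers what the
# interleave rule ignores (whitespace or a // comment).
SCANNER = re.compile(r'''
    (?P<skip>[ \r\n\t]+|//[^\r\n]+)
  | (?P<quoted>"[^\r\n"]+")
  | (?P<word>[A-Za-z]+)
''', re.VERBOSE)


def tokens(text):
    """Return the significant tokens, dropping whitespace and comments."""
    pos = 0
    result = []
    while pos < len(text):
        match = SCANNER.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        if match.lastgroup != "skip":
            result.append(match.group())
        pos = match.end()
    return result


source = ('Reports // the daily ones\n'
          '  "http://reportserver/reports/dailystatus.aspx"\n'
          'End Reports')
print(tokens(source))
```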
MGrammar supports another attribute which can be used to create
hints about different parts of your language. This is primarily used to help
provide syntax coloring for text editors, similar to Visual Studio’s code
coloring. In MGrammar you can use the Classification
attribute to tell the editor how to color that part of the language. For
example you can add the Classification
attribute to the Comment
token:
@{Classification{"Comment"}}
token Comment = "//" ^("\r" | "\n")+;
Because Intellipad looks for these attributes at runtime, any
comments it finds will be colored like a comment as shown in Figure 6:
Figure 6: Classification Attribute in Action
You can use the Classification
attribute on language elements as well to color them in Intellipad. The Classification attribute is
applied to tokens (as you saw on the Comment
token) so you need to refactor some of the elements of your grammar to use
tokens instead of inline text. For example you can create two new tokens for
the ReportBundle and End ReportBundle language elements:
token ReportBundleToken = "ReportBundle";
token EndReportBundleToken = "End ReportBundle";
If you replace the text in the Main
syntax with these tokens, you will see that the language still parses:
syntax Main = ReportBundleToken
GroupRule*
EndReportBundleToken;
Now you can add the Classification
attribute specifying the type of classification (e.g. Keyword) to the new tokens:
@{Classification{"Keyword"}}
token ReportBundleToken = "ReportBundle";
@{Classification{"Keyword"}}
token EndReportBundleToken = "End ReportBundle";
This causes Intellipad to color the tokens as keywords:
Figure 7: Classified as Keywords
In Intellipad’s Settings
folder is a file called ClassificationTypes.xcml.
This file defines the different types of classifications that it uses. This
file indicates a number of different classifications:
- Keyword
- Identifier
- Whitespace
- Comment
- Operator
- Delimiter
- Literal
- String
- Number
- Unknown
You have probably noticed that the coloring you
are getting in your copy of Intellipad does not look exactly like the screenshots.
Intellipad allows you to modify how syntax coloring is applied through a
file called ClassificationFormats.xcml,
also located in the Settings
folder of Intellipad. The source code for this article includes an example of
this file.
With these classification types known, you can refactor the
language to add some more interesting syntax coloring by creating new tokens
and adding classifications. For example:
module Litware.Data.Reporting
{
@{CaseInsensitive}
language ReportBundle
{
syntax Main = ReportBundleToken
GroupRule*
EndReportBundleToken;
syntax GroupRule = GroupToken QuotedIdentifier
ReportsRule
RecipientsRule
EndGroupToken;
syntax ReportsRule = ReportsToken
QuotedIdentifier+
EndReportsToken;
syntax RecipientsRule = RecipientsToken
QuotedIdentifier+
EndRecipientsToken;
@{Classification{"String"}}
token QuotedIdentifier = '"' ^('\r' | '\n' | '"')+ '"';
@{Classification{"Keyword"}}
token ReportBundleToken = "ReportBundle";
@{Classification{"Keyword"}}
token EndReportBundleToken = "End ReportBundle";
@{Classification{"Keyword"}}
token GroupToken = "Group";
@{Classification{"Keyword"}}
token EndGroupToken = "End Group";
@{Classification{"Keyword"}}
token ReportsToken = "Reports";
@{Classification{"Keyword"}}
token EndReportsToken = "End Reports";
@{Classification{"Keyword"}}
token RecipientsToken = "Recipients";
@{Classification{"Keyword"}}
token EndRecipientsToken = "End Recipients";
@{Classification{"Comment"}}
token Comment = "//" ^("\r" | "\n")+;
interleave whitespace = (" " | "\r" | "\n" | "\t")+ | Comment;
}
}
With all your classifications in place the language now is
displayed in full color in Intellipad as shown in Figure 8:
Figure 8: Syntax Coloring
You have now improved your language to help its authors. Next
we can improve the syntax tree to help the developers.
Productions
At this point your language is parsing and producing a syntax
tree. The form of the syntax tree is based on default rules about how an
MGrammar becomes a syntax tree. For our grammar, MGrammar produces the
following syntax tree:
Main[
"ReportBundle",
[
GroupRule[
"Group",
"\"Status Reports\"",
ReportsRule[
"Reports",
[
"\"http://reportserver/reports/dailystatus.aspx\"",
"\"http://reportserver/reports/checklist.aspx\""
],
"End Reports"
],
RecipientsRule[
"Recipients",
[
"\"management@litware.org\""
],
"End Recipients"
],
"End Group"
],
GroupRule[
"Group",
"\"Problem Reports\"",
ReportsRule[
"Reports",
[
"\"http://reportserver/reports/problemreports.aspx\"",
"\"http://reportserver/reports/systemstatus.aspx\""
],
"End Reports"
],
RecipientsRule[
"Recipients",
[
"\"itstaff@litware.org\"",
"\"devmanagers@litware.org\""
],
"End Recipients"
],
"End Group"
]
],
"End ReportBundle"
]
This default syntax tree is not MGraph. Unfortunately it
looks like MGraph but is only MGraph-like. It represents the structure of the
parsed language. In fact the mgx.exe
tool (included in the Oslo SDK) can take the MGrammar and a source file and produce
an MGraph file, but that should not be confused with the syntax tree. As I
mentioned earlier, you could store the syntax tree in a data store (like the
Repository). Creating an MGraph file would make this process much easier. In
the third part of this series we will delve deeper into how to use the toolset (including mgx.exe)
to work with syntax trees both at runtime and stored in data stores. For example, when the
tree that you can see in the tree view is converted to M, it produces a more
verbose format:
module dailyreports {
Main {
[0] = "ReportBundle",
[1] {
{
[0] {
GroupRule {
[0] = "Group",
[1] = "\"Status Reports\"",
[2] {
ReportsRule {
[0] = "Reports",
[1] {
{
[0] = "\"http://reportserver/reports/dailystatus.aspx\"",
[1] = "\"http://reportserver/reports/checklist.aspx\""
}
},
[2] = "End Reports"
}
},
[3] {
RecipientsRule {
[0] = "Recipients",
[1] {
{
[0] = "\"management@litware.org\""
}
},
[2] = "End Recipients"
}
},
[4] = "End Group"
}
},
[1] {
GroupRule {
[0] = "Group",
[1] = "\"Problem Reports\"",
[2] {
ReportsRule {
[0] = "Reports",
[1] {
{
[0] = "\"http://reportserver/reports/problemreports.aspx\"",
[1] = "\"http://reportserver/reports/systemstatus.aspx\""
}
},
[2] = "End Reports"
}
},
[3] {
RecipientsRule {
[0] = "Recipients",
[1] {
{
[0] = "\"itstaff@litware.org\"",
[1] = "\"devmanagers@litware.org\""
}
},
[2] = "End Recipients"
}
},
[4] = "End Group"
}
}
}
},
[2] = "End ReportBundle"
}
}
In the syntax tree there are two kinds of collection
delimiters: curly braces (e.g. { }) and brackets (e.g. [ ]). Brackets indicate
an ordered collection and curly braces indicate an unordered collection. This
is why the M that is produced includes explicit indexes (e.g. [0], [1]) to preserve the order. This is unlikely to be
the format we actually want to work with.
In addition, you should notice that there is a lot of
extraneous information. The syntax/token name (e.g. Main) is listed for every element as well as the
textual version of the syntax/token. You should also notice that each QuotedIdentifier still includes
its surrounding quotes, which would be nice to remove. What you will really want to do
in most cases is change the form of the tree. Changing the form is done using productions.
Productions are simply a way to tell the MGrammar that you
want to specify how the tree is produced. In productions you add identifiers to
the language elements to be used in an output pattern. For example, we can
change the Main syntax
as shown:
syntax Main = ReportBundleToken
g:GroupRule*
EndReportBundleToken
=> ReportBundle { g };
The first change is to add an identifier to the GroupRule by adding g:. This gives an identifier that
can be used in the production. Next, the production operation (=>) is added and is
immediately followed by the pattern we want to use instead of the default tree
structure. This example changes the structure to be just ReportBundle with the
contents of the group within a set of curly braces. This changes the top of the
syntax tree to simply:
ReportBundle{
[
GroupRule[
...
],
GroupRule[
...
]
]
}
The doubly nested brackets are probably not useful, so you can
modify the production to place the GroupRules
directly inside the ReportBundle
by using the valuesof
operator. The valuesof
operator returns the items in a collection so that they are placed directly in the
enclosing collection. Changing the production to use the valuesof operator looks like
this:
syntax Main = ReportBundleToken
g:GroupRule*
EndReportBundleToken
=> ReportBundle { valuesof(g) };
This results in the elimination of the extra set of brackets:
ReportBundle{
GroupRule[
...
],
GroupRule[
]
}
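If the effect of valuesof is hard to picture, list unpacking in Python is a reasonable analogy. This is a sketch with made-up placeholder data, not output from the Oslo tools: placing a list inside another collection nests it, while unpacking it places its items directly in the outer collection.

```python
# Placeholder for what the label g is bound to: a collection of two groups.
groups = [["GroupRule", "..."], ["GroupRule", "..."]]

# Analogous to: => ReportBundle { g }
without_valuesof = {"ReportBundle": [groups]}

# Analogous to: => ReportBundle { valuesof(g) }
with_valuesof = {"ReportBundle": [*groups]}

print(len(without_valuesof["ReportBundle"]))  # one nested collection
print(len(with_valuesof["ReportBundle"]))     # the groups themselves
```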
The nature of how productions work is that you can specify a
production per language element (usually syntax and token). For example if you
create a projection for the GroupRule
syntax, you could specify how the GroupRule
is projected and that projection would be used in the Main production when the GroupRule is included in that format. For
example if we add a projection to the GroupRule:
syntax GroupRule = GroupToken name:QuotedIdentifier
rpt:ReportsRule
rec:RecipientsRule
EndGroupToken
=> Group {
GroupName { name },
rpt,
rec
};
In this example the production simplifies the syntax tree by
creating a Group and
creating a GroupName
section for the name specified after the Group
token. It also simply places the Reports
and Recipients after
the group name. This projection is used by the ReportBundle
projection so we get a syntax tree like so:
ReportBundle{
Group{
GroupName{
"\"Status Reports\""
},
ReportsRule[
"Reports",
[
"\"http://reportserver/reports/dailystatus.aspx\"",
"\"http://reportserver/reports/checklist.aspx\""
],
"End Reports"
],
RecipientsRule[
"Recipients",
[
"\"management@litware.org\""
],
"End Recipients"
]
},
...
}
Another problem in the syntax tree is that our QuotedIdentifiers still have
the embedded quotes in the strings. We can solve this by creating a projection
for that token like so:
token QuotedIdentifier = '"' i:(^('\r' | '\n' | '"')+) '"' => i;
You can see here that we have surrounded the inner part of
the quoted identifier with parentheses and then added a label. The projection
just uses the labeled part of the language which allows us to strip off the
extra quotes which leaves our syntax tree like so:
ReportBundle{
Group{
GroupName{
"Status Reports"
},
ReportsRule[
"Reports",
[
"http://reportserver/reports/dailystatus.aspx",
"http://reportserver/reports/checklist.aspx"
],
"End Reports"
],
RecipientsRule[
"Recipients",
[
"management@litware.org"
],
"End Recipients"
]
},
The ReportsRule
and RecipientsRule are
still not pretty enough, so you can simply add projections using the valuesof operator like so:
syntax ReportsRule = ReportsToken
r:QuotedIdentifier+
EndReportsToken
=> Reports { valuesof(r) };
syntax RecipientsRule = RecipientsToken
r:QuotedIdentifier+
EndRecipientsToken
=> Recipients { valuesof(r) };
This leaves us with a much more concise syntax tree:
ReportBundle{
Group{
GroupName{
"Status Reports"
},
Reports{
"http://reportserver/reports/dailystatus.aspx",
"http://reportserver/reports/checklist.aspx"
},
Recipients{
"management@litware.org"
}
},
Group{
GroupName{
"Problem Reports"
},
Reports{
"http://reportserver/reports/problemreports.aspx",
"http://reportserver/reports/systemstatus.aspx"
},
Recipients{
"itstaff@litware.org",
"devmanagers@litware.org"
}
}
}
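Taken together, the productions act like a set of small reshaping functions, one per rule. The Python sketch below illustrates that idea under stated assumptions: the function names are hypothetical, the default tree is approximated with nested lists that omit the rule-name labels, and dictionaries stand in for the named nodes. It is not how MGrammar applies productions internally.

```python
def strip_quotes(token):
    # Mirrors the QuotedIdentifier projection, which keeps only the inner text.
    return token[1:-1]


def project_section(node):
    # node is [keyword, [quoted, ...], end_keyword]; keep only the values.
    _, items, _ = node
    return [strip_quotes(item) for item in items]


def project_group(node):
    _, name, reports, recipients, _ = node
    return {"GroupName": strip_quotes(name),
            "Reports": project_section(reports),
            "Recipients": project_section(recipients)}


def project_main(node):
    _, groups, _ = node
    return {"ReportBundle": [project_group(group) for group in groups]}


default_tree = [
    "ReportBundle",
    [["Group", '"Status Reports"',
      ["Reports", ['"http://reportserver/reports/dailystatus.aspx"'], "End Reports"],
      ["Recipients", ['"management@litware.org"'], "End Recipients"],
      "End Group"]],
    "End ReportBundle",
]

concise = project_main(default_tree)
print(concise["ReportBundle"][0]["Recipients"])
```

As with the grammar, each projection only needs to know about its own rule; the outer rules simply reuse whatever the inner rules produce.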
Where we are…
Looking back at the original exemplar of your DSL, you should
be able to see that creating a fairly narrowly scoped language does not need to
be difficult. It’s also straightforward to add value with features like syntax
highlighting and case insensitivity. Finally, customizing the resulting syntax
tree with productions can help you create the data that you will ultimately use
in a running system (as we will see in Part 3).
Now that you have a working textual DSL, you are ready to
use the language. In the next part of this series you will consume the syntax
tree at runtime.
Resources