November 2014

Volume 29 Number 11


The Working Programmer : Rise of Roslyn

Joe Hummel | November 2014

Ted NewardFor a few years now, various computer professionals, thought leaders and pundits have advocated the idea of domain-specific languages (DSLs) as a way of approaching solutions to software problems. This seems particularly appropriate if the DSL syntax is something “casual users” can use to adapt and modify the business rules in a system. This is the Holy Grail of software to many developers—building a system people can maintain on their own when the business needs change.

One of the principal criticisms of DSL, however, is the fact that writing a compiler is a “lost art.” It’s not uncommon for programmers from all walks of life to look upon creating a compiler or interpreter as some kind of Dark Art.

At the Build 2014 conference this year, Microsoft formally announced one of the worst-kept secrets in the Microsoft .NET Framework development ecosystem—the open sourcing of Roslyn. This is the revamped/rebuilt compiler system that underlies the C# and Visual Basic languages. For some, this is a chance for Microsoft to put its languages into the open source community and reap the benefits—bug fixes, enhancements, public review of new language features and so on. For developers, it’s an opportunity to take a deeper look at how compilers (and interpreters—although Roslyn is focused on compilation given the languages in question) work under the hood.

For more background (and installation tips) check out the Roslyn CodePlex page at roslyn.codeplex.com. As always with not-yet-released bits, it’s highly recommended you do this on a virtual machine or a machine you don’t care too much about.

Roslyn Fundamentals

At a high level, the goal of a compiler is to translate programmer input (source code) into executable output, such as a .NET assembly or native .exe file. While the exact names for modules within a compiler vary, we typically think of a compiler as broken into two fundamental parts: a front end and back end (see Figure 1).

High-Level Compiler Design
Figure 1 High-Level Compiler Design

One of the key responsibilities of the front end is to verify the formatting accuracy of the inbound source code. As with all programming languages, there’s a specific format programmers must follow to keep things clear and unambiguous to the machine. For example, consider the following C# statement:

if x < 0   <-- syntax error!
  x = 0;

This is syntactically incorrect, because if conditions must be surrounded by ( ), like so:

if (x < 0)
  x = 0;

Once the code is parsed, the back end is responsible for deeper validation of the source, such as type-safety violations:

string x = "123";
if (x < 0)                   <-- semantic error!
  x = 0;                     <-- semantic error!

Incidentally, these examples are deliberate design decisions by the language implementer. They’re subject to long debates as to which are “better” than others. To hear more, visit any online programming forum and type in, “D00d ur language sux.” You’ll soon find yourself embroiled in an “educational” session that will certainly be one to remember.

Assuming there are no syntactic or semantic errors, compilation continues and the back end translates the input into an equivalent program in the desired target language.

Deeper into the Depths

Although you might take the two-part approach with the simplest languages, most often a language compiler/interpreter is broken down into much more than that. At the next level of complexity, most compilers arrange themselves to act in six major phases, two in the front end and four in the back end (see Figure 2). 

The Main Phases of a Compiler
Figure 2 The Main Phases of a Compiler

The front end performs the first two phases: lexical analysis and parsing. The goal of lexical analysis is to read the input program and output the tokens—the keywords, punctuation, identifiers and so on. The location of each token is also maintained, so the format of the program isn’t lost. Suppose the following program fragment starts at the beginning of the source file:

// Comment
if (score>100)
  grade = "A++";

The output from lexical analysis would be this sequence of tokens:

IfKeyword                       @ Span=[12..14)
OpenParenToken              @ Span=[15..16)
IdentifierToken                  @ Span=[16..21), Value=score
GreaterThanToken             @ Span=[21..22)
NumericLiteralToken           @ Span=[22..25), Value=100
CloseParenToken              @ Span=[25..26)
IdentifierToken                  @ Span=[30..35), Value=grade
EqualsToken                    @ Span=[36..37)
StringLiteralToken             @ Span=[38..43), Value=A++
SemicolonToken               @ Span=[43..44)

Each token carries additional information, such as start and end position (Span) as measured from the start of the source file. Notice IfKeyword starts at position 12. This is due to the comment that spans [0..10) and the end-of-line characters that span [10..12). While technically not tokens, output from the lexical analyzer typically includes information about whitespace, including comments. In the .NET compiler, whitespace is conveyed as syntax trivia. 

The second phase of the compiler is parsing. The parser works hand-in-hand with the lexical analyzer to perform syntax analysis. The parser does the vast majority of the work, requesting tokens from the lexical analyzer as it checks the input program against the various grammar rules of the source language. For example, C# programmers all know the syntax of an if statement:

if  (  condition  )  then-part  [ else-part ]

The [ … ] symbolizes the else-part is optional. The parser enforces this rule by matching tokens, and applying additional rules for the more complex syntactic elements such as condition and then-part:

void if( )
{
  match(IfKeyword);
  match(OpenParenToken);
  condition();
  match(CloseParenToken);
  then_part();
  if (lookahead(ElseKeyword))
  else_part();
}

The function match(T) calls the lexical analyzer to get the next token, and checks to see if this token matches T. Compilation continues normally if it matches. Otherwise, it reports a syntax error. The simplest of parsers use a match function to throw an exception upon a syntax error. This effectively halts compilation. Here’s one such implementation:

void match(SyntaxToken T)
{
  var next = lexer.NextToken();
  if (next == T)
  ;  // Keep going, all is well:
  else
  throw new SyntaxError(...);
}

Lucky for us, the .NET compiler contains a much more sophisticated parser. It’s capable of continuing in the face of gross syntax errors.

Assuming there were no syntax errors, the front end is essentially done. It has but one task remaining—to convey its efforts to the back end. The form in which this is stored internally is known as its intermediate representation—or IR. (Despite the similarity in terminology, an IR has nothing to do with the .NET Common Intermediate Language.) The parser in the .NET compiler builds an Abstract Syntax Tree (AST) as the IR, and passes this tree to the back end.

Trees are a natural IR, given the hierarchical nature of C# and Visual Basic programs. A program will contain one or more classes. A class contains properties and methods, properties and methods contain statements, statements often contain blocks, and blocks contain additional statements. The goal of an AST is to represent the program based on its syntactic structure. The “abstract” in AST denotes the absence of syntactic sugar such as ; and ( ).  For example, consider the following sequence of C# statements (assume these compile without error):

sum = 0;
foreach (var x in A)   // A is an array:
  sum += x;
avg = sum / A.Length;

At a high level, the AST for this code fragment would look like Figure 3.

High-Level Abstract Syntax Tree for C# Code Fragment
Figure 3 High-Level Abstract Syntax Tree for C# Code Fragment (Details Missing for Simplicity)

The AST captures the necessary information about the program: the statements, the order of the statements, the pieces of each statement and so on. The unnecessary syntax is discarded, such as all semicolons. The key feature to understand about the AST in Figure 3 is that it captures the syntactic structure of the program.

In other words, it’s how the program is written, not how it exe­cutes. Consider the foreach statement, which loops zero or more times as it iterates through a collection. The AST captures the components of the foreach statement—the loop variable, the collection and the body. What the AST doesn’t convey is that foreach may repeat over and over again. In fact, if you look at the tree, there’s no arrow in the tree to signify how foreach executes. The only way to know is by knowing the foreach keyword == loop.

ASTs are a perfectly good IR, with one main advantage: They’re easy to build and understand. The disadvantage is that more sophisticated analyses, such as those used in the back end of the compiler, are more difficult to perform on an AST. For this reason, compilers often maintain multiple IRs, including a common alternative to the AST. This alternative is the control flow graph (CFG), which represents a program based on its flow of control: loops, if-then-else statements, exceptions and so on. (We’ll cover this more in the next column.)

The best way to learn how the AST is used in the .NET compiler is through the Roslyn Syntax Visualizer. This is installed as part of the Roslyn SDK. Once installed, open any C# or Visual Basic program in Visual Studio 2013, position your cursor at the source line of interest and open the visualizer. You’ll see the View menu, Other Windows and Roslyn Syntax Visualizer (see Figure 4).

The Roslyn Syntax Visualizer in Visual Studio 2013
Figure 4 The Roslyn Syntax Visualizer in Visual Studio 2013

As a concrete example, consider the if statement we parsed earlier:

 

// Comment
  if (score>100)
    grade = "A++";

Figure 5 shows the corresponding AST fragment built by the .NET compiler.

Abstract Syntax Tree Built by .NET Compiler for IfStatement
Figure 5 Abstract Syntax Tree Built by .NET Compiler for IfStatement

As with a lot of things, the tree seems overwhelming at first. Remember two things, though. One, the tree is simply an expansion of the earlier source statements, so it’s actually quite easy to walk through the tree and see how it maps back to the original source. And two, the AST is intended for machine consumption, not for humans. Generally the only time a human looks at the AST is to debug a parser. Bear in mind, as well, that a more complete course on lexing and parsing is well beyond the scope of the space we have here. There are lots of resources available to those who want to dive deeper into this exercise. The goal here is a gentle introduction, not a deep dive.

Wrapping Up

We’re not done with Roslyn by any stretch of the imagination, so stay tuned. It you’re interested in diving deeper into Roslyn, may we suggest installing Roslyn. Then have a look at some of the documentation, starting with the Roslyn CodePlex page.

If you want to dive more deeply into parsing and lexing, there are numerous books available. There’s the venerable “Dragon Book,” also known as “Compilers: Principles, Techniques & Tools” (Addison Wesley, 2006). If you’re interested in a more .NET-centric approach, consider “Compiling for the .NET Common Language Runtime (CLR)” by John Gough (Prentice Hall, 2001), or Ronald Mak’s “Writing Compilers and Interpeters: A Software Engineering Approach” (Wiley, 2009). Happy coding!


Joe Hummel is a research associate professor at the University of Illinois, Chicago, a content creator for Pluralsight.com, a Visual C++ MVP, and a private consultant. He earned a Ph.D. at UC Irvine in the field of high-performance computing and is interested in all things parallel. He resides in the Chicago area, and when he isn’t sailing can be reached at joe@joehummel.net.

Ted Neward is the CTO at iTrellis, a consulting services company. He has written more than 100 articles and authored a dozen books, including “Professional F# 2.0” (Wrox, 2010). He’s an F# MVP and speaks at conferences around the world. He consults and mentors regularly—reach him at ted@tedneward.com or ted@itrellis.com if you’re interested.

Thanks to the following Microsoft technical expert for reviewing this article: Dustin Campbell