December 2011

Volume 26 Number 12

The Working Programmer - Parser Combinators

By Ted Neward | December 2011

Ted NewardWith the conclusion of the multiparadigmatic series, it seemed time to venture out into new ground. As fate would have it, however, some client work recently left me with interesting material that bears discussion, relates to the design of software, acts as another example of the commonality/variability analysis at the core of the multiparadigmatic series, and … well, in the end, it’s just really cool.

The Problem

The client I have is neck-deep in the neuro-optical scientific world and asked me to work with him on a new project designed to make conducting experiments on optical tissue easier. Specifically, I’m working on a software system to control a microscope rig that will drive various stimulus devices (LEDs, lights and so on) that will trigger responses from the optical tissue, and then capture the results measured by hardware watching the optical tissue.

If it all sounds vaguely Matrix-y to you, you’re not entirely alone. When I first heard about this project, my reaction was simultaneously, “Oh, wow, that’s cool!” and, “Oh, wait, I just threw up in my mouth a little.”

At any rate, one of the key things about the rig is that it will have a fairly complex configuration associated with each experiment run, and that led us to contemplate how to specify that configuration. On the one hand, it seemed an obvious problem for an XML file. However, the people running the rig aren’t going to be computer programmers, but rather scientists and lab assistants, so it seemed a little heavy-handed to expect them to write well-formed XML files (and get it right at every turn). The thought of producing some kind of GUI-based configuration system struck us as highly over-engineered, particularly as it would quickly turn into discussions of how best to capture open-ended kinds of data.

In the end, it seemed more appropriate to give them a custom configuration format, which meant tons of parsing text on my part. (To some, this would imply that I’m building a DSL; this is a debate best left to philosophers and others involved in the serious task of alcohol consumption.) Fortunately, solutions abound in this space.

Thoughts

A parser serves two interesting and useful purposes: converting text into some other, more meaningful form, and verifying/validating the text follows a certain structure (which typically is part of helping to convert it into a more meaningful form). So, for example, a phone number, which is at its heart just a sequence of numbers, still has a structure to it that requires verification. That format varies from continent to continent, but the numbers are still numbers. In fact, a phone number is a great example of a case where the “more meaningful form” isn’t an integer value—the digits aren’t an integer value, they’re a symbolic representation that’s usually better represented as a domain type. (Treating them as “just” a number makes it difficult to extract the country code or area code, for example.)

If a phone number is made up of digits, and so are numbers (salaries, employee IDs and so on), then there’s going to be some duplication in code where we parse and verify digits, unless we somehow extend a parser. This implies, then, that we’d like whatever parser we build to be open-ended, allowing somebody using the parser/library to extend it in different ways (Canadian postal codes, for example) without having to modify the source itself. This is known as the “open-closed principle”: Software entities should be open for extension, but closed for modification.

Solution: Generative Metaprogramming

One solution is the traditional “lex/yacc” approach, known more formally as a “parser generator.” This entails specifying the syntax for the configuration file in an abstract format—usually some variation on the Backus-Naur form (BNF) syntax/grammar used to describe formal grammar, such as what most programming languages use—then running a tool to generate code to pick the string input apart and yield some kind of tree of structures or objects as a result. Generally, this involved process is split into two steps, “lexing” and “parsing,” in which the lexer first transforms the string input into tokens, validating that the characters do in fact form legitimate tokens along the way. Then the parser takes the tokens and validates that the tokens are appearing in the appropriate order and contain the appropriate values, and so on, usually transforming the tokens into some kind of abstract tree structure for further analysis.

The problems with parser generators are the same for any generative metaprogramming approach: The code generated will need to be regenerated in the event that the syntax changes. But more importantly for this kind of scenario, the code generated will be computer-generated, with all the wonderful variable naming that comes with computer-generated code (anybody ready to stand up for variables such as “integer431” and “string$$x$y$z”?), thus difficult to debug.

Solution: Functional

In a certain kind of light, parsing is fundamentally functional: It takes input, performs some kind of operation on it and produces output as a result. The critical insight, it turns out, is that a parser can be created out of lots of little parsers, each of which parses a tiny bit of the string input, then returns a token and another function to parse the next little bit of string input. These techniques, which I believe were introduced in Haskell, are formally known as parser combinators, and they turn out to be an elegant solution to “midsize” parsing problems—parsers that aren’t necessarily as complex as a programming language would require, but something beyond what String.Split (or a hacked-up series of regex scans) can do.

In the case of parser combinators, the open-for-extension requirement is achieved by creating small functions, then using functional techniques to “combine” them into larger functions (which is where we get the name “combinators”). Larger parsers can be composed by anyone with sufficient skill to understand function composition. This technique is a general one that bears exploration, but I’ll save that for a future column.

As it turns out, there are several parser combinator libraries available for the Microsoft .NET Framework, many of them based on the Parsec module written in Haskell that sort of set the standard for parser combinatory libraries. Two such libraries are FParsec, written for F#, and Sprache, written for C#. Each is open source and relatively well-documented, such that they serve the dual purpose of being both useful out of the box and as a model from which to study design ideas. I’ll also leave FParsec for a future column.

“Sprache Sie Parsing?”

Sprache, available at code.google.com/p/sprache, describes itself as a “simple, lightweight library for constructing parsers directly in C# code,” which “doesn’t compete with ‘industrial strength’ language workbenches. It fits somewhere in between regular expressions and a full-featured toolset such as ANTLR.” (ANTLR is a parser generator, fitting into the Generative Metaprogramming category, like lex/yacc.)

Getting started with Sprache is straightforward: Download the code, build the project, then copy the Sprache.dll assembly into your project’s dependency directory and add the reference to the project. From here, all the parser definition work is done by declaring Sprache.Parser instances and combining them in particular ways to create Sprache.Parser instances, which in turn may, if desired (and it usually is), return domain objects containing some or all of the parsed values.

Sprache Simple

To begin, let’s start with a parser that knows how to parse user-entered phone numbers into a PhoneNumber domain type. For simplicity, I’ll stick with the U.S.-style format—(nnn) nnn-nnnn—but we want to specifically recognize the breakdown in area codes, prefix and line, and allow for letters in the place of digits (so somebody can enter their phone number as “(800) EAT-NUTS” if they desire). Ideally, the PhoneNumber domain type will convert between alpha and all-numeric forms on demand, but that functionality will be left as an exercise to the reader (meaning, essentially, that I don’t want to bother with it).

(The pedant in me demands to point out that simply converting all the alphas to numbers isn’t a fully compliant solution, by the way. In college, it was common in my circle of friends to try to come up with phone numbers that spelled out “cool” things—one ex-roommate is still waiting for 1-800-CTHULHU to become free, in fact, so he can win the game for all eternity.)

The easiest place to start is with the PhoneNumber domain type:

class PhoneNumber
{
  public string AreaCode { get; set; }
  public string Prefix { get; set; }
  public string Line { get; set; }
}

Were this a “real” domain type, AreaCode, Prefix and Line would have validation code in their property-set methods, but that would lead to a repetition of code between the parser and the domain class (which, by the way, we’ll fix before this is all done).

Next, we need to know how to create a simple parser that knows how to parse n number of digits:

public static Parser<string> numberParser =
  Parse.Digit.AtLeastOnce().Text();

Defining the numberParser is straightforward. Begin with the primitive parser Digit (an instance of a Parser<T> defined on the Sprache.Parse class), and describe that we want at least one digit in the input stream, implicitly consuming all digits until the input stream either runs dry or the parser encounters a non-digit character. The Text method converts the stream of parsed results into a single string for our consumption.

Testing this is pretty easy—feed it a string and let ’er rip:

[TestMethod]
public void ParseANumber()
{
  string result = numberParser.Parse("101");
  Assert.AreEqual("101", result);
}
[TestMethod]
public void FailToParseANumberBecauseItHasTextInIt()
{
  string result = numberParser.TryParse("abc").ToString();
  Assert.IsTrue(result.StartsWith("Parsing failure"));
}

When run, this stores “101” into result. If the Parse method is fed an input string of “abc,” it will yield an exception. (If nonthrowing behavior is preferred, Sprache also has a TryParse method that returns a Result object that can be interrogated regarding success or failure.)

The phone-number parsing situation is a little bit more complicated, though; it needs to parse just three or four digits—no more, no less. Defining one such parser (the three-digit parser) is a bit trickier, but still doable:

public static Parser<string> threeNumberParser =
  Parse.Numeric.Then(first =>
    Parse.Numeric.Then(second =>
      Parse.Numeric.Then(third =>
        Parse.Return(first.ToString() +
          second.ToString() + third.ToString()))));

The Numeric parser takes a character and, if it’s a digit, advances to the next character. The Then method takes a function (in the form of a lambda expression) to execute. The Return method collects each of these into a single string and, as its name implies, uses that as the return value (see Figure 1).

Figure 1 Parsing a Phone Number

[TestMethod]
public void ParseJustThreeNumbers()
{
  string result = threeNumberParser.Parse("123");
  Assert.AreEqual("123", result);
}
[TestMethod]
public void ParseJustThreeNumbersOutOfMore()
{
  string result = threeNumberParser.Parse("12345678");
  Assert.AreEqual("123", result);
}
[TestMethod]
public void FailToParseAThreeDigitNumberBecauseItIsTooShort()
{
  var result = threeNumberParser.TryParse("10");
  Assert.IsTrue(result.ToString().StartsWith("Parsing failure"));
}

Success. So far. (Yes, the definition of threeNumberParser is awkward—surely there has to be a better way to define this! Fear not: there is, but to understand how to extend the parser, we have to dive deeper into how Sprache is constructed, and that’s the subject of the next part in this series.)

Now, however, we need to handle the left-parens, the right-­parens and the dash, and convert everything into a PhoneNumber object. It might seem a bit awkward with what we see so far, but watch what happens next, as shown in Figure 2.

Figure 2 Converting Input into a PhoneNumber Object

public static Parser<string> fourNumberParser =
  Parse.Numeric.Then(first =>
    Parse.Numeric.Then(second =>
      Parse.Numeric.Then(third =>
        Parse.Numeric.Then(fourth =>
          Parse.Return("" + first.ToString() +
            second.ToString() + third.ToString() +
              fourth.ToString())))));
public static Parser<string> areaCodeParser =
  (from number in threeNumberParser
  select number).
  XOr(
  from lparens in Parse.Char('(')
  from number in threeNumberParser
  from rparens in Parse.Char(')')
  select number);
public static Parser<PhoneNumber> phoneParser =
  (from areaCode in areaCodeParser
  from _1 in Parse.WhiteSpace.Many().Text()
  from prefix in threeNumberParser
  from _2 in (Parse.WhiteSpace.Many().Text()).
  Or(Parse.Char('-').Many())
  from line in fourNumberParser
  select new PhoneNumber() { AreaCode=areaCode, Prefix=prefix, Line=line});
Using the parser becomes pretty straightforward at this point:
[TestMethod]
public void ParseAFullPhoneNumberWithSomeWhitespace()
{
  var result = phoneParser.Parse("(425) 647-4526");
  Assert.AreEqual("425", result.AreaCode);
  Assert.AreEqual("647", result.Prefix);
  Assert.AreEqual("4526", result.Line);
}

Best of all, the parser is entirely extensible, because it, too, can be composed into a larger parser that transforms text input into an Address object or ContactInfo object or anything else imaginable.

The Combinatorics Concept

Historically, parsing text has been the province of “language researchers” and academia, largely due to the complicated and difficult edit-generate-compile-test-debug cycle inherent with generative metaprogramming solutions. Trying to walk through computer-­generated code—particularly the finite-state-machine-based versions that many parser generators churn out—in a debugger is a challenge to even the most hard-bitten developer. For that reason, most developers don’t think about solutions along parsing lines when presented with a text-based problem. And, in truth, most of the time, a parser- generator-based solution would be drastic overkill.

Parser combinators serve as a nice in-between solution: flexible enough and powerful enough to handle some nontrivial parsing, without requiring a Ph.D. in computer science to understand how to use them. Even more interestingly, the concept of combinatorics is a fascinating one, and leads to some other interesting ideas, some of which we’ll explore later.

In the spirit in which this column was born, make sure to keep an “eye” out for my next column (sorry, couldn’t resist), in which I’ll extend Sprache just a touch, to reduce the ugliness of the three- and four-digit parsers defined here.

Happy coding!  


Ted Newardis an architectural consultant with Neudesic LLC. He has written more than 100 articles and authored or coauthored a dozen books, including “Professional F# 2.0” (Wrox, 2010). He’s a C# MVP and speaks at conferences around the world. He consults and mentors regularly—reach him at ted@tedneward.com or Ted.Neward@neudesic.com if you’re interested in having him come work with your team, or read his blog at blogs.tedneward.com.

Thanks to the following technical expert for reviewing this article: Luke Hoban