C# 3.0

The Evolution Of LINQ And Its Impact On The Design Of C#

Anson Horton

This article is based on a prerelease version of Visual Studio code-named "Orcas." All information herein is subject to change.

This article discusses:
  • C# and LINQ
  • The evolution of LINQ
  • SQL querying from code
This article uses the following technologies:
LINQ, C#

Contents

Lambda Expressions
Extension Methods
Anonymous Types
Implicitly Typed Local Variables
Object Initializers
Query Expressions

I was a huge fan of the Connections series, hosted by James Burke, when it aired on the Discovery Channel. Its basic premise: how seemingly unrelated discoveries influenced other discoveries, which ultimately led to some modern-day convenience. The moral, if you will, is that no advancement is made in isolation. Not surprisingly, the same is true for Language Integrated Query (LINQ).

In simple terms, LINQ is a series of language extensions that supports data querying in a type-safe way; it will be released with the next version of Visual Studio, code-named "Orcas." The data to be queried can take the form of XML (LINQ to XML), databases (LINQ-enabled ADO.NET, which includes LINQ to SQL, LINQ to DataSet, and LINQ to Entities), objects (LINQ to Objects), and so on. The LINQ architecture is shown in Figure 1.

Figure 1 LINQ Architecture

Let’s look at some code. A sample LINQ query in the upcoming "Orcas" version of C# might look like:

var overdrawnQuery = from account in db.Accounts
                     where account.Balance < 0
                     select new { account.Name, account.Address };

When the results of this query are iterated over using foreach, each element returned would consist of a name and address of an account that has a balance less than 0.

It’s immediately obvious from the sample above that the syntax is like SQL. Several years ago, Anders Hejlsberg (chief designer of C#) and Peter Golde thought of extending C# to better integrate data querying. Peter, who was the C# compiler development lead at the time, was investigating the possibility of making the C# compiler extensible, specifically to support add-ins that could verify the syntax of domain-specific languages like SQL. Anders, on the other hand, was conceiving a deeper, more specific level of integration. He was thinking about a set of "sequence operators" that would operate on any collection that implemented IEnumerable, as well as remote queries for types that implemented IQueryable. Ultimately, the sequence operator idea gained the most support, and in early 2004 Anders submitted a paper about the idea to Bill Gates’s Thinkweek. The feedback was overwhelmingly positive. In the early stages of the design, a simple query had the following syntax:

sequence<Customer> locals = customers.where(ZipCode == 98112);

Sequence, in this case, was an alias for IEnumerable<T>, and the word "where" was a special operator understood by the compiler. The implementation of the where operator was a normal C# static method that took in a predicate delegate (that is, a delegate of the form bool Pred<T>(T item)). The idea was for the compiler to have special knowledge about the operator. This would allow the compiler to correctly call the static method and create the code to hook up the delegate to the expression.

Let’s suppose that the example above would be the ideal syntax for a query in C#. What would this query look like in C# 2.0, without any language extensions?

IEnumerable<Customer> locals = EnumerableExtensions.Where(customers,
    delegate(Customer c)
    {
        return c.ZipCode == 98112;
    });

This code is frightfully verbose and, worse, it requires significant digging to find the relevant filter (ZipCode == 98112). And this example is simple; imagine how much more unreadable this would be with several filters, projections, and so forth. The root of the verbosity is the syntax required for anonymous methods. In the ideal query, the filter would consist of nothing but the expression to be evaluated. The compiler would then attempt to infer the context; for example, that ZipCode really refers to the ZipCode defined on Customer. How could this problem be fixed? Hardcoding the knowledge of specific operators into the language didn’t sit well with the language design team, so they started looking for an alternate syntax for anonymous methods. They wanted it to be extremely concise, yet not require more knowledge than the compiler already needed for anonymous methods. Ultimately they devised lambda expressions.

Lambda Expressions

Lambda expressions are a language feature that is similar in many ways to anonymous methods. In fact, if lambda expressions had been put into the language first, there would have been no need for anonymous methods. The basic idea is that you can treat code as data. In C# 1.0, it is common to pass strings, integers, reference types, and so on to methods so that the methods can act on those values. Anonymous methods and lambda expressions extend the range of the values to include code blocks. This concept is common in functional programming.

Let’s take the example above and replace the anonymous method with a lambda expression:

IEnumerable<Customer> locals = 
    EnumerableExtensions.Where(customers, c => c.ZipCode == 98112);

There are several things to notice. For starters, the brevity of the lambda expression can be attributed to a number of factors. First, the delegate keyword isn’t used to introduce the construct. Instead, there is a new operator, =>, which tells the compiler that this isn’t a normal expression. Second, the Customer type is inferred from the usage. In this case, the signature of the Where method looks something like:

public static IEnumerable<T> Where<T>(
    IEnumerable<T> items, Func<T, bool> predicate)

The compiler is able to infer that "c" refers to a customer because the first parameter of the Where method is IEnumerable<Customer>, such that T must, in fact, be Customer. Using this knowledge, the compiler also verifies that Customer has a ZipCode member. Finally, there is no return keyword specified. In this syntactic form, the return keyword is omitted, but this is merely a syntactic convenience. The result of the expression is still considered to be the return value.
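To make the inference described above concrete, here is a minimal, self-contained sketch. The body of Where is an assumption (a simple iterator that applies the predicate), as are the cut-down Customer class and the sample data; the article shows only the method’s signature.

```csharp
using System;
using System.Collections.Generic;

// Cut-down, hypothetical Customer type for illustration only
public class Customer
{
    public string Name;
    public int ZipCode;

    public Customer(string name, int zipCode)
    {
        Name = name;
        ZipCode = zipCode;
    }
}

public static class EnumerableExtensions
{
    // A plain static method: yields each item for which the predicate is true
    public static IEnumerable<T> Where<T>(
        IEnumerable<T> items, Func<T, bool> predicate)
    {
        foreach (T item in items)
        {
            if (predicate(item))
            {
                yield return item;
            }
        }
    }
}

public static class LambdaDemo
{
    public static List<string> LocalNames()
    {
        List<Customer> customers = new List<Customer>();
        customers.Add(new Customer("Ada", 98112));
        customers.Add(new Customer("Grace", 10001));

        // T is inferred as Customer from the first argument,
        // so "c" is a Customer and c.ZipCode is checked at compile time
        IEnumerable<Customer> locals =
            EnumerableExtensions.Where(customers, c => c.ZipCode == 98112);

        List<string> names = new List<string>();
        foreach (Customer c in locals)
        {
            names.Add(c.Name);
        }
        return names;
    }

    public static void Main()
    {
        foreach (string name in LocalNames())
        {
            Console.WriteLine(name);
        }
    }
}
```

Misspelling the member (for example, c.ZipCod) in the lambda would produce a compile-time error, because the compiler has already bound "c" to Customer.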

Lambda expressions, like anonymous methods, also support variable capture. For example, it’s possible to refer to the parameters or locals of the method that contains the lambda expression within the lambda expression’s body:

public IEnumerable<Customer> LocalCusts(
    IEnumerable<Customer> customers, int zipCode)
{
    return EnumerableExtensions.Where(customers,
        c => c.ZipCode == zipCode);
}
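One detail worth seeing in isolation is that a lambda captures the variable itself rather than a snapshot of its value. The following sketch uses hypothetical names (CaptureDemo, isLocal) and is only meant to illustrate that behavior:

```csharp
using System;

public static class CaptureDemo
{
    public static bool[] Run()
    {
        int zipCode = 98112;

        // The lambda captures the local variable zipCode, not a copy of its value
        Func<int, bool> isLocal = z => z == zipCode;

        bool before = isLocal(98112);   // compares against 98112

        zipCode = 10001;                // the lambda observes this assignment
        bool after = isLocal(98112);    // now compares against 10001

        return new bool[] { before, after };
    }

    public static void Main()
    {
        bool[] results = Run();
        Console.WriteLine(results[0]);  // True
        Console.WriteLine(results[1]);  // False
    }
}
```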

Finally, lambda expressions support a more verbose syntax that allows you to specify the types explicitly, as well as execute multiple statements. For example:

return EnumerableExtensions.Where(customers,
    (Customer c) => { int zip = zipCode; return c.ZipCode == zip; });

The good news is that we’re much closer to the ideal syntax proposed in the original paper, and we were able to get there with a language feature that is generally useful outside of query operators. Let’s take a look at where we are again:

IEnumerable<Customer> locals = 
    EnumerableExtensions.Where(customers, c => c.ZipCode == 98112);

There is an obvious problem here. Instead of thinking about the operations that can be performed on Customer, the consumer currently has to know about this EnumerableExtensions class. In addition, in the case of multiple operators, the consumer has to invert his thinking to write the correct syntax. For example:

IEnumerable<string> locals = 
    EnumerableExtensions.Select(
        EnumerableExtensions.Where(customers, c => c.ZipCode == 98112), 
        c => c.Name);

Notice that the Select is the outer method, even though it operates on the result of the Where method. The ideal syntax would look more like the following:

sequence<string> locals = 
    customers.where(ZipCode == 98112).select(Name);

So, would it be possible to move closer to the ideal syntax with another language feature?

Extension Methods

Much better syntax, it turns out, was to come in the form of a language feature known as extension methods. Extension methods are basically static methods that are callable through an instance syntax. The root of the problem for the query above is that we want to add methods to IEnumerable<T>. However, if we were to add operators, such as Where, Select, and so on, every existing and future implementer would be required to implement those methods. The vast majority of those implementations would be the same, though. The only way to share "interface implementation" in C# is to use static methods, which is what we’ve done with the EnumerableExtensions class used previously.

Let’s suppose we were to write the Where method as an extension method instead. The query could then be rewritten as:

IEnumerable<Customer> locals = 
    customers.Where(c => c.ZipCode == 98112);

For this simple query, this syntax is very close to the ideal. But what exactly does it mean to write the Where method as an extension method? It’s actually fairly straightforward. Basically the signature of the static method changes such that a "this" modifier is added to the first parameter:

public static IEnumerable<T> Where<T>(
    this IEnumerable<T> items, Func<T, bool> predicate)

In addition, the method must be declared within a static class. A static class is one that may contain only static members and that is denoted by the static modifier on the class declaration. That’s all there is to it. This declaration instructs the compiler to allow Where to be called with the same syntax as an instance method on any type that implements IEnumerable<T>. The Where method must, however, be accessible from the current scope. An extension method is in scope when its containing type is in scope. Therefore, it’s possible to bring extension methods into scope through a using directive that imports the namespace of the containing class. (See the sidebar "Extension Methods" for more information.)
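The pieces described above (a static class, the "this" modifier, and a using directive) can be put together in a small sketch. The namespace names QuerySketch and App, the ScopeDemo class, and the body of Where are assumptions made for illustration:

```csharp
using System;
using System.Collections.Generic;

namespace QuerySketch
{
    // Extension methods must be declared in a static class
    public static class EnumerableExtensions
    {
        // The "this" modifier on the first parameter makes Where an extension method
        public static IEnumerable<T> Where<T>(
            this IEnumerable<T> items, Func<T, bool> predicate)
        {
            foreach (T item in items)
            {
                if (predicate(item))
                {
                    yield return item;
                }
            }
        }
    }
}

namespace App
{
    // Brings the extension methods declared in QuerySketch into scope
    using QuerySketch;

    public static class ScopeDemo
    {
        public static List<int> Evens()
        {
            int[] numbers = { 1, 2, 3, 4 };

            // Instance syntax; the compiler binds this to the static method
            IEnumerable<int> viaExtension = numbers.Where(n => n % 2 == 0);

            // The same method, called explicitly as a static method
            IEnumerable<int> viaStatic =
                EnumerableExtensions.Where(numbers, n => n % 2 == 0);

            // Concatenate both results to show that they match
            List<int> result = new List<int>(viaExtension);
            result.AddRange(viaStatic);
            return result;
        }

        public static void Main()
        {
            foreach (int n in Evens())
            {
                Console.WriteLine(n);
            }
        }
    }
}
```

Removing the using directive would leave the explicit static call compilable (once qualified with its namespace) but would make numbers.Where(...) fail to bind to this method.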

Extension Methods

It’s clear that extension methods help simplify our example query, but are they a generally useful language feature outside of that scenario? It turns out that there are many uses for extension methods. One of the most common will probably be to provide shared interface implementations. For example, suppose you have the following interface:

interface IDog
{
    // Barks for 2 seconds
    void Bark();
    void Bark(int seconds);
}

This interface requires that every implementer write an implementation for both overloads. With the "Orcas" version of C#, the interface could simply be:

interface IDog
{
    void Bark(int seconds);
}

An extension method could be added in another class:

static class DogExtensions
{
    // Barks for 2 seconds
    public static void Bark(this IDog dog)
    {
        dog.Bark(2);
    }
}

Now the implementer of the interface need only implement a single method, but the clients of the interface may freely call either overload.
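As a rough, runnable version of the IDog example, the sketch below adds a hypothetical implementer (Beagle) that records the duration it was asked to bark for, so the effect of the shared overload can be observed:

```csharp
using System;

public interface IDog
{
    void Bark(int seconds);
}

public static class DogExtensions
{
    // Barks for 2 seconds
    public static void Bark(this IDog dog)
    {
        dog.Bark(2);
    }
}

// Hypothetical implementer; it only has to provide the one interface method
public class Beagle : IDog
{
    public int LastBarkSeconds;

    public void Bark(int seconds)
    {
        LastBarkSeconds = seconds;
    }
}

public static class DogDemo
{
    public static void Main()
    {
        Beagle dog = new Beagle();

        dog.Bark();                              // binds to DogExtensions.Bark(dog)
        Console.WriteLine(dog.LastBarkSeconds);  // 2

        dog.Bark(5);                             // binds to the instance method
        Console.WriteLine(dog.LastBarkSeconds);  // 5
    }
}
```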

We now have a syntax that is very close to the ideal for the filter clause, but is that all there is to the "Orcas" version of C#? Not quite; let’s extend the example a bit by projecting out only the customer’s name, as opposed to the entire customer object. As I mentioned earlier, the ideal syntax would take the following form:

sequence<string> locals = 
    customers.where(ZipCode == 98112).select(Name);

With just the language extensions we’ve discussed, lambda expressions and extension methods, this could be rewritten as:

IEnumerable<string> locals = 
    customers.Where(c => c.ZipCode == 98112).Select(c => c.Name);

Notice that the result type is different for this query: IEnumerable<string> instead of IEnumerable<Customer>. This happens because the Select call returns only the name of the customer.

That works really well when the projection is only a single field. However, suppose that instead of just the Name of the customer, we also want to return the customer’s address. The ideal syntax might look like this:

locals = customers.where(ZipCode == 98112).select(Name, Address);

Anonymous Types

If we were to continue using our existing syntax to return the name and address, we’d quickly run into the problem that there is no type that contains only a Name and Address. We could still write this query, however, by introducing that type:

class CustomerTuple
{
    public string Name;
    public string Address;

    public CustomerTuple(string name, string address)
    {
        this.Name = name;
        this.Address = address;
    }
}

We could then use that type, here CustomerTuple, to construct the result of our query:

IEnumerable<CustomerTuple> locals = 
    customers.Where(c => c.ZipCode == 98112)
             .Select(c => new CustomerTuple(c.Name, c.Address));

That sure seems like a lot of boilerplate code to project out a subset of the fields. It’s also often unclear what to name such a type. Is CustomerTuple really a good name? What if we had projected out Name and Age instead? That could also be a CustomerTuple. So, the problems are that we have boilerplate code and it doesn’t seem that there are any good names for the types that we create. Plus, there could also be many different types required, and managing them could quickly become a headache.

This is exactly what anonymous types are for. This feature basically allows the creation of structural types without specifying the name. If we rewrite the query above using anonymous types, here’s what it looks like:

locals = customers.Where(c => c.ZipCode == 98112)
                  .Select(c => new { c.Name, c.Address });

This code implicitly creates a type that has the fields Name and Address:

class 
{
    public string Name;
    public string Address;
}

This type can’t be referenced by name, since it has none. The names of the fields can be explicitly declared in the anonymous type creation. For example, if the field being created is derived from a complicated expression, or the name simply isn’t desirable, it’s possible to change the name:

locals = customers.Where(c => c.ZipCode == 98112)
    .Select(c => new { FullName = c.FirstName + " " + c.LastName, 
                       HomeAddress = c.Address });

In this case, the type that is generated has fields named FullName and HomeAddress.

This gets us closer to the ideal, but there is a problem. You’ll notice that I strategically omitted the type of locals in any place where I used an anonymous type. Obviously we can’t state the name of anonymous types, so how do we use them?

Implicitly Typed Local Variables

There’s another language feature known as implicitly typed local variables (or var for short) that instructs the compiler to infer the type of a local variable. For example:

var integer = 1;

In this case, integer has the type int. It’s important to understand that this is still strongly typed. In a dynamic language, integer’s type could change later. To illustrate this, the following code does not compile:

var integer = 1;
integer = "hello";

The C# compiler will report an error on the second line, stating that it can’t implicitly convert a string to an int.

In the case of the query above, we can now write the full assignment as shown here:

var locals =
   customers
       .Where(c => c.ZipCode == 98112)
       .Select(c => new { FullName = c.FirstName + " " + c.LastName, 
                          HomeAddress = c.Address });

The type of locals ends up being IEnumerable<?> where "?" is the name of a type that can’t be written (since it is anonymous).
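Even though the element type can’t be written down, its members remain available and are checked at compile time. The sketch below illustrates this under a few assumptions: it uses the Where and Select operators from System.Linq as they eventually shipped rather than the EnumerableExtensions class used earlier, and the Customer class and sample data are hypothetical. In the released compiler, the members of an anonymous type surface as read-only properties.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical Customer type for illustration only
public class Customer
{
    public string FirstName;
    public string LastName;
    public string Address;
    public int ZipCode;

    public Customer(string first, string last, string address, int zipCode)
    {
        FirstName = first;
        LastName = last;
        Address = address;
        ZipCode = zipCode;
    }
}

public static class AnonymousTypeDemo
{
    public static List<string> Describe()
    {
        List<Customer> customers = new List<Customer>();
        customers.Add(new Customer("Ada", "Lovelace", "12 Analytical Way", 98112));
        customers.Add(new Customer("Grace", "Hopper", "1 Compiler Court", 10001));

        // locals is a sequence of an anonymous type with FullName and HomeAddress
        var locals = customers
            .Where(c => c.ZipCode == 98112)
            .Select(c => new { FullName = c.FirstName + " " + c.LastName,
                               HomeAddress = c.Address });

        List<string> lines = new List<string>();
        foreach (var local in locals)
        {
            // FullName and HomeAddress are resolved at compile time
            lines.Add(local.FullName + ", " + local.HomeAddress);
        }
        return lines;
    }

    public static void Main()
    {
        foreach (string line in Describe())
        {
            Console.WriteLine(line);  // Ada Lovelace, 12 Analytical Way
        }
    }
}
```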

Implicitly typed locals are just that: local within a method. It is not possible for them to escape the boundaries of a method, property, indexer, or other block because the type cannot be explicitly stated, and "var" is not legal for fields or parameter types.

Implicitly typed locals turn out to be convenient outside of the context of a query. For example, they help simplify complicated generic instantiations:

var customerListLookup = new Dictionary<string, List<Customer>>();

We’re now in a good place with our query; we’re close to the ideal syntax and we’ve gotten there with general-purpose language features.

Interestingly, we found that as more people worked with this syntax, there was often a need to allow a projection to escape the boundaries of a method. As we saw earlier, this is possible by constructing an object by calling its constructor from within Select. However, what happens if there is no constructor that takes exactly the values you need to set?

Object Initializers

For this case, there is a C# language feature in the upcoming "Orcas" version known as object initializers. Object initializers basically allow the assignment of multiple properties or fields in a single expression. For example, a common pattern for object creation is:

Customer customer = new Customer();
customer.Name = "Roger";
customer.Address = "1 Wilco Way";

In this case, there is no constructor of Customer that takes a name and address; however, there are two properties, Name and Address, that can be set once an instance is created. Object initializers allow the same creation with the following syntax:

Customer customer = new Customer() 
    { Name = "Roger", Address = "1 Wilco Way" };

In our earlier CustomerTuple example, we created the CustomerTuple class by calling its constructor. We can achieve the same result via object initializers:

var locals = 
    customers
        .Where(c => c.ZipCode == 98112)
        .Select(c => 
             new CustomerTuple { Name = c.Name, Address = c.Address });

Notice that object initializers allow the parentheses of the constructor to be omitted. In addition, both fields and settable properties can be assigned within the body of the object initializer.
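The following sketch shows an object initializer assigning both a field and a settable property in one expression. The shape of Customer here (one public field, one property) is an assumption chosen to exercise both cases:

```csharp
using System;

// Hypothetical Customer with no constructor that takes a name and address
public class Customer
{
    public string Name;              // a public field

    private string address;
    public string Address            // a settable property
    {
        get { return address; }
        set { address = value; }
    }
}

public static class InitializerDemo
{
    public static Customer Create()
    {
        // Roughly equivalent to calling new Customer() and then
        // assigning Name and Address on the new instance
        return new Customer { Name = "Roger", Address = "1 Wilco Way" };
    }

    public static void Main()
    {
        Customer customer = Create();
        Console.WriteLine(customer.Name);     // Roger
        Console.WriteLine(customer.Address);  // 1 Wilco Way
    }
}
```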

We now have a succinct syntax for creating queries in C#. Moreover, we have an extensible way to add new operators (Distinct, OrderBy, Sum, and so on) through extension methods, and a distinct set of language features that are useful in their own right.

The language design team now had several prototypes to get feedback on. So we organized a usability study with many participants who had experience with both C# and SQL. The feedback was almost universally positive, but it was clear there was something missing. In particular, it was difficult for the developers to apply their knowledge of SQL because the syntax we thought was ideal didn’t map very well to their domain expertise.

Query Expressions

The language design team then designed a syntax that is closer to SQL, known as query expressions. For example, a query expression for our example might look like this:

var locals = from c in customers
             where c.ZipCode == 98112
             select new { FullName = c.FirstName + " " +
                          c.LastName, HomeAddress = c.Address };

Query expressions are built on the language features described above. They are literally syntactically translated into the underlying syntax that we’ve already seen. For example, the query above is translated directly into:

var locals =
   customers
       .Where(c => c.ZipCode == 98112)
       .Select(c => new { FullName = c.FirstName + " " + c.LastName, 
                          HomeAddress = c.Address });
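One way to convince yourself that the two forms are interchangeable is to run both against the same data and compare the results. This sketch assumes the operators from System.Linq as they eventually shipped, plus a hypothetical Customer class and sample data. It relies on two anonymous types with the same member names and types, in the same order, being the same type within an assembly, and on anonymous types comparing by value.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical Customer type for illustration only
public class Customer
{
    public string Name;
    public string Address;
    public int ZipCode;

    public Customer(string name, string address, int zipCode)
    {
        Name = name;
        Address = address;
        ZipCode = zipCode;
    }
}

public static class TranslationDemo
{
    public static bool SameResults()
    {
        List<Customer> customers = new List<Customer>();
        customers.Add(new Customer("Ada", "12 Analytical Way", 98112));
        customers.Add(new Customer("Grace", "1 Compiler Court", 10001));
        customers.Add(new Customer("Linus", "3 Kernel Road", 98112));

        // Query expression form
        var viaQuery = from c in customers
                       where c.ZipCode == 98112
                       select new { c.Name, c.Address };

        // Method call form that the query expression is translated into
        var viaMethods = customers
            .Where(c => c.ZipCode == 98112)
            .Select(c => new { c.Name, c.Address });

        // Both sequences share one anonymous element type, compared by value
        return viaQuery.SequenceEqual(viaMethods);
    }

    public static void Main()
    {
        Console.WriteLine(SameResults());  // True
    }
}
```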

Query expressions support a number of different "clauses," such as from, where, select, orderby, group by, let, and join. These clauses translate into the equivalent operator calls, which, in turn, are implemented via extension methods. The tight relationship between the query clauses and the extension methods that implement the operators makes it easy to combine them if the query syntax doesn’t support a clause for a needed operator. For example:

var locals = (from c in customers
              where c.ZipCode == 98112
              select new { FullName = c.FirstName + " " +
                           c.LastName, HomeAddress = c.Address })
             .Count();

In this case the query now returns the number of customers who live in the 98112 ZIP Code area.
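The same pattern works for any operator that has no query clause of its own. As a further sketch, the example below chains Distinct and Count onto a query expression; it assumes the System.Linq operators as they eventually shipped, and the Customer class and sample data are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical Customer type for illustration only
public class Customer
{
    public string Name;
    public int ZipCode;

    public Customer(string name, int zipCode)
    {
        Name = name;
        ZipCode = zipCode;
    }
}

public static class MixedSyntaxDemo
{
    public static int DistinctLocalNames()
    {
        List<Customer> customers = new List<Customer>();
        customers.Add(new Customer("Ada", 98112));
        customers.Add(new Customer("Ada", 98112));    // duplicate name
        customers.Add(new Customer("Grace", 98112));
        customers.Add(new Customer("Linus", 10001));  // filtered out

        // The clauses handle filtering and projection; Distinct and Count,
        // which have no clause, are chained on as ordinary method calls
        return (from c in customers
                where c.ZipCode == 98112
                select c.Name)
               .Distinct()
               .Count();
    }

    public static void Main()
    {
        Console.WriteLine(DistinctLocalNames());  // 2
    }
}
```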

And with that, we’ve managed to end just about where we started (which I always find rather satisfying). The syntax of the next version of C# evolved over the past few years through several new language features to ultimately arrive very close to the original syntax proposed in the winter of 2004. The addition of query expressions builds on the foundations provided by the other language features in the upcoming version of C# and makes many query scenarios easier to read and understand for developers with a background in SQL.

Anson Horton has been a Program Manager at Microsoft for almost six years. He has been working on the C# team since its creation, and was on the C++ team prior to that. He has been involved in the design of the C# language and compiler, the C# project system, the C# IDE (IntelliSense), and the C# Expression evaluator and debugger. Anson maintains a blog at blogs.msdn.com/ansonh which he updates as infrequently as possible.