Basic Instincts

Query Comprehensions

Scott Wisniewski

Code download available at:  BasicInstincts2007_08.exe(165 KB)

Contents

Basics of Query Translation
Delayed Evaluation
Query Operators
Expression Trees
Sample LINQ Provider

In the June 2007 issue of MSDN® Magazine, Kit George and Ting Liang provided an introduction to some of the new features being introduced in Visual Basic® 9.0 to support Language Integrated Query, or LINQ. In particular, they showed how LINQ is a universal data platform designed to revolutionize the way data-intensive applications are written.

In this column I am going to expand on what Kit and Ting wrote by providing some background on the key concepts that drove the design of LINQ. To accompany this column, I’ve provided a custom query provider that showcases how this information can be put to use in a real-world applications; you can download it from the MSDN Magazine Web site. The material outlined here is far from introductory and assumes a working knowledge of database programming and a passing familiarity with LINQ. For more information on this subject, see the "Resources" sidebar.

Basics of Query Translation

LINQ brings both simplicity and flexibility to query translation. It is simple because the translation used by the compiler is straightforward and easy to understand. It is flexible because third-party developers can easily plug in their own behaviors. To get a feel for how this all works, let’s consider a brief example:

Imports System.Linq

Module SimpleExample
    Sub Main()
        Dim names() As String = {“Fred”, “George”, “Scott”, “Steve”}
        Dim letter = “S”
        Dim q = From name In names _
                Where name(0) = letter _
                Select name
    End Sub
End Module

Here I define a simple query that selects all strings that start with the letter S from an array. The compiler evaluates this by first splitting the query up into its component parts, then taking the clauses embedded in each part of the query and converting them into inline functions. It then spits everything back out as a series of chained method calls off of the object being queried. The final result should look like this:

Dim Clause1 as Func(of String, Boolean) _
    = Function (name) name(0) = letter
Dim Clause2 as Func(of String, String) _
    = Function (name) name

Dim q = names.Where(Clause1).Select(Clause2) 

The Microsoft® .NET Framework 3.5 includes default implementations of the standard query operators. However, a library author or component vendor can provide his own implementations if he desires different or augmented functionality. For example, to modify the behavior of this specific query, a library author or component vendor need only write two methods: one named Where that is responsible for filtering the collection it was provided and another named Select that performs a projection or mutation of the objects in its collection. Similarly, if a library author wanted to support more advanced queries, such as those involving Group By, Order By, or Join clauses, she would simply need to define extra methods corresponding to each operator she wished to enable.

The Clause1 and Clause2 parameters passed to the Where and Select methods are delegate values created using inline functions. Inline functions are a new feature introduced in Visual Basic 9.0 that enable delegates to be constructed on the fly by simply writing expressions. In addition to offering convenient syntax, they also enable the constructed delegates to access the symbols (including local variables) visible in the scope in which they were created. This makes them suitable for use in translating queries, which naturally need to have access to the variables declared around them.

In the Where example, the provided expression compares the values in the collection being processed against the local variable letter. Without inline functions and their ability to capture the values of their containing scopes, implementing this type of functionality would be much more difficult. Because of inline functions, the rules used to translate queries into methods remain relatively simple. The compiler simply needs to gather all the expressions embedded in a query, convert them into inline functions, and then pass them off as delegate arguments to the appropriate query operator methods.

Delayed Evaluation

One of the most fundamental concepts underlying LINQ is the idea of delayed evaluation. In particular, the execution of queries in LINQ is significantly different from the execution of queries in traditional database management systems because database systems tend to have a fundamental understanding of the data they query over. LINQ has no such understanding. Instead, it simply plays dumb, converting each query into a series of chained method calls. The implementations of particular query operators are then responsible for intelligently and efficiently executing the query. Database systems, on the other hand, utilize significant knowledge about both the structures of queries and the nature of the data they are querying over (such as index availability and distribution statistics). Consequently, they take a holistic approach to query evaluation rather than the compositional approach used by LINQ.

When a DBMS sees a query, it always sees the entire query and, therefore, executes it immediately. LINQ sees queries in pieces, processing one operator at a time. If LINQ were to take the same approach as a DBMS and fully execute each piece as it is processed, the resulting performance would be rather poor. To see why this is true, consider the following example:

Dim q = From x In Xs _
        Where x.Value > 10 _
        Select x * 3.14

Here I simply take all the items in an array with a value greater than 10 and multiply them by pi. When translated by the compiler, this code looks something like the following:

Dim q = xs.Where(Function(x) x.Value > 10).Select( _
    Function (x) x  * 3.14)

If these query operator methods were implemented natively and completely executed the desired operation on their input collection, computing the whole query would require two passes over the original collection: one to filter it with the Where clause and another to project out the desired multiplication in the Select clause. This would be inefficient in both time and space as it would process its input twice (when clearly the combined operation of filtering and projection can be done in one pass). It would also need to store a copy of the collection in order to contain the results of the intermediate operation. Obviously this is undesirable.

To get around this, all built-in LINQ providers utilize a concept known as delayed execution. Rather than having query operators execute immediately, they all simply return types that implement the IEnumerable(of T) interface. These types then delay execution until the query is actually used in a for each loop. In this way many of the performance gains available in the holistic approach used by database systems are also available in LINQ. Effectively, each operator simply injects information indicating its presence into a runtime representation of the query. When iterated over, these queries are then executed as a single unit.

In the case of LINQ to SQL, this means that the query provider can delay generation of SQL code until the entire query has been processed, allowing it to integrate well with the holistic approach used by DBMSs. In the case of LINQ to Objects, this means that the underlying collections are processed a minimal number of times and that costly intermediate storage is only allocated when absolutely necessary.

Query Operators

The third and most important stop in this incursion into LINQ is a study of the query operators that make it up. Altogether, LINQ defines 48 distinct query operators that are collectively referred to as the standard query operators. I don’t, however, explore all of them here. Many of them are designed to simply be called as methods defined on collection objects. Others either give little insight into the way LINQ works or are very similar to other operators. Also, due to space constraints, I can’t give in-depth explanations of them all. As a result, let’s focus attention on the most interesting of the operators for which the Visual Basic compiler defines query comprehensions.

Select The Select clause is used to project items out of a collection. For example, it can be used to select a subset of the columns in a database table or to transform the elements of a collection. It is implemented by a method named Select and must be defined by every query provider. This is because all queries end in an implicit Select clause even if one is not explicitly provided in code. Consequently, the Visual Basic compiler will not allow a type to be queried if it does not define one. It should be defined using a signature similar to this:

Public Sub [Select](of OutputElementType)( _
    selector as Func(of InputElementType, OutputElementType)) _
        as IEnumerable(of OutputElementType)
End Sub

The selector parameter must be a delegate that maps from the source collection’s element type to the out collection’s element type. In general, the method should be generic and selector should be parameterized on its return type. If it is not, then any query features that rely on anonymous types will be disabled. This includes multiple element Select lists, Join clauses, and queries with multiple From clauses.

SelectMany The SelectMany operator is used to implement cross joins. Every query containing multiple From clauses will result in one or more calls to the SelectMany method. As an example, consider the following queries:

Dim q0 = From x In xs, y In ys
Dim q1 = From x In xs From y In ys

Here we define two queries that perform a cross join between an inner collection xs and an outer collection ys. When compiled they will look something like this:

q0 = xs.SelectMany(Function(ByVal x) ys, _
    Function(ByVal x, ByVal y) _
        New With {.x = x, .y = y}).Select(Function(ByVal it) it)

q1 = xs.SelectMany(Function(ByVal x) ys, _
    Function(ByVal x, ByVal y) _
        New With {.x = x, .y = y}).Select(Function(ByVal it) it)

A sample definition of the SelectMany method is shown here:

Public Function SelectMany(of InnerElementType, ResultType) _
    (collectionSelector as Func(of OuterElementType, _
        IEnumerable(of InnerElementType)), _
     resultSelector as Func(of InnerElementtype, OuterElementType, _
        ResultType)) as IEnumerable(of ResultType)
End Function

SelectMany takes two delegates as parameters and returns a single queryable collection as a result. The first delegate parameter, collectionSelector, is used to calculate an inner collection given an element of the outer collection. The second delegate parameter, resultSelector, is used to combine an element from the outer collection with an element from the inner collection to produce an element in the final resultset. The method should also be generic and parameterized on both the inner collection element type and the element type of the resultset.

Because the method takes in a collection selector (rather than a collection) as an argument, it is possible to do several interesting things with the From clause in Visual Basic. One particularly useful technique is to join a collection of objects with child collections of its elements. This can be used to "flatten" an object graph into a relational representation suitable for storage in a database, or for interoperability with legacy systems. For example, the query shown here will take a collection of Customer objects with embedded Order collections and convert it into a flat representation that looks similar to the results of an outer join between disparate Customers and Orders collections.

Dim q = From c In customers, o In c.Orders _
        Select c, o

Where The Where operator is used to filter a collection based on a logical condition. Its signature should look something like the following:

Public Function Where( _
    ByVal predicate As Func(Of ElementType, Boolean)) _
        As IEnumerable(Of ElementType)
End Function

Join The Join operator is used to perform an inner join between two collections. A sample usage of it is shown here:

Dim q = From usCust In usCustomers _
        Join euCust In europeCustomers On _
            usCust.CompanyName = euCust.CompanyName 
        Select euCust.CompanyName

In this case, I simply join together a set of United States Customers with a set of European Customers where both have identical company names.

In general, Join is similar to the SQL inner join operator, but is slightly more restrictive with the set of expressions that may appear inside an On clause. In particular, LINQ only allows equality comparisons to appear in the Join clause, and requires all compound logical operations to be connected using the AND operator. The equality comparisons must utilize the new Equals operator, each side of the operator must reference only one of the two collections involved in the Join, and each side must reference a different one of the two collections. For example, the On clauses shown here are all legal Visual Basic syntax:

On c.CustomerId Equals o.CustomerID 

On usCustomer.LastName Equals euCustomer.LastName and _
   usComster.FirstName Equals euCustomer.FirstName

Whereas the On clauses shown here are not legal Visual Basic:

On c.CustomerID = o.CustomerID

On usCustomer.CompanyName Equals euCustomer.ComapnyName or _
    usCustomer.FirstName Equals euCustomer.FirstName

On usCustomer.CompanyName <> euCustomer.CompanyName and _
    usCustomer.LastName = euCustomer.LastName

On usCustomer.LastName & “, “ & euCustomer.FirstName Equals _
    euCustomer.LastName & “, “ & usCustomer.FirstName

The operator itself is implemented by a method called Join with a signature similar to this:

Public Function Join(Of InnerElementType, KeyType, ResultType)( _
    ByVal inner As IEnumerable(Of InnerElementType), _
    ByVal outeKeySelector As Func(Of ElementType, KeyType), _
    ByVal innerKeySelector As Func(Of InnerElementType, KeyType), _
    ByVal resultsSelector As Func( _
        Of ElementType, InnerElementType, ResultType)) _
    As IEnumerable(Of ResultType)
End Function

The inner parameter represents the inner collection with which to join the current (or outer) collection. The outerKeySelector and innerKeySelector parameters are delegates used to extract the join key from the outer and inner collections, respectively. The resultSelector is used to combine an element from the outer collection and an element from the inner collection to form an element in the resultset.

The compiler is able to determine both the type of the join key and the expressions used to build the key selector arguments by examining the Equals operator used in a join’s On clause. In particular, the side of an Equals operator used to reference the inner collection is associated with the inner key selector, and the side used to reference the outer collection is associated with the outer key selector. When multiple Equals operators are connected using AND operators, the compiler will combine the selector expressions for each Equals instance into a single anonymous type constructor. For example, say you execute this On clause:

On usCompany.CompanyName Equals euCompany.CompanyName and _
   usCompany.StockSymbol Eqauls euCompany.StockSymbol

The clause will result in the key selectors shown here:

Dim innerKeySelector = Function (usCompany) _
    New With {.CompanyName = usCompany.CompanyName, _
              .StockSymbol = usCompany.StockSymbol} 

Dim outerKeySelector = Function (euCompany) _
    New With {.CompanyName = euCompany, _
              .StockSymbol = euCompany.StockSymbol}

The implementation of the Join operator will then use these key selectors to uniquely identify the elements of each collection and to associate related elements with each other.

Expression Trees

So far, all of the query operator implementations shown in this column have operated using methods that take delegates as arguments. The delegates are then used to provide executable representations of the expressions they were generated from. For in-memory collections of Objects, this is sufficient. For example, when walking over a source collection, a Where implementation can simply invoke its supplied delegate on each element it encounters. For query providers that need to efficiently execute queries on remote database servers, such as LINQ to SQL, a delegate representation of a query clause expression is not sufficient, however. In particular, in order to generate remotely executable SQL code, a LINQ provider needs to have a program-readable representation of the clauses actually provided by the user. Without them, implementing LINQ to SQL efficiently would be largely impossible.

Fortunately, LINQ has support for this in the form of expression trees. In addition to allowing query operator implementations to take in delegates as parameters, the Visual Basic compiler also allows them to declare parameters as types derived from Expression(of T), where T corresponds to some delegate type. For these parameters the compiler converts the inline functions it generates into trees that describe the actual expression typed by the user. By reading these trees, and appropriately translating them to SQL code, LINQ providers are then able to marshal queries written in Visual Basic over to a database server where they can be executed efficiently. Figure 1 shows a diagram representing the layout of the expression tree generated for the following inline function:

Figure 1 Expression Tree for a Simple Inline Function

Figure 1** Expression Tree for a Simple Inline Function **(Click the image for a larger view)

Dim e As Expression(Of Func(Of Integer, Boolean)) = _
    Function(ByVal i) i > 5 AndAlso i Mod 2 = 1

And here is the code generated by the compiler for it:

Dim param = Expression.Parameter(GetType(Integer), “i”)
Dim body = Expression.AndAlso( _
    Expression.GreaterThan(param, Expression.Constant(5)), _
    Expression.Equal( _
        Expression.Modulo(param, Expression.Constant(2)), _
        Expression.Constant(1)))
Dim e = expression.Lambda(body, param)

Sample LINQ Provider

Walking through the actual implementation of a sample LINQ provider is outside the scope of this column. Fortunately, you can download a working sample from the MSDN Magazine Web site. The sample defines several custom query providers that enable LINQ query execution to be measured using PerfMon. In addition to augmenting the information presented here, the sample is also a very useful tool for optimizing the performance of your LINQ-enabled applications.

Resources

Send your questions and comments to instinct@microsoft.com.

Scott Wisniewski is a Software Design Engineer at Microsoft, where he works on the Visual Basic compiler. For the upcoming Visual Studio "Orcas" release, he worked on several features, including nullable types, error correction, and extension methods. You can contact Scott at scottwis@microsoft.com or through the Visual Basic Team blog.