Scaling ADO.NET DataTables
Many data access strategies are available in the .NET development space—ADO.NET Entity Framework, ASP.NET Dynamic Data, XML, Hibernate, LINQ to SQL, and ObjectDataSource, to name a few. However, in many cases ADO.NET DataTables are still in use. This may be because your current architecture is based on DataTables or because they are simple to use. The business may require you to stay on .NET 1.x. You may simply have not had the time to explore these other data access approaches. Whatever the reason, it’s useful to know how to maximize the performance of the DataTable. We’ll look at two common scenarios: querying data and aggregating data.
Before I discuss how to manipulate data in memory, and the associated benchmarks, let me state that, in most situations, it makes sense to query or aggregate data on the database. There are a few reasons for this. First, the database engine is highly optimized to perform these sorts of functions; we’ll see shortly that ADO.NET DataTables really aren’t. Second, the less data you move across the wire, the better; performing data operations on the database means that you only have to transfer the results, not the raw data. Third, if you’re caching large amounts of data on the local machine, you run the risk of having stale data displayed to your users, leaving you to deal with the possibility that multiple users may be working with out-of-sync versions of the same data.
With that caveat out of the way, let’s talk about one scenario where it makes sense for potentially large data tables to be cached locally. You could well have a number of static tables that exist almost solely to provide data normalization—what’s often referred to as a “lookup table.” I deal daily with lookup tables that contain multiple sets of data, identified by year. Since users can work with data from past years, we potentially need to present lookup data from any year, at any time. The ramification of this scenario is that long-term customers often have significantly more data in lookup tables than new customers. To make matters worse, due to the nature of the data, larger clients tend to have more entries in their lookup tables for each year. As you can probably guess, this has led to some performance issues, since the ADO.NET DataTable doesn’t scale very well when used naively. In our development environment, that lookup table had a handful of records; in testing with a realistic data set, there were almost 90,000 records in the same table. Performance degraded significantly with the larger set of records.
Why did performance problems exist? One cause was the frequent use of DataTable.Select to return a single record from the DataTable. Select takes arbitrary criteria and returns an array of DataRows. Essentially, DataTable.Select has to walk the entire table and compare every record to the criteria that you passed in.
DataTable.Rows.Find, on the other hand, returns only a single row. Essentially, when you specify the primary key, a binary tree is created. This has some overhead associated with it, but tremendously speeds up the retrieval.
How much overhead? Well, that depends on a couple of factors. The first is the number of rows in the table. The second factor is whether you create the index before or after you add the data. As the following graph shows, there is significant overhead associated with creating the index before adding the data. Examining the data shows a 30%-70% difference in the time taken to create and index the tables.
You can perform this same benchmark using the “Index Creation” button in the benchmarking UI. (The source code used in producing the benchmarks is available on the MSDN Code Gallery.)
From the graph above, you can see that it takes roughly five seconds to index a table of 100,000 records. To Index a DataTable, you’ll need to specify an array of DataColumns that form the unique key for records in the DataTable; for example,
Knowing that there are performance implications to creating these indexes, I’d recommend spinning the indexing task off to a background thread and doing this asynchronously. (Note that DataTables aren’t inherently thread-safe, so you need to be a little careful doing this. BackgroundWorker is a nice option, since the DataTable won’t be available to the main application thread until the table creation, indexing and population is complete.) The question becomes, is a five-second overhead worth the benefit? Let’s take a look at the differences in performance of various methods of querying a DataTable.
Data Retrieval Performance
Before I discuss the results of the benchmarks, let me talk a little bit about how the benchmarks are set up. The following is a pretty typical block of code:
Private Sub RunIndexCreationTest() ' make sure all libraries are loaded TestIndexCreation(1) Dim results As New List(Of IndexCreationResult) For Each arraySize As Integer In arraySizes results.Add(TestIndexCreation(arraySize)) Next CsvFileOutput(Of IndexCreationResult).Output("filename”, results) End Sub
In all of the benchmarks, the first thing we do is call the test and execute it for a table with a single element. This forces the required libraries to load, and keeps the first test sequence from reporting an unusually high execution time. We then execute the test for table sizes from 10 to 1,000,000 records. Within each benchmark, the operations are repeated several times, with the results being averaged. This should help smooth out any inconsistencies due to transient conditions. The results are then written to a CSV file, located in the executing directory (typically ..\bin\Debug).
The LINQ test includes an extra line of code to force the execution of the query. Since LINQ uses deferred execution, without this line of code, the LINQ results will always be instantaneous...and won’t really return a result. I also executed this LINQ query against both the indexed and non-indexed tables. Here’s a code sample to demonstrate creating a LINQ query against a dataset.
Private Function TestLinqSearch(ByVal dt As DataTable, ByVal key As String) As Integer Dim sw As New Stopwatch sw.Start() Dim result = From r In dt Where r(0) = key Select r(1) Dim output As String = result(0).ToString sw.Stop() Console.WriteLine(output) Return sw.ElapsedTicks End Function
The line above that begins “
Dim result” creates a LINQ query.
However, due to deferred execution, that query isn’t evaluated until the line
that begins “Dim output”. This second call is required to get an accurate
portrayal of the performance.
Wherever possible, I’ve removed extraneous work, such as string concatenation and resolving the reference to a column when referred to by name, from the benchmarked code. While this may not always reflect the way DataTables are used in the real world, I’ve deliberately done this to optimize the DataTable as much as possible. Believe me, it needs the help! In all fairness, however, I’ve applied the same optimizations to the other approaches.
Finally, all of the results for this benchmark are in ticks, not milliseconds. The reason for that is simple: the results of the searches for indexed tables and dictionaries are too fast to have values in the millisecond range.
Also of note is that indexing the table provides a minor advantage to the LINQ query. This was somewhat surprising to me, and the effects of indexing the table when filtering on multiple columns will certainly bear further exploration.
Third, given the scenario of finding exactly one record from a set, transforming your DataTable to a Dictionary is clearly the fastest and most scalable alternative. In a scenario where more than one record needs to be retrieved, LINQ should be seriously considered. Clearly, Microsoft has done a great deal of work on optimizing LINQ, even against large sets of data.
Since every option other than DataTable.Select scales well with large data sets, any of them are a better option. Scour DataTable.Select from your code base!
Indexed Table Search
Now, let’s talk about the results of the benchmark; I’ve included both the raw results (above) and the graph (below). The columns are ordered by the duration of each operation. Since I’m proposing a Dictionary as an alternative to a DataTable, I’ve also included the cost of creating a Dictionary(Of String, String) to hold the same data as the DataTable. You’ll note that it is slightly less expensive to create and populate the Dictionary than it is to perform a single DataTable.Select against the data. As long as you’re querying the data more than once, creating a Dictionary actually proves to be a winning proposition. (As a side note, there’s some variation with Dictionary based on the key type. This isn’t really germane to the conversation at hand, but I include in for completeness.)
However, since with either a Dictionary or an Indexed Table, you have a significant amount of overhead associated with the refactoring of the data into an efficient data structure, LINQ is a much better option from a performance perspective. In addition, the code is much simpler, more concise, and intuitively readable for anyone familiar with database operations.
So far, the score is LINQ to DataSets 1, DataTable methods 0. Let’s see if the venerable ADO.NET DataTable can redeem itself in part two of this matchup: aggregation. We’ll use the same basic DataTable that we’ve been using so far: a non-indexed DataTable with two columns and N rows. We’ll perform three basic aggregations: MIN, MAX and AVG. Syntactically, these look like the following:
Dim linqMinResult = (From r In dt Select r.Field(Of Int32)("Value")).Min Dim linqMaxResult = (From r In dt Select r.Field(Of Int32)("Value")).Max Dim linqAvgResult = (From r In dt Select r.Field(Of Int32)("Value")).Average Dim computeMinResult = dt.Compute("MIN(Value)", "") Dim computeMaxResult = dt.Compute("MAX(Value)", "") Dim computeAvgResult = dt.Compute("AVG(Value)", "")
This is probably the appropriate time for me to issue a disclaimer. Since I’m not an expert on LINQ, the first six LINQ queries in this article were built using the code found on MSDN. However, not all LINQ queries are created equal. A colleague was kind enough to point out the following variant of the MAX aggregate query to demonstrate this fact:
I’ve included the results of this query in the chart for the MAX aggregation, labeled as LINQ MAX (slow). You’ll notice that, although the queries are semantically equivalent, the second query runs about three times slower than the previous LINQ query. The difference in performance has to do with the fact that the fast version uses an overload of MAX that doesn’t require any parameter. (The rule of thumb here, when using LINQ to DataSets, is that you’re better off using a LINQ query that explicitly uses Select.) However, although there is a performance difference, the queries scale similarly.
The winner in this case really isn’t as clear-cut. It’s hard to see from the graphs, but LINQ tends to run slightly faster for MIN and slightly slower for both MAX and AVG. After seeing the huge performance improvement for LINQ when doing searches against a DataTable, I expected to see similar scalability enhancements with aggregation. Looks like we get to wait for Parallel LINQ (PLINQ; part of .NET 4.0) to see that.
At the end of the day, however, if you’re using DataTables,
the Select method is one of the worst things you could possibly do to limit the
scalability of your application. We’ve discussed several alternatives here,
including LINQ, indexing tables to make use of
Rows.Find, and morphing the DataTable
into a generic collection. Any of these options will be orders of magnitude
faster than Select for a set of records of reasonable size. You also want to
consider paring your set of records down as much as possible to speed up both
aggregation and selection of records.
Note: all benchmarks in this article were performed using a 2.8GHz dual-core laptop with 4 GB RAM, running 32-bit Vista. Your mileage may vary.
About the Author
Jeff Certain is a software architect at Colorado CustomWare, a Fort Collins, CO company specializing in CAMA software. Coming from a hardware engineering background, he is particularly interested in code optimization and multi-threading of applications. Jeff is a member of the IEEE, President of the Northern Colorado .NET User Group, a co-founder of the Northern Colorado Software Architects group, and a Microsoft Visual Basic MVP.