By Jeff Certain
Colorado CustomWare
Download the code for this article
Many data access strategies are available in the .NET
development space—ADO.NET Entity Framework, ASP.NET Dynamic Data, XML,
Hibernate, LINQ to SQL, and ObjectDataSource, to name a few. However, in many
cases ADO.NET DataTables are still in use. This may be because your current
architecture is based on DataTables or because they are simple to use. The
business may require you to stay on .NET 1.x. You may simply have not had the
time to explore these other data access approaches. Whatever the reason, it’s
useful to know how to maximize the performance of the DataTable. We’ll look at
two common scenarios: querying data and aggregating data.
Querying Data
Before I discuss how to manipulate data in memory, and the
associated benchmarks, let me state that, in most situations, it makes sense to
query or aggregate data on the database. There are a few reasons for this.
First, the database engine is highly optimized to perform these sorts of
functions; we’ll see shortly that ADO.NET DataTables really aren’t. Second, the
less data you move across the wire, the better; performing data operations on
the database means that you only have to transfer the results, not the raw
data. Third, if you’re caching large amounts of data on the local machine, you
run the risk of having stale data displayed to your users, leaving you to deal
with the possibility that multiple users may be working with out-of-sync
versions of the same data.
With that caveat out of the way, let’s talk about one
scenario where it makes sense for potentially large data tables to be cached
locally. You could well have a number of static tables that exist almost solely
to provide data normalization—what’s often referred to as a “lookup table.” I
deal daily with lookup tables that contain multiple sets of data, identified by
year. Since users can work with data from past years, we potentially need to
present lookup data from any year, at any time. The ramification of this
scenario is that long-term customers often have significantly more data in
lookup tables than new customers. To make matters worse, due to the nature of
the data, larger clients tend to have more entries in their lookup tables for
each year. As you can probably guess, this has led to some performance issues,
since the ADO.NET DataTable doesn’t scale very well when used naively. In our
development environment, that lookup table had a handful of records; in testing
with a realistic data set, there were almost 90,000 records in the same table.
Performance degraded significantly with the larger set of records.
Why did performance problems exist? One cause was the
frequent use of DataTable.Select to return a single record from the DataTable.
Select takes arbitrary criteria and returns an array of DataRows. Essentially,
DataTable.Select has to walk the entire table and compare every record to the
criteria that you passed in.
DataTable.Rows.Find, on the other hand, returns only a
single row. Essentially, when you specify the primary key, a binary tree is
created. This has some overhead associated with it, but tremendously speeds up
the retrieval.
How much overhead? Well, that depends on a couple of
factors. The first is the number of rows in the table. The second factor is
whether you create the index before or after you add the data. As the following
graph shows, there is significant overhead associated with creating the index
before adding the data. Examining the data shows a 30%-70% difference in the
time taken to create and index the tables.
.jpg)
You can perform this same benchmark using the “Index
Creation” button in the benchmarking UI. (The source code used in producing the
benchmarks is available on the MSDN Code Gallery.)
From the graph above, you can see that it takes roughly five
seconds to index a table of 100,000 records. To Index a DataTable, you’ll need
to specify an array of DataColumns that form the unique key for records in the
DataTable; for example,
dtIndexed.PrimaryKey = New DataColumn() {dtIndexed.Columns("Key")}
Knowing that there are performance implications to creating
these indexes, I’d recommend spinning the indexing task off to a background
thread and doing this asynchronously. (Note that DataTables aren’t inherently
thread-safe, so you need to be a little careful doing this. BackgroundWorker is
a nice option, since the DataTable won’t be available to the main application
thread until the table creation, indexing and population is complete.) The
question becomes, is a five-second overhead worth the benefit? Let’s take a
look at the differences in performance of various methods of querying a
DataTable.
Data Retrieval Performance
Before I discuss the results of the benchmarks, let me talk
a little bit about how the benchmarks are set up. The following is a pretty
typical block of code:
Private Sub RunIndexCreationTest()
' make sure all libraries are loaded
TestIndexCreation(1)
Dim results As New List(Of IndexCreationResult)
For Each arraySize As Integer In arraySizes
results.Add(TestIndexCreation(arraySize))
Next
CsvFileOutput(Of IndexCreationResult).Output("filename”, results)
End Sub
In all of the benchmarks, the first thing we do is call the
test and execute it for a table with a single element. This forces the required
libraries to load, and keeps the first test sequence from reporting an
unusually high execution time. We then execute the test for table sizes from 10
to 1,000,000 records. Within each benchmark, the operations are repeated
several times, with the results being averaged. This should help smooth out any
inconsistencies due to transient conditions. The results are then written to a
CSV file, located in the executing directory (typically ..\bin\Debug).
The LINQ test includes an extra line of code to force the
execution of the query. Since LINQ uses deferred execution, without this line
of code, the LINQ results will always be instantaneous...and won’t really
return a result. I also executed this LINQ query against both the indexed and
non-indexed tables. Here’s a code sample to demonstrate creating a LINQ query
against a dataset.
Private Function TestLinqSearch(ByVal dt As DataTable, ByVal key As String) As Integer
Dim sw As New Stopwatch
sw.Start()
Dim result = From r In dt Where r(0) = key Select r(1)
Dim output As String = result(0).ToString
sw.Stop()
Console.WriteLine(output)
Return sw.ElapsedTicks
End Function
The line above that begins “Dim result” creates a LINQ query.
However, due to deferred execution, that query isn’t evaluated until the line
that begins “Dim output”. This second call is required to get an accurate
portrayal of the performance.
Wherever possible, I’ve removed extraneous work, such as
string concatenation and resolving the reference to a column when referred to
by name, from the benchmarked code. While this may not always reflect the way
DataTables are used in the real world, I’ve deliberately done this to optimize
the DataTable as much as possible. Believe me, it needs the help! In all fairness,
however, I’ve applied the same optimizations to the other approaches.
Finally, all of the results for this benchmark are in ticks,
not milliseconds. The reason for that is simple: the results of the searches
for indexed tables and dictionaries are too fast to have values in the
millisecond range.
Also of note is that indexing the table provides a minor
advantage to the LINQ query. This was somewhat surprising to me, and the
effects of indexing the table when filtering on multiple columns will certainly
bear further exploration.
Third, given the scenario of finding exactly one record from
a set, transforming your DataTable to a Dictionary is clearly the fastest and
most scalable alternative. In a scenario where more than one record needs to be
retrieved, LINQ should be seriously considered. Clearly, Microsoft has done a
great deal of work on optimizing LINQ, even against large sets of data.
Since every option other than DataTable.Select scales well
with large data sets, any of them are a better option. Scour DataTable.Select
from your code base!
ArraySize | Table Search | Dictionary Create | LINQ | Indexed LINQ | Indexed Table Search | Dictionary Search |
10 | 533 | 217 | 197 | 179 | 111 | 38 |
50 | 1095 | 487 | 236 | 159 | 158 | 44 |
100 | 1552 | 935 | 232 | 148 | 143 | 38 |
500 | 6820 | 5645 | 315 | 249 | 182 | 47 |
1000 | 14303 | 8922 | 234 | 162 | 172 | 40 |
5000 | 85588 | 50487 | 233 | 171 | 198 | 42 |
10000 | 185221 | 95179 | 247 | 193 | 216 | 44 |
50000 | 1324494 | 490918 | 256 | 185 | 231 | 45 |
100000 | 2355922 | 999825 | 297 | 181 | 259 | 76 |
500000 | 14347129 | 5004418 | 279 | 186 | 262 | 45 |
1000000 | 29753787 | 9910418 | 305 | 184 | 275 | 48 |
Now, let’s talk about the results of the benchmark; I’ve
included both the raw results (above) and the graph (below). The columns are
ordered by the duration of each operation. Since I’m proposing a Dictionary as
an alternative to a DataTable, I’ve also included the cost of creating a
Dictionary(Of String, String) to hold the same data as the DataTable. You’ll
note that it is slightly less expensive to create and populate the Dictionary
than it is to perform a single DataTable.Select against the data. As long as
you’re querying the data more than once, creating a Dictionary actually proves
to be a winning proposition. (As a side note, there’s some variation with
Dictionary based on the key type. This isn’t really germane to the conversation
at hand, but I include in for completeness.)
.jpg)
However, since with either a Dictionary or an Indexed Table,
you have a significant amount of overhead associated with the refactoring of
the data into an efficient data structure, LINQ is a much better option from a
performance perspective. In addition, the code is much simpler, more concise,
and intuitively readable for anyone familiar with database operations.
Aggregation
So far, the score is LINQ to DataSets 1, DataTable methods
0. Let’s see if the venerable ADO.NET DataTable can redeem itself in part two
of this matchup: aggregation. We’ll use the same basic DataTable that we’ve
been using so far: a non-indexed DataTable with two columns and N rows. We’ll
perform three basic aggregations: MIN, MAX and AVG. Syntactically, these look
like the following:
Dim linqMinResult = (From r In dt Select r.Field(Of Int32)("Value")).Min
Dim linqMaxResult = (From r In dt Select r.Field(Of Int32)("Value")).Max
Dim linqAvgResult = (From r In dt Select r.Field(Of Int32)("Value")).Average
Dim computeMinResult = dt.Compute("MIN(Value)", "")
Dim computeMaxResult = dt.Compute("MAX(Value)", "")
Dim computeAvgResult = dt.Compute("AVG(Value)", "")
This is probably the appropriate time for me to issue a
disclaimer. Since I’m not an expert on LINQ, the first six LINQ queries in this
article were built using the code found on MSDN. However, not all LINQ queries
are created equal. A colleague was kind enough to point out the following
variant of the MAX aggregate query to demonstrate this fact:
Dim linqMaxResult2 = Aggregate r In dt Into Max(r.Field(Of Integer)("value"))
I’ve included the results of this query in the chart for the
MAX aggregation, labeled as LINQ MAX (slow). You’ll notice that, although the
queries are semantically equivalent, the second query runs about three times slower
than the previous LINQ query. The
difference in performance has to do with the fact that the fast version uses an
overload of MAX that doesn’t require any parameter. (The rule of thumb here,
when using LINQ to DataSets, is that you’re better off using a LINQ query that
explicitly uses Select.) However, although there is a performance difference,
the queries scale similarly.
.jpg)
.jpg)
.jpg)
The winner in this case really isn’t as clear-cut. It’s hard
to see from the graphs, but LINQ tends to run slightly faster for MIN and
slightly slower for both MAX and AVG. After seeing the huge performance
improvement for LINQ when doing searches against a DataTable, I expected to see
similar scalability enhancements with aggregation. Looks like we get to wait
for Parallel LINQ (PLINQ; part
of .NET 4.0) to see that.
At the end of the day, however, if you’re using DataTables,
the Select method is one of the worst things you could possibly do to limit the
scalability of your application. We’ve discussed several alternatives here,
including LINQ, indexing tables to make use of Rows.Find, and morphing the DataTable
into a generic collection. Any of these options will be orders of magnitude
faster than Select for a set of records of reasonable size. You also want to
consider paring your set of records down as much as possible to speed up both
aggregation and selection of records.
Note: all benchmarks in this article were performed using a
2.8GHz dual-core laptop with 4 GB RAM, running 32-bit Vista. Your mileage may
vary.
About the Author
Jeff
Certain is a software architect at Colorado
CustomWare, a Fort Collins, CO company specializing in CAMA software.
Coming from a hardware engineering background, he is particularly interested in
code optimization and multi-threading of applications. Jeff is a member of the
IEEE, President of the Northern Colorado .NET User Group, a co-founder of the
Northern Colorado Software Architects group, and a Microsoft Visual Basic MVP.