By Jeff Certain
Colorado CustomWare
Download the code for this article
Many data access strategies are available in the .NETdevelopment space—ADO.NET Entity Framework, ASP.NET Dynamic Data, XML,Hibernate, LINQ to SQL, and ObjectDataSource, to name a few. However, in manycases ADO.NET DataTables are still in use. This may be because your currentarchitecture is based on DataTables or because they are simple to use. Thebusiness may require you to stay on .NET 1.x. You may simply have not had thetime to explore these other data access approaches. Whatever the reason, it’suseful to know how to maximize the performance of the DataTable. We’ll look attwo common scenarios: querying data and aggregating data.
Querying Data
Before I discuss how to manipulate data in memory, and theassociated benchmarks, let me state that, in most situations, it makes sense toquery or aggregate data on the database. There are a few reasons for this.First, the database engine is highly optimized to perform these sorts offunctions; we’ll see shortly that ADO.NET DataTables really aren’t. Second, theless data you move across the wire, the better; performing data operations onthe database means that you only have to transfer the results, not the rawdata. Third, if you’re caching large amounts of data on the local machine, yourun the risk of having stale data displayed to your users, leaving you to dealwith the possibility that multiple users may be working with out-of-syncversions of the same data.
With that caveat out of the way, let’s talk about onescenario where it makes sense for potentially large data tables to be cachedlocally. You could well have a number of static tables that exist almost solelyto provide data normalization—what’s often referred to as a “lookup table.” Ideal daily with lookup tables that contain multiple sets of data, identified byyear. Since users can work with data from past years, we potentially need topresent lookup data from any year, at any time. The ramification of thisscenario is that long-term customers often have significantly more data inlookup tables than new customers. To make matters worse, due to the nature ofthe data, larger clients tend to have more entries in their lookup tables foreach year. As you can probably guess, this has led to some performance issues,since the ADO.NET DataTable doesn’t scale very well when used naively. In ourdevelopment environment, that lookup table had a handful of records; in testingwith a realistic data set, there were almost 90,000 records in the same table.Performance degraded significantly with the larger set of records.
Why did performance problems exist? One cause was thefrequent use of DataTable.Select to return a single record from the DataTable.Select takes arbitrary criteria and returns an array of DataRows. Essentially,DataTable.Select has to walk the entire table and compare every record to thecriteria that you passed in.
DataTable.Rows.Find, on the other hand, returns only asingle row. Essentially, when you specify the primary key, a binary tree iscreated. This has some overhead associated with it, but tremendously speeds upthe retrieval.
How much overhead? Well, that depends on a couple offactors. The first is the number of rows in the table. The second factor iswhether you create the index before or after you add the data. As the followinggraph shows, there is significant overhead associated with creating the indexbefore adding the data. Examining the data shows a 30%-70% difference in thetime taken to create and index the tables.
.jpg)
You can perform this same benchmark using the “IndexCreation” button in the benchmarking UI. (The source code used in producing thebenchmarks is available on the MSDN Code Gallery.)
From the graph above, you can see that it takes roughly fiveseconds to index a table of 100,000 records. To Index a DataTable, you’ll needto specify an array of DataColumns that form the unique key for records in theDataTable; for example,
dtIndexed.PrimaryKey = New DataColumn() {dtIndexed.Columns("Key")}Knowing that there are performance implications to creatingthese indexes, I’d recommend spinning the indexing task off to a backgroundthread and doing this asynchronously. (Note that DataTables aren’t inherentlythread-safe, so you need to be a little careful doing this. BackgroundWorker isa nice option, since the DataTable won’t be available to the main applicationthread until the table creation, indexing and population is complete.) Thequestion becomes, is a five-second overhead worth the benefit? Let’s take alook at the differences in performance of various methods of querying aDataTable.
Data Retrieval Performance
Before I discuss the results of the benchmarks, let me talka little bit about how the benchmarks are set up. The following is a prettytypical block of code:
Private Sub RunIndexCreationTest()
' make sure all libraries are loaded
TestIndexCreation(1)
Dim results As New List(Of IndexCreationResult)
For Each arraySize As Integer In arraySizes
results.Add(TestIndexCreation(arraySize))
Next
CsvFileOutput(Of IndexCreationResult).Output("filename”, results)
End SubIn all of the benchmarks, the first thing we do is call thetest and execute it for a table with a single element. This forces the requiredlibraries to load, and keeps the first test sequence from reporting anunusually high execution time. We then execute the test for table sizes from 10to 1,000,000 records. Within each benchmark, the operations are repeatedseveral times, with the results being averaged. This should help smooth out anyinconsistencies due to transient conditions. The results are then written to aCSV file, located in the executing directory (typically ..\bin\Debug).
The LINQ test includes an extra line of code to force theexecution of the query. Since LINQ uses deferred execution, without this lineof code, the LINQ results will always be instantaneous...and won’t reallyreturn a result. I also executed this LINQ query against both the indexed andnon-indexed tables. Here’s a code sample to demonstrate creating a LINQ queryagainst a dataset.
Private Function TestLinqSearch(ByVal dt As DataTable, ByVal key As String) As Integer
Dim sw As New Stopwatch
sw.Start()
Dim result = From r In dt Where r(0) = key Select r(1)
Dim output As String = result(0).ToString
sw.Stop()
Console.WriteLine(output)
Return sw.ElapsedTicks
End FunctionThe line above that begins “Dim result” creates a LINQ query.However, due to deferred execution, that query isn’t evaluated until the linethat begins “Dim output”. This second call is required to get an accurateportrayal of the performance.
Wherever possible, I’ve removed extraneous work, such asstring concatenation and resolving the reference to a column when referred toby name, from the benchmarked code. While this may not always reflect the wayDataTables are used in the real world, I’ve deliberately done this to optimizethe DataTable as much as possible. Believe me, it needs the help! In all fairness,however, I’ve applied the same optimizations to the other approaches.
Finally, all of the results for this benchmark are in ticks,not milliseconds. The reason for that is simple: the results of the searchesfor indexed tables and dictionaries are too fast to have values in themillisecond range.
Also of note is that indexing the table provides a minoradvantage to the LINQ query. This was somewhat surprising to me, and theeffects of indexing the table when filtering on multiple columns will certainlybear further exploration.
Third, given the scenario of finding exactly one record froma set, transforming your DataTable to a Dictionary is clearly the fastest andmost scalable alternative. In a scenario where more than one record needs to beretrieved, LINQ should be seriously considered. Clearly, Microsoft has done agreat deal of work on optimizing LINQ, even against large sets of data.
Since every option other than DataTable.Select scales wellwith large data sets, any of them are a better option. Scour DataTable.Selectfrom your code base!
ArraySize | Table Search | Dictionary Create | LINQ | Indexed LINQ | Indexed Table Search | Dictionary Search |
10 | 533 | 217 | 197 | 179 | 111 | 38 |
50 | 1095 | 487 | 236 | 159 | 158 | 44 |
100 | 1552 | 935 | 232 | 148 | 143 | 38 |
500 | 6820 | 5645 | 315 | 249 | 182 | 47 |
1000 | 14303 | 8922 | 234 | 162 | 172 | 40 |
5000 | 85588 | 50487 | 233 | 171 | 198 | 42 |
10000 | 185221 | 95179 | 247 | 193 | 216 | 44 |
50000 | 1324494 | 490918 | 256 | 185 | 231 | 45 |
100000 | 2355922 | 999825 | 297 | 181 | 259 | 76 |
500000 | 14347129 | 5004418 | 279 | 186 | 262 | 45 |
1000000 | 29753787 | 9910418 | 305 | 184 | 275 | 48 |
Now, let’s talk about the results of the benchmark; I’veincluded both the raw results (above) and the graph (below). The columns areordered by the duration of each operation. Since I’m proposing a Dictionary asan alternative to a DataTable, I’ve also included the cost of creating aDictionary(Of String, String) to hold the same data as the DataTable. You’llnote that it is slightly less expensive to create and populate the Dictionarythan it is to perform a single DataTable.Select against the data. As long asyou’re querying the data more than once, creating a Dictionary actually provesto be a winning proposition. (As a side note, there’s some variation withDictionary based on the key type. This isn’t really germane to the conversationat hand, but I include in for completeness.)
.jpg)
However, since with either a Dictionary or an Indexed Table,you have a significant amount of overhead associated with the refactoring ofthe data into an efficient data structure, LINQ is a much better option from aperformance perspective. In addition, the code is much simpler, more concise,and intuitively readable for anyone familiar with database operations.
Aggregation
So far, the score is LINQ to DataSets 1, DataTable methods0. Let’s see if the venerable ADO.NET DataTable can redeem itself in part twoof this matchup: aggregation. We’ll use the same basic DataTable that we’vebeen using so far: a non-indexed DataTable with two columns and N rows. We’llperform three basic aggregations: MIN, MAX and AVG. Syntactically, these looklike the following:
Dim linqMinResult = (From r In dt Select r.Field(Of Int32)("Value")).Min
Dim linqMaxResult = (From r In dt Select r.Field(Of Int32)("Value")).Max
Dim linqAvgResult = (From r In dt Select r.Field(Of Int32)("Value")).Average
Dim computeMinResult = dt.Compute("MIN(Value)", "")
Dim computeMaxResult = dt.Compute("MAX(Value)", "")
Dim computeAvgResult = dt.Compute("AVG(Value)", "")This is probably the appropriate time for me to issue adisclaimer. Since I’m not an expert on LINQ, the first six LINQ queries in thisarticle were built using the code found on MSDN. However, not all LINQ queriesare created equal. A colleague was kind enough to point out the followingvariant of the MAX aggregate query to demonstrate this fact:
Dim linqMaxResult2 = Aggregate r In dt Into Max(r.Field(Of Integer)("value"))I’ve included the results of this query in the chart for theMAX aggregation, labeled as LINQ MAX (slow). You’ll notice that, although thequeries are semantically equivalent, the second query runs about three times slowerthan the previous LINQ query. Thedifference in performance has to do with the fact that the fast version uses anoverload of MAX that doesn’t require any parameter. (The rule of thumb here,when using LINQ to DataSets, is that you’re better off using a LINQ query thatexplicitly uses Select.) However, although there is a performance difference,the queries scale similarly.
.jpg)
.jpg)
.jpg)
The winner in this case really isn’t as clear-cut. It’s hardto see from the graphs, but LINQ tends to run slightly faster for MIN andslightly slower for both MAX and AVG. After seeing the huge performanceimprovement for LINQ when doing searches against a DataTable, I expected to seesimilar scalability enhancements with aggregation. Looks like we get to waitfor Parallel LINQ (PLINQ; partof .NET 4.0) to see that.
At the end of the day, however, if you’re using DataTables,the Select method is one of the worst things you could possibly do to limit thescalability of your application. We’ve discussed several alternatives here,including LINQ, indexing tables to make use of Rows.Find, and morphing the DataTableinto a generic collection. Any of these options will be orders of magnitudefaster than Select for a set of records of reasonable size. You also want toconsider paring your set of records down as much as possible to speed up bothaggregation and selection of records.
Note: all benchmarks in this article were performed using a2.8GHz dual-core laptop with 4 GB RAM, running 32-bit Vista. Your mileage mayvary.
About the Author
JeffCertain is a software architect at ColoradoCustomWare, a Fort Collins, CO company specializing in CAMA software.Coming from a hardware engineering background, he is particularly interested incode optimization and multi-threading of applications. Jeff is a member of theIEEE, President of the Northern Colorado .NET User Group, a co-founder of theNorthern Colorado Software Architects group, and a Microsoft Visual Basic MVP.