
Tips for Successful Data Access

 

Dino Esposito
Wintellect

December 2003

Applies to
   Microsoft® ASP.NET
   Microsoft® ADO.NET

Summary: Discusses techniques to speed up data retrieval for data-driven applications, considering successful strategies for querying, paging, sorting, and caching. (7 printed pages)

Contents

Data Access and Execution Time
The Data Reader Object
Internals of the SqlDataReader Class
Command Behaviors
Caching Strategies
Strategies for Paging and Sorting
Summary

Data Access and Execution Time

The actual performance of a distributed, data-driven Web application can't be accurately measured using a single metric. A reliable measure of performance needs to balance various metrics, such as execution time, response time, throughput, and so on. These are independent but closely related parameters. The extreme optimization of one of them inevitably leads to a deterioration of the others, and the more these quantities are imbalanced, the more your real performance is at risk. In order to write applications that perform well, an optimal balance between the various metrics needs to be achieved.

That said, what makes optimization a delicate art is the lack of rules. Though a few commonly accepted guidelines do exist, they aren't sufficient to form a reliable recipe; rather, they merely outline the road ahead.

It is widely recognized that throughput is the most important of the performance parameters. Throughput is defined as the number of requests served per unit of time—typically requests per second. If you buy the statement that throughput is what really matters, the next step entails figuring out what weapons you have to improve those numbers.

Throughput varies according to the number of client threads that act on the server. The number of available processors, and the use that the server application makes of them, also influences the final throughput. In addition, the throughput is seriously affected by network latency. You can squeeze every little bit of performance out of your code, but if you operate over a slow connection, well, you still have a problem. It may be a relatively straightforward one to solve, but it's a problem nonetheless.

Another factor is the execution time, a rather self-explanatory parameter. Execution time is the time it takes to process a request and generate all the bytes for the client. For distributed, data-driven applications (do other types of application really exist?), execution time is directly connected to data access and caching strategies, state management, and the availability of an effective infrastructure for tiers.

Despite this wide-ranging introduction, this article focuses on a single aspect of performance: data retrieval. Data retrieval is not itself one of the first-level optimization parameters (execution and response time), but it strongly influences execution time. In this article, I'll review some techniques to speed up data retrieval, taking into account scenarios such as sorting, paging, and caching. This article won't give you a pat recipe, but hopefully it will give you what you need to find a solution.

The Data Reader Object

For Microsoft® .NET Web applications, data access means using the Microsoft® ADO.NET classes. Microsoft designed ADO.NET with a particular scenario in mind. While ADO was built in the age of two-tier, client/server applications, the design goals of ADO.NET reflect the disconnected style of data access seen in highly scalable, multi-tiered Web applications. In this context, data is passed around from tier to tier, going from presentation to storage and vice versa through business objects. The number of these objects can be fine-tuned to accommodate scalability and throughput.

For this reason, the DataSet is by far the most talked-about ADO.NET object. But what exactly is the DataSet? It is an in-memory cache of data whose overall set of functions closely recalls that of an in-memory database. The object is serializable to a binary or SOAP stream, and also to a couple of XML schemas. You can save the DataSet to a stateless or a stateful format (DiffGram) at your convenience. Once serialized, the object can be easily transferred across tiers. Combine this feature with batch update, and you have the ingredients of a typical, modern, highly scalable application.
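
For example, here is a minimal sketch of saving a DataSet in both formats; the variable and file names are placeholders.

// ds is an already-filled DataSet; file names are placeholders
ds.WriteXml("data.xml");                               // stateless format
ds.WriteXml("diffgram.xml", XmlWriteMode.DiffGram);    // stateful (DiffGram) format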

Above all, though, the DataSet is a container of disconnected data. It relies on data adapters to fetch data. Data adapters, in turn, are an abstraction over classic SQL command objects. Compared to commands, data adapters add automatic data caching. In other words, the adapter first retrieves data, then caches it in a DataSet, closes the connection, and returns. Depending on what you have to do with the data, this may or may not be an ideal approach. If you're just consuming rather than caching data, and operate in a forward-only, read-only way, the data reader represents a better option. This is particularly true for the Microsoft® SQL Server™ data-reader object, the SqlDataReader class.

Internals of the SqlDataReader Class

The SqlDataReader class reads database rows in a cursor-like fashion over SQL Server's native Tabular Data Stream (TDS) protocol. There are two important points in favor of data readers when you're looking for data to consume (that is, fresh data to populate a user interface, perform calculations, or retrieve helper values) as opposed to data to cache. The first point is the high performance of data readers, which is inherent in the design and implementation of the class. The second point is even more substantial. All things considered, in ADO.NET the data reader is actually the only way for you to get data. Data can be obtained using commands or data adapters. The SqlCommand class retrieves data through scalars or readers. The SqlDataAdapter class retrieves data through the DataSet. However, to populate the DataSet, the adapter uses a data reader: it scrolls through all the rows and packs the values into the given instance of the DataSet.

To create a SqlDataReader, you call the ExecuteReader method of the SqlCommand object rather than using a constructor. While the data reader is in use, the underlying connection is busy and no other operations can be performed on it, aside from closing it. It is also important to note that any changes made to the result set by a concurrent process are immediately visible to the user of the data reader.
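
Here is a minimal sketch of the canonical pattern; the connection string and query assume a local Northwind database and should be adjusted for your environment.

string connString = "server=(local);database=northwind;integrated security=sspi";
SqlConnection conn = new SqlConnection(connString);
SqlCommand cmd = new SqlCommand("SELECT lastname FROM employees", conn);
conn.Open();
SqlDataReader reader = cmd.ExecuteReader();
while (reader.Read()) {
   // Consume the current row; the reader is forward-only and read-only
   string name = reader.GetString(0);
}
reader.Close();
conn.Close();   // closing the reader does not close the connection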

The data reader has properties that return information about the characteristics of a row of data, such as the number of columns and whether the current result set has rows. The HasRows property, added in version 1.1 of the Microsoft® .NET Framework, indicates whether the SqlDataReader contains rows or is empty. The data reader also provides a number of methods to read values. There are several ways to retrieve values using a data reader.

object o = reader[fieldName];
object o = reader[fieldIndex];
object o = reader.GetValue(fieldIndex);
object o = reader.GetSqlValue(fieldIndex);

Do these methods all work in the same way? Actually, there are some slight differences that can impact the performance of your application.

First and foremost, methods that accept indexes are faster than methods requiring a column name, although they are perhaps less practical from a developer's standpoint, as it is less intuitive to deal with a call like reader[3] than reader["LastName"]. A column name is always resolved internally into a 0-based column index. The data reader's GetOrdinal method provides for this translation service.
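
A sketch of this optimization: resolve the name to an ordinal once with GetOrdinal, then use the index-based accessor inside the loop. The column name is hypothetical.

// Resolve the column name once, outside the loop
int lastName = reader.GetOrdinal("LastName");
while (reader.Read()) {
   // Index-based access avoids a name lookup on every row
   string s = reader.GetString(lastName);
}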

The GetValue method is also slightly faster than the Item accessors. The Item property simply calls into GetValue and in so doing transforms the column name into an index. The difference between GetSqlValue and GetValue is all in the type they return. The former returns a .NET type that mimics a native SQL Server type; the latter returns the content of a field as an instance of a base .NET type. The data reader features a handful of type-specific methods, such as GetString, GetDateTime, GetSqlString, and GetSqlDateTime. The GetXXX methods just call into the corresponding GetSqlXXX and then extract the actual value from the SQL wrapper type. The SQL types are just wrappers around a .NET object, as the following pseudo-code demonstrates.

string GetString(int i) {
   SqlString str = GetSqlString(i);
   return str.Value;
}

GetSqlXXX methods are slightly faster than the corresponding GetXXX methods. The reason for the better performance lies in the fact that the types in the System.Data.SqlTypes namespace closely match the internal representation of types used by SQL Server. As a result, no conversion is required to return data to the application.

The performance improvement provided by GetSqlXXX methods is not dramatic, but there are other reasons to consider their use. Using SQL types allows for a more comfortable and forgiving handling of null values. For example, if a retrieved column contains a null value and you attempt to store it in a .NET type, at a minimum you get an exception. By contrast, nothing bad happens if you store a null value in a SQL type; the IsNull property on each SQL type lets you check for the null value later. When GetXXX methods are used, the IsDBNull method on the data reader class lets you check whether a column holds null, but you have to call it before reading the value to avoid exceptions.
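
The following sketch contrasts the two styles; the column index is arbitrary, and SqlString requires the System.Data.SqlTypes namespace.

// With GetXXX, test IsDBNull before reading to avoid an exception
if (!reader.IsDBNull(1)) {
   string s1 = reader.GetString(1);
}

// With GetSqlXXX, read first and test for null afterwards
SqlString name = reader.GetSqlString(1);   // never throws on null
if (!name.IsNull) {
   string s2 = name.Value;
}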

Another reason for using SQL types is that in this way you may avoid loss of precision during the conversion. For example, the decimal type has 38 digits in SQL Server, but only 28 in the .NET type system.

To top off the discussion about the internal plumbing of the SqlDataReader class, it is worth mentioning that the class implements the IEnumerable interface. This allows you to data bind a data reader to server data-bound controls. If you bind a data reader to, say, a DataGrid, you should close the reader (and the underlying connection) immediately after the method DataBind returns. Finally, note that when you close the reader, the underlying connection is not automatically closed. Calling the Close method on the connection object is therefore a necessity. There's just one situation in which closing the reader closes the connection as well—when a special command behavior is used.
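
A sketch of the binding pattern just described, assuming grid is a DataGrid declared on the page and conn is the open connection.

grid.DataSource = reader;
grid.DataBind();      // the grid walks the reader via IEnumerable
reader.Close();       // close the reader as soon as DataBind returns
conn.Close();         // closing the reader does not close the connection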

Command Behaviors

You obtain a reader from a SqlCommand object by calling the ExecuteReader method. When calling ExecuteReader, you can require a particular working mode known as a command behavior. ExecuteReader has an overload that takes an argument of type CommandBehavior.

cmd.ExecuteReader(CommandBehavior.CloseConnection);

CommandBehavior is an enumeration whose values are listed below.

Table 1. Command behaviors for the data reader

Behavior           Description
CloseConnection    Automatically closes the connection when the data reader is closed.
Default            Setting this option is functionally equivalent to calling ExecuteReader without parameters.
KeyInfo            The query returns column metadata and primary key information.
SchemaOnly         The query returns column metadata only.
SequentialAccess   Enables the reader to load data as a sequential stream.
SingleResult       Only the first result set is returned.
SingleRow          The query is expected to return a single row.

Proper use of the command behaviors can enhance the performance of the data reader.

The SequentialAccess mode applies to all columns in the returned result set. This means you can access columns only in the order in which they appear in the result set. For example, you cannot read column #2 before column #1. More exactly, if you read or move past a given location, you can no longer read or move back. Combined with the GetBytes method, sequential access can be helpful in cases in which you must read binary large objects (BLOB) with a limited buffer. It is also helpful to optimize data access time if you need to read only a few fields in a table with dozens of fields.
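
A sketch of chunked BLOB reading under SequentialAccess; the column index and buffer size are arbitrary.

SqlDataReader reader = cmd.ExecuteReader(CommandBehavior.SequentialAccess);
byte[] buffer = new byte[4096];
while (reader.Read()) {
   long offset = 0;
   long bytesRead;
   // GetBytes copies up to buffer.Length bytes and returns the count
   while ((bytesRead = reader.GetBytes(0, offset, buffer, 0, buffer.Length)) > 0) {
      // Process the chunk here, for example write it to a stream
      offset += bytesRead;
   }
}
reader.Close();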

To operate with a data reader, you first have to call its Read method. The method moves the internal pointer forward and selects the next record, if any. By default, the values of all columns are cached internally and then returned by the various GetXXX and GetSqlXXX methods. If you know you need only a few fields, the SequentialAccess behavior helps to optimize your code.

You can also specify SingleRow when executing queries that are expected to return multiple result sets. In this case, all the generated result sets are correctly returned, but each result set has a single row.

Caching Strategies

Retrieving data from a database is certainly slower than retrieving cached data from memory; however, caching is not for all types of applications. An application that needs to provide up-to-date information can hardly exploit caching or, at least, can afford caching for a limited amount of time only. The less time you hold data in memory, the less benefit you get from caching; however, the longer you cache a value, the more stale it becomes. An application-specific tradeoff is mandatory to make a decision.

In ASP.NET, the built-in caching mechanism—the Cache object—is particularly efficient and effective, but this fact alone is insufficient to make caching great for everybody. The Cache object makes implementing a cache of global data particularly easy, but it doesn't make caching affordable for all applications. This may sound obvious, but it is a frequent source of poor design.
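
For example, here is a minimal sketch of the typical check-then-fill pattern for a global lookup table; the key name, the adapter, and the expiration policy are placeholders.

DataSet lookup = (DataSet) Cache["Countries"];
if (lookup == null) {
   lookup = new DataSet();
   adapter.Fill(lookup, "Countries");   // adapter is a configured SqlDataAdapter
   Cache.Insert("Countries", lookup,
      null,                             // no cache dependency
      DateTime.Now.AddMinutes(5),       // absolute expiration
      System.Web.Caching.Cache.NoSlidingExpiration);
}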

While evaluating the pros and cons of caching for your application, weigh the cost of data retrieval on one side against the speed of in-memory access on the other. Mediate the results with the required refresh frequency, and you have the magic formula.

For ASP.NET applications, caching is particularly effective for frequently used global data, such as lookup tables. Caching is less recommended for session-specific data, which would more significantly affect the scalability of the application. Also, the Cache object is of no use in a Web farm scenario and can compromise overall performance in the case of Web gardens. This is mostly due to the way the ASP.NET worker process uses multiple processors. When the Web garden model is enabled, the worker process is cloned on a per-processor basis and the application data is duplicated, although global objects are stored in shared memory. For applications holding a lot of state in the Cache object, the Web garden can be a risky model.

The second consideration relates to the excellent SQL Server internal cache mechanism, which allows for a good deal of reuse for repeated queries and stored procedures. By avoiding or limiting the application's use of the cache, you make your application inherently more scalable horizontally (scale-out) because you can port it to a Web farm more easily. By extensively using the cache, your application is inherently more scalable vertically (scale-up) and better capable of exploiting faster machines—but not necessarily multi-processor machines.

Strategies for Paging and Sorting

Paging and sorting have something in common. Logically speaking, these two functions don't get new data, and they don't apply transformations to data; they just return a different view of the same data. At first glance, both can reasonably be performed on the topmost tier (that is, the presentation or middle tier) without going down to the database. Upon a second look, however, paging and sorting turn out to be tough functions. There is a real advantage only if you can perform paging and sorting on the client, which is not an impossible thing to do, at least for paging.

Paging on an intermediate layer is effective only if you can afford caching, namely only if the characteristics of the application and the data allow for it without raising scalability issues. Data-bound controls, the DataSet, and the Cache object are the base elements to combine for an effective approach. Data is retrieved once, or periodically, and stored in a DataSet cached on the Web server. The PagedDataSource object (embedded in the DataGrid control) provides paging for free at the user-code level.
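
A sketch of the pattern, assuming a DataGrid named grid with AllowPaging enabled and a DataSet already parked in the Cache under a hypothetical key.

void Grid_PageIndexChanged(object sender, DataGridPageChangedEventArgs e) {
   grid.CurrentPageIndex = e.NewPageIndex;
   // Re-bind against the cached data; no new query hits the database
   DataSet ds = (DataSet) Cache["Customers"];
   grid.DataSource = ds.Tables[0];
   grid.DataBind();
}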

If caching is not an option, things get complicated. The most effective approach is resorting to SQL Server and its built-in and transparent caching mechanism. However, this leads to another problem. How would you query for a "page" of rows? The SQL language doesn't include the notion of a page, and reasons in terms of clauses and conditions. Expressing a "page" with clauses means designing a SELECT statement that can retrieve records by position, much as a server cursor would do.

Some databases—Oracle for instance—support a special keyword that gives any selected row a progressive number. For Oracle the keyword is ROWNUM. The most efficient way to work around this issue is to make assumptions about the structure of the tables involved. For example, you could add an identity column containing a regular succession of values to base your paging algorithm upon. Things are simpler if you content yourself with next/prev paging. In this case, caching a couple of key values (first and last in the page) can do the trick of effective server-side paging. Going to SQL Server for paging has the advantage that little data is moved around and the Web server's memory is not taxed.
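
A sketch of key-based next-page retrieval against Northwind's customers table; the page size and the lastKeyInPage variable (the cached last key of the current page) are illustrative assumptions.

// Retrieve the next page: rows whose key follows the last key displayed
string query = "SELECT TOP 20 customerid, companyname FROM customers " +
               "WHERE customerid > @LastKey ORDER BY customerid";
SqlCommand cmd = new SqlCommand(query, conn);
cmd.Parameters.Add("@LastKey", SqlDbType.NChar, 5).Value = lastKeyInPage;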

Sorting data is somewhat easier if you go to SQL Server to query for data. It is a computationally heavy operation if performed elsewhere, especially on the Web server, where it can lengthen the execution time. Sorting is an N*log(N) operation, which grows faster than linearly. Sorting the 20 records displayed in a page is not an issue; sorting 40 thousand records stored in the cache, on a per-user basis, is much more serious. Once again, caching is key. If your data is global and shared across sessions, consider caching pre-sorted data too. Of course, this tip makes sense only if you know the required sorting in advance.
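
When the sort expressions are known in advance, caching pre-sorted views of shared data might look like the following sketch; the table and key names are placeholders.

// Build and cache a sorted view once; all sessions share it
DataView byName = new DataView(ds.Tables["Customers"]);
byName.Sort = "companyname ASC";
Cache["CustomersByName"] = byName;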

Summary

I started out this article introducing the concept of performance and its measures. Next, I demonstrated how data access, for typical Web applications, is a critical element that can favor or hinder throughput. The theory of optimization is unquestionably important. However, we're not talking abstract mathematics or general-purpose algorithms. We're talking about real-world applications and customers. Benchmarking is key. Tradeoffs are a necessity, but you should base your choices on numbers. If you have bad numbers, the likelihood of a bad choice grows exponentially. To help you capture reliable numbers, Microsoft provides Application Center Test, a Microsoft® Visual Studio .NET tool you can use to write tests that simulate multiple users accessing your application. This can give you a clear sense of the stability and responsiveness of the application.

About the Author

Dino Esposito is a trainer and consultant based in Rome, Italy. A member of the Wintellect team, Dino specializes in ASP.NET and ADO.NET and spends most of his time teaching and consulting across Europe and the United States. In particular, Dino manages the ADO.NET courseware for Wintellect and writes the Cutting Edge column for MSDN Magazine.
