Last month (November 2001) I concluded that in ASP.NET, caching is the key to performance if you want to exploit Web controls and maintain optimal server response times. Caching relates directly to applications that can work disconnected from the data source. Not all applications can afford this: applications that work in a highly concurrent environment and need to detect incoming changes to data can't be adapted to work disconnected. However, there are scenarios where you have a large block of user-specific data that needs to be analyzed, sorted, aggregated, scrolled, and filtered. Your users need to extrapolate numbers and trends, but aren't interested in last-minute records. In such cases, server-side caching can be a key advantage.
Data caching can mean two things. You can temporarily park your frequently used data into memory data containers, or you can persist them to disk on the Web server or a machine downstream. But what is the ideal format for this data? And what is the most efficient way to load it back into an in-memory binary usable format? These are the questions I will answer this month.
ADO.NET and XML ADO.NET and XML are the core technologies that help you design an effective caching subsystem. ADO.NET provides a namespace of data-oriented classes through which you can build a rough but functional in-memory DBMS. XML is the input and output language of this subsystem, but it's much more than the language used to serialize and deserialize living instances of ADO.NET objects. If you have XML documents formatted like data—hierarchical documents with equally sized subtrees—you can synchronize them to ADO.NET objects and use both XML-related technologies and relational approaches to walk through the collection of data rows. Although ADO.NET and XML are tightly integrated, only one ADO.NET object has the ability to publicly manipulate XML for reading and writing. This object is called the DataSet.
ASP.NET apps often end up handling DataSet objects. DataSet objects are returned by data adapter classes, which are one of the two ADO.NET command classes that get in touch with remote data sources. DataSets can also be created from local data—any valid stream object can be read into and populate a DataSet object.
The DataSet has a powerful, feature-rich programming interface and works as an in-memory cache of disconnected data. It is structured as a collection of tables and relationships. This makes it suitable when you have to work with related tables of data. Using DataSets, all of your tables are stored in a single container. This container knows how to serialize its content to XML and how to restore it to its original state. What more could you ask for from a data container?
Devising an XML-based Caching System The majority of ASP.NET applications could take advantage of the Cache object for all of their caching needs. The Cache object is new to ASP.NET and provides unique and powerful features. It is a global, thread-safe object that does not store information on a per-session basis. In addition, the Cache is designed to ensure it does not tax the server's memory whatsoever. If memory pressure does become an issue, the Cache will automatically purge less recently used items based on a priority defined by the developer.
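To make those features concrete, here is a minimal sketch (the cache key and the ten-minute window are my own illustrative choices, not prescribed values) of inserting a DataSet into the Cache with a sliding expiration and an explicit priority:

```csharp
// Sketch: cache a DataSet under a hypothetical key, expiring it if it
// goes unused for 10 minutes. The High priority makes it one of the
// last items the scavenger purges under memory pressure.
Cache.Insert("Customers", ds,
             null,                          // no file/key dependency
             Cache.NoAbsoluteExpiration,    // no fixed deadline
             TimeSpan.FromMinutes(10),      // sliding expiration
             CacheItemPriority.High,
             null);                         // no removal callback
```

The null dependency and callback slots are the same parameters exercised later in this article when file dependencies and removal notifications come into play.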
Like Application, though, the Cache object does not share its state across the machines of a Web farm. I'll have more to say about the Cache object later. Aside from Web farms, there are a few tough scenarios in which you might want to consider alternatives to Cache. Even when you have large DataSets to store on a per-session basis, storing and reloading them from memory will be faster than any other approach. However, with many users connected at the same time, each storing large blocks of data, you might want to help the Cache object do its job better. An app-specific layered caching system built around the Cache object is an option. In this case, sensitive data goes into the Cache, efficiently managed by ASP.NET, while the rest of the data is cached in slower but memory-free storage—for example, session-specific XML files. Let's look at writing and reading DataSets from disk.
Saving intermediate data to disk is a caching alternative that significantly reduces the demands on the Web server. To be effective, though, it should involve minimum overhead—just the time necessary to serialize and deserialize data. Custom schemas and proprietary data formats are unfit for this technique because the extra steps required introduce a delay. In .NET, you can use the DataSet object to fetch data and to persist it to disk. The DataSet object natively provides methods to save to XML and to load from it. These procedures, along with the internal representation of the DataSet, have been carefully optimized. They let you save and restore XML files in an amount of time that grows linearly (rather than geometrically) with the size of the data to process. So instead of storing persistent data sets to Session, you can save them on the server on a per-user basis with temporary XML files.
To recognize the XML file of a certain session, use the Session ID—an ASCII sequence of letters and digits that uniquely identifies a connected user. To avoid the proliferation of such files, you kill them when the session ends. Saving DataSet objects to XML does not affect the structure of the app, as it will continue to work with the DataSet object in mind. The writing and reading is performed by a couple of ad hoc methods provided by the DataSet object with a little help from .NET stream objects.
A Layered Caching System If you want to use a cache mechanism to store data across multiple requests of the same page, your code will probably look like Figure 1. When the page first loads, you fetch all the data needed using the private member DataFromSourceToMemory. This function reads the rows from the data source and stores them in the cache, whatever that is. Subsequent requests for the page result in a call to DeserializeDataSource to fetch the data. This call tries to load the DataSet from the cache and resorts to physical access to the underlying DBMS if an exception is thrown. This can happen if the file has been deleted from its location for any reason. Figure 2 shows the app's global.asax file. In the OnEnd event, the code deletes the XML file whose name matches the current session ID.
The global.asax file resides in the root directory of an ASP.NET application. When you run an ASP.NET application, you must use a virtual directory. If you test an ASP.NET page outside a virtual directory, you won't capture any session or application event in your global.asax file. Also, while Session_OnStart is always raised, the Session_OnEnd event is not guaranteed to fire in an out-of-process scenario.
Each active ASP.NET session is tracked using a 120-bit string that is composed of URL-legal ASCII characters. Session ID values are generated so uniqueness and randomness are guaranteed. This avoids collisions and makes it harder to guess the session ID of an existing session.
The following code shows how to use the session ID to persist data to disk, serializing a DataSet to an XML file:

void SerializeDataSource(DataSet ds)
{
    String strFile = Server.MapPath(Session.SessionID + ".xml");
    XmlTextWriter xtw = new XmlTextWriter(strFile, null);
    ds.WriteXml(xtw);
    xtw.Close();
}

That code is equivalent to storing the DataSet in a Session slot:

Session["MyDataSet"] = ds;

Of course, the functionality of the previous two approaches is actually radically different. To read back previously saved data, you can use this code:

DataSet DeserializeDataSource()
{
    String strFile = Server.MapPath(Session.SessionID + ".xml");
    // Read the content of the file into a DataSet
    XmlTextReader xtr = new XmlTextReader(strFile);
    DataSet ds = new DataSet();
    ds.ReadXml(xtr);
    xtr.Close();
    return ds;
}

This function locates an XML file whose name matches the ID of the current session and loads it into a newly created DataSet object. If you have a caching system based on the Session object, you should use this routine to replace any code that looks like this:

DataSet ds = (DataSet) Session["MyDataSet"];

How many of you remember the IBM 360/370s? When I was a first-year university student, I learned about memory management on them, which introduced virtual memory as a way to increase performance. That memory model is structured like a pyramid of storage devices with decreasing size and increasing speed as you move from the bottom up.
Why all this history? Because an app-specific layered caching system built around the Cache object, even in the toughest scenario with the most stringent scalability requirements, can help the Cache object to perform in a better and more effective way.
Figure 3 shows some of the elements that could form the ASP.NET caching pyramid, but the design is not set in stone. The number and the type of layers are completely up to you, and are application-specific. In several Web applications, only one level is used: the DBMS tables level. If scalability is important, and your data is mostly disconnected, a layered caching system is almost a must.
Figure 3 Caching
The amount of data you can reach at any level is different, but the right doses are determined on a per-application basis.
Also different from layer to layer is the time needed to retrieve data. Session, in most cases, is an in-process and in-memory object. Nothing could be faster. Keeping Session lean is critical because it is duplicated for each connected user. For quick access to data that can be shared between users, nothing is better than Cache or Application. Cache is faster and provides for automatic decay and prioritization. Relatively large amounts of frequently used static data can be effectively stored in any of these containers.
Disk files serve as an emergency copy of data. Use them when you don't need or can't afford to keep all the data in memory, but when going to the database is too costly. Finally, DBMS views are just like virtual tables that represent the data from one or more tables in an alternative way. Views are normally used for read-only data, but under certain conditions they can be updateable.
Views can also be used as a security mechanism to restrict the data that a certain user can access. For example, some data can be available to users for query and/or update purposes, while the rest of the table remains invisible. And table views can constitute an intermediate storage for preprocessed or post-processed data. Therefore, accessing a view has the same effect for the application, but doesn't cause preprocessing delays or place any locks on the physical table.
XML Server-side Data Islands Caching is particularly useful when you have a large amount of data to load. However, when the amount of data is really huge, any technique—either on the client or the server—can hardly be optimal. When you have one million records to fetch, you're out of luck. In such situations, you can reduce the impact of the data bulk by using a layered architecture for caching by bringing the concept of client-side data islands to the server. An XML data island is a block of XML that is embedded in HTML and can be retrieved through the page's DOM. They're good at storing read-only information on the client, saving round-trips.
Used on the server, an XML data island becomes a persistent block of information that you can store in memory or, for scalability, on disk. But how do you read it back? Typically, in .NET you would use the DataSet's XML facilities to read and write it. For lots of data (say, one million records), caching this way is not effective if you don't need all the records in memory, and keeping all the records in a single file places a heavier load on the system. What about splitting the records into different XML files, organized like those in Figure 4? This expands the level of XML disk files shown in Figure 3.
Figure 4 Dividing Records for Performance
You can build up an extensible tree of XML files, each representing a page of database records. Each time you need a block of non-cached records, you fetch them from the database and add them to a new or existing XML data island. You would use a special naming convention to distinguish files on a per-session basis, for example, by appending a progressive index to the session ID. An index file can help you locate the right data island where a piece of data is cached. For really huge bulks of data, this minimizes the processing on all tiers. However, with one million records to manage there is no perfect tool or approach.
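One possible naming convention (purely illustrative; the helper name is mine) appends a progressive page index to the session ID, as the text above suggests:

```csharp
// Hypothetical helper: maps the current session plus a page of
// records to one server-side XML data island file
String GetIslandFileName(int pageIndex)
{
    return Server.MapPath(Session.SessionID + "_" + pageIndex + ".xml");
}
```

A small index file per session could then record which key ranges live in which island, so a lookup touches only one of the files.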
Automatic Cache Bubble-up Once you have a layered caching system, how you move data from one tier to the next is up to you. However, ASP.NET provides a facility that can involve both a disk file and a Cache object. The Cache object works like an application-wide repository for data and objects. Cache looks quite different from the plain old Application object. For one thing, it is thread-safe and does not require locks on the repository prior to reading or writing.
Some of the items stored in the Cache can be bound to the timestamp of one or more files or directories as well as an array of other cached items. When any of these resources change, the cached object becomes obsolete and is removed from the cache. By using a proper try/catch block you can catch the empty item and refresh the cache.
To help the scavenging routines of the Cache object, you can assign some of your cache items with a priority and even a decay factor that lowers the priority of the keys that have limited use. When working with the Cache object, you should never assume that an item is there when you need it. Always be ready to handle an exception due to null or invalid values. If your application needs to be notified of an item's removal, then register for the cache's OnRemove event by creating an instance of the CacheItemRemovedCallback delegate and passing it to the Cache's Insert or Add method.
strFile = Server.MapPath(Session.SessionID + ".xml");
CacheDependency fd = new CacheDependency(strFile);
DataSet ds = DeserializeDataSource();
Cache.Insert("MyDataSet", ds, fd);
You create the delegate and point it to your handler:

CacheItemRemovedCallback onRemove = new CacheItemRemovedCallback(DoSomething);

The signature of the event handler looks like this:

void DoSomething(String key, Object value, CacheItemRemovedReason reason)
From DataSet to XML When stored in memory, the DataSet is represented through a custom binary structure like any .NET class. Each and every data row is bound with two arrays: one for the current value and one for the original value. The DataSet is not kept in memory as XML, but XML is used for output when the DataSet is remoted through app domains and networks or serialized to disk. The XML representation of a DataSet object is based on diffgrams—a subset of the SQL Server™ 2000 updategrams. It is an optimized XML schema that describes changes the object has undergone since it was created or since the last time changes were committed.
If the DataSet—or any contained DataTable and DataRow object—has no changes pending, then the XML representation is a description of the child tables. If there are changes pending, then the remoted and serialized XML representation of the DataSet is the diffgram. The structure of a diffgram is shown in Figure 5. It is based on two nodes, <before> and <after>. A <before> node describes the original state of the record, while <after> exposes the contents of the modified record. An empty <before> node means the record has been added and an empty <after> node means the node has been deleted.
The method that returns the current XML format is GetXml, which returns a string. WriteXml saves the content to a stream while ReadXml rebuilds a living instance of the DataSet object. If you want to save a DataSet to XML, use WriteXml directly (instead of getting the text through GetXml) then save using file classes. When using WriteXml and ReadXml, you can control how data is written and read. You can choose between the diffgram and the basic format and decide if the schema information should be saved or not.
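For instance, the following sketch (the file name is hypothetical) shows how the XmlWriteMode and XmlReadMode enumerations let you choose the diffgram format explicitly:

```csharp
// Persist the DataSet, including any pending changes, as a diffgram
ds.WriteXml(strFile, XmlWriteMode.DiffGram);

// Reload it later. XmlReadMode.DiffGram requires that the target
// DataSet already contains the schema the diffgram refers to;
// Clone() copies the schema but not the data.
DataSet ds2 = ds.Clone();
ds2.ReadXml(strFile, XmlReadMode.DiffGram);
```

Writing with XmlWriteMode.WriteSchema instead embeds the inline XSD schema, which lets ReadXml rebuild the tables from scratch at the cost of a larger file.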
Working with Paged Data Sources There is a subtler reason that makes caching vital in ASP.NET. ASP.NET relies heavily on postback events, so when posted back to the server for update, any page must rebuild a consistent state. Each control saves a portion of its internal state to the page's view state bag. This information travels back and forth as part of the HTML. ASP.NET can restore this information when the postback event is processed on the Web server. But what about the rest? Let's consider the DataGrid control.
The DataGrid gets its contents through the DataSource property. In most cases, this content is a DataTable. The grid control does not store this potentially large block of data in the page's view state. So you need to retrieve the DataTable each time a postback event fires and whenever a new grid page is requested for viewing. If you don't cache data, you're at risk: you repeatedly download all the data—say, hundreds of records—just to display the few that fit into a single grid page. If data is cached, you significantly reduce this overhead. That said, custom paging is probably the optimal approach for improving the overall performance of pagination. I covered DataGrid custom paging in the April 2001 issue. Although that code was based on Beta 1, the key points still apply. I'll review some of them here.
To enable custom pagination, you must set both the AllowPaging and AllowCustomPaging properties to True. You can do that declaratively or programmatically. Next, you arrange your code for pagination as usual and define a proper event handler for PageIndexChanged. The difference between custom and default pagination for a DataGrid control is that when custom paging is enabled, the control assumes that all the elements currently stored in its Items collection—the content of the object bound to the DataSource property—are part of the current page. It does not even attempt to extract a subset of records based on the page index and the page size. With custom paging, the programmer is responsible for providing the right content when a new page is requested. Once again, caching improves performance and scalability. The caching architecture is mostly application-specific, but I consider caching and custom pagination vital for a data-driven application.
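A minimal setup might look like the following sketch; CreateDataSource stands for whatever routine loads a single page of records, and totalRecordCount is an assumed variable holding the overall row count:

```csharp
grid.AllowPaging = true;
grid.AllowCustomPaging = true;
grid.PageSize = 20;
// With custom paging, the grid cannot count the rows itself, so you
// must supply the total so that the pager renders correctly
grid.VirtualItemCount = totalRecordCount;

void Grid_PageIndexChanged(Object sender, DataGridPageChangedEventArgs e)
{
    // The programmer, not the grid, provides exactly one page of data
    grid.CurrentPageIndex = e.NewPageIndex;
    grid.DataSource = CreateDataSource(e.NewPageIndex);
    grid.DataBind();
}
```

Note that VirtualItemCount is what keeps the pager honest: the grid renders page links based on it, while Items only ever holds the current page.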
Data Readers To gain scalability I'd always consider caching. However, there might be circumstances (such as highly volatile tables) in which project requirements lead you to consider alternative approaches. If you opt for getting data each time you need it, then you should use the DataReader classes instead of DataSets. A DataReader class is filled and returned by command classes like SqlCommand and OleDbCommand. DataReaders act like read-only, firehose cursors. They work connected, and to be lightweight they never cache a single byte of data. DataReader classes are extremely lean and are ideal for reading small portions of data frequently. Starting with Beta 2, a DataReader object can be assigned to the DataSource property of a DataGrid, or to any data-bound control.
By combining DataReaders with the grid's custom pagination, and both with an appropriate query command that loads only the necessary portions of records for a given page, you can obtain a good mix that enhances scalability and performance. Figure 6 illustrates some C# ASP.NET code that uses custom pagination and data readers.
As mentioned earlier, a DataReader works while connected, and while the reader is open, the attached connection remains busy. This is the price you pay for getting up-to-date rows while keeping the Web server's memory free. To avoid defeating the purpose of this approach, the connection must be released as soon as possible, and that can happen only if you code it explicitly. The procedure that performs data access opens the connection, executes the command, and returns an open DataReader object. When the grid is going to move to a new page, the code looks like this:

grid.DataSource = CreateDataSource(grid.CurrentPageIndex);
grid.DataBind();

Once the grid has been refreshed (DataBind does that), explicitly closing the reader is key, not only to preserve scalability but also to prevent the application's collapse. Under normal conditions, closing the DataReader does not guarantee that the connection will be closed, so do that explicitly through the connection's Close or Dispose method. Alternatively, you can synchronize the reader and the connection by assigning the reader a particular command behavior, like so:

dr = cmd.ExecuteReader(CommandBehavior.CloseConnection);

In this way, the reader enables an internal flag that automatically leads to closing the associated connection when the reader itself gets closed.
SQL Statements The standards of the SQL language do not provide special support for pagination. Records can be retrieved only by condition, according to the values of their fields, not by absolute or relative position. Retrieving records by position—for example, the second group of 20 records in a sorted table—can be simulated in various ways. For instance, you could use an existing or custom field that contains a regular series of values (such as 1-2-3-4) and guarantee that its content stays consistent across deletions and updates. Alternatively, you could use a stored procedure made of a sequence of SELECT statements that, through sorting and temporary tables, reduces the number of records returned to a particular subset. This is outlined in the following pseudo SQL:

-- the first page*size records are, in reverse order, what you need
SELECT TOP page*size field_names INTO tmp
FROM table ORDER BY field_name DESC
-- only the first "size" records are, in reverse order,
-- copied in a temp table
SELECT TOP size field_names INTO tmp1 FROM tmp
-- the records are reversed and returned
SELECT field_names FROM tmp1 ORDER BY field_name DESC

You could also consider T-SQL cursors for this, but normally server cursors are the option to choose when you have no other option left. The previous SQL code could be optimized to do without temporary tables which, in a session-oriented scenario, could create serious management issues as you have to continuously create and destroy them while ensuring unique names.
More efficient SQL can be written if you omit the requirement of performing random access to a given page. If you allow only moving to the next or previous page, and assume to know the last and the first key of the current page, then the SQL code is simpler and faster.
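For example, assuming a unique, indexed key column and a known last key for the current page, a "next page" command can be as simple as this sketch (the table, column, and variable names are made up for illustration):

```csharp
// Fetch the next page of 20 rows: everything after the last key
// shown on the current page, in key order
SqlCommand cmd = new SqlCommand(
    "SELECT TOP 20 * FROM Orders " +
    "WHERE OrderID > @LastKey ORDER BY OrderID",
    conn);
cmd.Parameters.Add("@LastKey", lastKeyOnCurrentPage);
```

A "previous page" query is symmetrical: select the TOP 20 rows with keys below the first key of the current page in descending order, then reverse them for display.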
Conclusion Caching was already a key technique in ASP, but it's even more important in ASP.NET—not just because ASP.NET provides better infrastructural support for it, but because of the architecture of the Web Forms model. A lot of natural postback events, along with a programming style that transmits a false sense of total statefulness, could lead you to bad design choices like repeatedly reloading the whole DataSet just to show a refreshed page. To make design even trickier, many examples apply programming styles that are only safe in applications whose ultimate goal is not directly concerned with pagination or caching.
The take-home message is that you should always try to cache data on the server. The Session object has been significantly improved with ASP.NET and tuned to work in most common programming scenarios. In addition, the Cache object provides you with a flexible, dynamic, and efficient caching mechanism. And remember, if you can't afford caching, custom paging is a good way to improve your applications.
Send questions and comments for Dino to firstname.lastname@example.org.