MSDN Magazine > Issues and Downloads > 2000 > November >  Basic Instincts: To Cache or not to Cache
This article may contain URLs that were valid when originally published, but now link to sites or pages that no longer exist. To maintain the flow of the article, we've left these URLs in the text, but disabled the links.
MSDN Magazine
To Cache or not to Cache
Ted Pattison
W
hen you're designing a Web application, state management always involves tough choices. For instance, you must decide whether it's necessary to make a trip to the database management system (DBMS) to access data. There are times when avoiding round-trips can give you a definite performance boost, and times when reading and writing directly to the DBMS during a client request is the only sensible option.
      In past columns I have discussed quite a few topics that are relevant to state management. In Advanced Basics in the February 1999 issue of Microsoft Internet Developer, I covered threading and state management in Internet Information Services (IIS) and discussed why it's inappropriate to cache Visual Basic®-based objects in an ASP Application or session variable. In the June 1999 Advanced Basics column, I talked about the management of client-specific state in a Web farm environment. This month's column will build on these topics by discussing when it's appropriate to avoid the DBMS and when it's not.

Caching Data on the Web Server

      Many developers have found they can significantly improve application performance with caching techniques that share in-memory data across a set of ASP pages and COM objects. The increase in application performance usually results from a reduced number of round-trips to the DBMS. It's much faster for a Web application to process a client's request using data that's close at hand, as opposed to reading and writing data that lives across the network in a DBMS.
      Using Visual Basic, you can employ three common techniques that cache data on the Web server. First, I'll look at using a .BAS module (also known as a standard module). Then I'll examine the use of the shared property manager (SPM). Finally, I'll describe the use of ASP Application variables to cache data. Each of these techniques has advantages and drawbacks with respect to performance, synchronization, and concurrency. As you'll see, each has certain advantages over the others. It's important to weigh quite a few factors when you're designing a data-caching strategy for a Web application built with components written in Visual Basic.
      The first technique is implemented entirely with Visual Basic. You can share data across Visual Basic-based objects by defining constants and variables in a .BAS module. Defined in this way, constants and variables can be seen by any component in the same ActiveX® DLL project. The most important thing to note about defining a constant or variable in a .BAS module is that it's private to a specific thread. Visual Basic stores all .BAS module data using thread-local storage (TLS), shown in Figure 1. This means there can be multiple instances of a constant or a variable loaded into a single process.

Figure 1 .BAS Module Data
Figure 1 .BAS Module Data

      When many users are accessing your Web application at the same time, the ASP runtime processes client requests using a pool of single-threaded apartment (STA) worker threads. When ASP pages create objects from one of your components, these objects get distributed across the various threads in the pool. Whenever each specific thread creates an object from your ActiveX DLL for the first time, the DLL initializes all its .BAS module data in TLS.
      Why should you cache data with a .BAS module rather than using one of the two other techniques? Speed, which can be as much as ten times greater than the other two techniques, is the primary advantage. It's quicker for an object to access TLS data than it is to access heap-based data (both the SPM and the ASP Application object rely on heap-based data). The performance of the shared property manager and the ASP Application object are also affected by the need to perform a dictionary lookup based on a string value at runtime.
      Another reason you get faster performance when accessing .BAS module data is that it never involves locking. Since all .BAS module data is thread-specific, there are no concurrency issues that require synchronization. There's no need to acquire and release locks to maintain data consistency. Both the SPM and ASP Application, on the other hand, must deal with concurrency issues. Therefore, they provide their own built-in locking schemes to prevent data from becoming inconsistent in the presence of multiple threads. While locking and synchronization is often critical, the need to acquire and release locks negatively affects performance.
      There are two main drawbacks to using .BAS module data. Both are related to the fact that .BAS module data requires one instance per thread. First, having redundant copies of the same data in a process wastes memory. Second, when you have multiple copies per process it's usually impractical to use the data in any way other than a read-only fashion. Let's concentrate on memory usage first, and then I'll discuss how multiple copies of a variable affects data consistency.
      When you design a middle-tier application based on IIS, Microsoft® Transaction Services (MTS), or COM+, you must consider that you're relying on a hosting process that runs lots of threads. For example, the process for an MTS server package can run up to 100 threads. The process for a COM+ application typically runs somewhere between 8 and 40 threads, depending on how many processors you have. The ASP runtime in IIS 4.0 runs an STA thread pool of 10 threads per process, per processor, by default, whereas the ASP runtime in IIS 5.0 runs 25. As you can see, there are lots of threads per process in these middle-tier runtime environments, and each thread gets its own private instance of each constant and each variable defined in a .BAS module.
      When you're using constants defined in .BAS modules, you're giving away memory in exchange for better performance. It's not difficult to do the math to find out how much memory you need. Just multiply the size of your .BAS module data by the number of threads in the hosting process. Many developers are glad to give up memory for better performance.
      Note that it's important to build your ActiveX DLLs with the Retained In Memory option (see Figure 2). Selecting this option leads to better performance because it prevents the DLL from unloading its .BAS module data on any thread for the lifetime of an application. If you build a DLL without this option, .BAS module data is continually unloaded and reloaded on a thread-by-thread basis. The Retained In Memory option is important because it guarantees that .BAS module data is initialized only once per thread. Also note that you should have the latest service pack for Visual Studio® 6.0. Service Pack 3 fixed quite a few bugs relating to the Retained In Memory option.

Figure 2 The Retained In Memory Option
Figure 2 The Retained In Memory Option

      Now let's discuss the second problem with .BAS module data. Having multiple copies of a variable causes problems with data consistency; namely, different objects in the same process see different instances of the same variable. You can't guarantee which objects share the same instance. This means that one object running in a multithreaded environment cannot reliably see changes written to the variable by another object. This makes it impossible (or at least very difficult) to use .BAS module variables that are read/write as opposed to read-only.
      Many programmers avoid declaring .BAS module variables in ActiveX DLLs in the middle tier for the reasons I've just discussed. Other programmers have found that they can initialize .BAS module variables when the DLL loads onto a thread. Things can work reliably if you have the discipline to use .BAS module variables in a read-only fashion once they've been initialized. This provides the ability to load in cached data dynamically with values that don't need to be compiled into your DLLs. The use of dynamic data makes such an approach more flexible than using constants. For example, you can load in data you've retrieved from a DBMS.
      The easiest way to initialize a .BAS module variable is to supply a Sub Main procedure in your ActiveX DLL project. The Sub Main procedure must also be declared in a .BAS module. As long as you set the startup object to Sub Main in the Project | Properties dialog, this procedure fires whenever your DLL is initialized on a thread. This means you can write an implementation of Sub Main that initializes your .BAS module variables on a thread-by-thread basis. Again, make sure to use the Retained In Memory option. You don't want to run the Sub Main procedure more than once on any specific thread.
      There is one last thing to note about using .BAS module variables. If you change your DLL project's threading model to single-thread, you force every object from the DLL to load onto a single thread known as the main STA thread. Since all objects created inside the same process from a single-threaded DLL always share the same thread, it means they all see the same instance of a .BAS module variable. This allows you to reliably read and write to a .BAS module variable from every object created from the same DLL.
      While a single-threaded DLL seems to solve certain problems with data consistency, it does far more damage to an application's concurrency. All client requests are bottlenecked through a single STA thread. For this reason, the use of single-threaded DLLs should be avoided when creating components for IIS, MTS, and COM+ in Visual Basic.

Using the Shared Property Manager

      The second important caching technique involves using the SPM, which is a small component library that's built into both the MTS and COM+ runtimes. The SPM is a name/value dictionary that allows you to read and write to named property values. One of the big advantages to using the SPM is that it allows objects running on different threads within the same process to see a single instance of a named property value, as shown in Figure 3. This means that the SPM uses memory more efficiently than .BAS module data. In addition, updates to the SPM by one object can reliably be seen by all the other objects inside the same process, regardless of who's running on what thread.

Figure 3 Caching with the Shared Property Manager
Figure 3 Caching with the Shared Property Manager

      Since shared properties in a property group are accessible from any thread, it's important to think through all relevant issues regarding concurrency and synchronization. Fortunately, the SPM provides an easy-to-use scheme where an object can acquire an exclusive lock while making a series of read and write operations to properties within the same group. In other words, the SPM makes it possible for an object to perform a sequence of read and/or write operations in isolation. This can prevent scenarios where one object sees the partial, incomplete work of another.
      Figure 4 is a sample of a component that updates two shared properties in isolation. The first thing I'd like to point out in this example is the second parameter passed to the CreatePropertyGroup method. The parameter value of LockMethod sets the property group's isolation level. An isolation level of LockMethod informs the SPM to hold locks for the duration of the current method call. That is, once an object touches any property within a property group, it acquires an exclusive lock until the current method ends. The logic to release this lock is built into the MTS and COM+ interception scheme.
      Locking is great for data consistency. However, it adversely affects concurrency. You should avoid acquiring locks when and where they're not needed. It's not usually necessary to lump all your data together into a single property group; in fact, it's often better to split your data up into multiple groups because each group handles locking independently. While the ASP Application object also provides locking, it only provides a single application-wide lock. The SPM provides a more granular locking scheme that makes it possible to optimize concurrency while designing a synchronization scheme to prevent data inconsistency.
      Some shared property groups don't need such drastic locking behavior. If you create a property group with an isolation level of LockSetGet instead of LockMethod, locks are not held for the duration of an object's method call. Instead, locks are acquired and released each time a property is accessed. This means LockSetGet has a lower impact on concurrency. However, it also means you could be acquiring and releasing locks more often. With an isolation level of LockSetGet, it's possible to acquire and release a lock on the same property group several times inside a single method call. This is never the case with an isolation level of LockMethod.
      It's important to note that you cannot disable locking with the SPM. You have to create a property group with either LockMethod or LockSetGet. This is unnecessarily taxing, especially in situations where shared properties are read-only once they've been initialized. Access would be noticeably faster if the SPM allowed you to read a shared property without acquiring and releasing a lock.
      Another shortcoming of the SPM with respect to locking is that it always uses exclusive locks. The SPM is not as smart as a DBMS. A DBMS knows to use shared locks rather than exclusive locks for read operations. Shared locks prevent one reader from blocking another. With exclusive locking, one reader's locks block other readers. Shared locks, therefore, have a lower impact on concurrency and result in far less blocking.
      Here's another reason that the SPM is noticeably slower than .BAS module variables. When you access the SPM, you're typically required to create and release three different COM objects (as shown in Figure 4). Creating and destroying objects requires processing cycles. This is another factor that widens the performance gap between accessing .BAS module data and the SPM.
      One last advantage of the SPM is its ability to share data in ways that are not possible with .BAS module data. You can share data across multiple ActiveX DLL projects as long as they are running inside the same process. You can share data between components written in Visual Basic and components written in other languages, such as C++. However, you should keep in mind that it is not practical for an ASP page to use the SPM directly. While the SPM interfaces are exposed as dual, the SPM components expose methods that cause some problems, so it's not worthwhile trying to access them from an ASP page in which you're using VBScript or JavaScript.

Using an ASP Application Object

      The final caching technique I'm going to examine is the use of the ASP Application object. A Visual Basic-based object that's been created from an ASP page can create and access an ASP Application variable like this:

Dim appl As ASPTypeLibrary.Application
Set appl = GetObjectContext("Application")
appl.Value("MyProp") = "My quintessential value"
      Of course, you must reference both the ASP type library and the MTS and COM+ type library in order to access the ASP Application object in this fashion. You should also note that with COM+, the IISIntrinsics attribute of your configured component must be set to True. This is the default value, so you usually don't have to adjust anything. Also note that this attribute is not accessible through the COM+ Services administrative tool and is only accessible through the COM+ Admin objects.
      The ASP Application object differs from the SPM in that the names of its variables do not have a processwide scope. Instead, ASP Application variables are scoped within an IIS application (that is, a virtual directory). It's possible for two different IIS applications to run in the same process. The ASP Application variables of one IIS application are invisible and inaccessible to the other. This is a good thing because one IIS application cannot step on the ASP Application variables of another IIS application if there's a naming conflict. However, it's limiting because you cannot share ASP Application variables across IIS applications. (I guess this is the reason they're called Application variables to begin with.)
      When it comes to synchronization and locking, the ASP Application object doesn't provide any control over granularity. There is only one coarse-grained lock per application. This scheme is not nearly as flexible as the SPM because locking ASP Application variables is an all-or-nothing proposition. With the SPM you can lock shared properties in one group without locking the shared properties in other groups. In contrast, whenever a Visual Basic-based object or an ASP page acquires the lock on the ASP Application object, all other requests attempting to access ASP Application variables are blocked until the lock is released. Like the SPM, the ASP Application object never uses shared locks; it always uses exclusive locking. As you can imagine, designs that frequently lock the ASP Application object can create quite a bottleneck in high-volume applications.
      Unlike the SPM, the ASP Application object doesn't force you to use locks if you don't want them. Let's look at an example of when this is helpful. Imagine that you're required to initialize a large set of environmental variables in the ASP Application object upon application startup. If you acquire a lock on the ASP Application object, you can guarantee that no client request sees your data until you've initialized everything into a valid state. Once you've completed the initialization, you can release the lock and use these variables in a read-only fashion for the rest of the application's lifetime. As long as the data is read-only, there's no reason to acquire and release locks. The fact that ASP Application variables can be accessed without locking means that they are slightly faster than using the SPM.
      There's one last advantage that ASP Application variables have over the other two techniques. They can be accessed directly from ASP pages as well as Visual Basic-based components. This can be very convenient in projects where the developers using ASP and those using Visual Basic are working side by side.

Choosing the Best Approach

      I've examined three of the more common server-side data-caching techniques used by Visual Basic-based programmers. You now know to ask the right questions. Is achieving better performance more important than conserving memory? Can you use cached data in a read-only fashion, or is it necessary to make updates over the lifetime of the application? Do you need locking to synchronize access and maintain data consistency? If you don't need locking, how can you avoid it? If you do need locking, how can you minimize the amount of items you're locking down at any one time? Asking these questions will help you the next time you need to design a data-caching scheme for a Web application.
      You should also keep in mind that there is a fourth option. You might decide that no available caching technique is sufficient for your needs. For example, none of these three techniques I've just discussed offer any form of persistence or durability. In-memory data is fast to access, but it's easily lost due to critical problems like a system failure or an application crash. None of these techniques provide transaction support for automatic rollback. And none provide a way to share data across processes or across machines. Finally, neither the SPM nor the ASP Application object utilize shared locks. They only use exclusive locks, which have a much higher impact on concurrency.
      While there are times where it makes sense to avoid DBMS access, many other times it's simply the best (and only practical) solution. Here's a list of requirements that should make you consider using a DBMS:

  • Data needs to be persisted for durability and recoverability
  • Read locks are needed for improved concurrency
  • Updates need to be rolled back in the event of an error condition
  • Locks need to be synchronized with locks from other data sources
  • Data needs to be shared across process and machine boundaries

Limitations of Data Caching Techniques

      When you design a caching scheme, you typically make a copy of frequently used data items and store them in a place that provides faster access. As long as the real data items (that live in the DBMS) and the copies (that live on the Web server) stay in sync, things are fine. However, when data needs to be updated, it introduces issues related to consistency. For example, if another application writes changes to the DBMS, the copies on the Web server can become inconsistent. You should observe that it's much easier to design a caching scheme for data that's read-only as opposed to data that's updated over the lifetime of an application.
      Let's look at an example of designing a caching scheme for a Web application. Imagine that you need to generate a Web page that displays product inventory levels to your users. You decide to cache these product inventory levels in ASP Application variables to reduce round-trips to the DBMS. Now assume that another application is updating these inventory levels directly to the DBMS at periodic intervals. The product levels you're caching on the Web server are going to get out of sync with the live data in the DBMS. Consequently, you need to add some type of logic to your application to refresh the data.
      One strategy is to cache a timestamp value along with the product inventory levels. When a client request sees that the cached data is older than some configurable refreshing interval (say, five minutes), you make another round-trip to the DBMS to retrieve and cache the latest values. For the next five minutes, no other client request requires a round-trip to the DBMS.
      A refreshing technique like this can offer a measurable performance win if your application can tolerate small inconsistencies between what's actually in inventory and what's perceived to be in inventory. However, there's always a chance a user will retrieve an inconsistent value. If you need to guarantee that your application always retrieves consistent values, you might be required to make a round-trip to the DBMS for each client request. As you can see, there's a definite trade-off between performance and data consistency. The best solution for one application might be totally inappropriate for another.
      Now let's consider a more difficult scenario. Imagine you're writing a method implementation for a client request that's required to update a cached data item. Remember, when you modify an in-memory data item in the SPM or the ASP Application object, nothing is written to disk. This means updates are not recoverable in the event of an application crash. In most applications, losing updated data in this manner is unacceptable. When your method returns without raising an error, the client should be guaranteed that its changes have been stored in some durable format and that its changes are recoverable in the event of an application crash.
      While implementing a simple scheme to persist updates to disk might not prove too difficult, adding code to recover after an application crash or to optimize response times during larger updates is far more challenging. For example, consider how you'd answer the following questions. When your application starts up, how does it know if there has been an application crash and it needs to recover data from disk? How do you back up and reload persisted data to protect against hard disk failure? When there's an update, how much data do you need to write out to disk before returning control back to the client? Is it possible to perform an optimization by writing the minimal amount of data that represents the delta between the old data and the new data? (The transaction log that is generated by SQL Server™ provides this type of optimization.)
      The issues I've covered are all relevant, and must be considered when designing a durability scheme for a multitier application. Think about how much time it will take to design, implement, test, and debug the code to make your data durable and recoverable. Now compare that investment to the licensing fee for an installation of SQL Server. If SQL Server can offer comparable performance to a custom caching scheme, it's going to help save you a great deal of time and money. Before you make a judgment on whether you need a DBMS, there are a few more important issues that you need to know about.

Locking and Concurrency

      A noteworthy limitation with both the SPM and the ASP Application object is their inability to use shared locks. Both the SPM and the ASP Application only offer exclusive locking. Once a client request acquires an exclusive lock for a particular data item, it blocks every other client until the lock is released. In many situations, exclusive locking is overkill. Shared locks can offer the required levels of consistency with significantly less impact on an application's concurrency.
      A DBMS provides an optimized form of concurrency by complementing exclusive locks with shared locks. A client with a shared lock blocks another client that is attempting to perform a write operation. However, a shared lock does not block other clients that are attempting a read operation on the same data item. This additional level of locking granularity can provide faster response times and higher levels of throughput because it prevents readers from blocking one another.
      Due to their simplistic locking schemes, the SPM and ASP Application object are best used for read-only data that's loaded in at application startup. They can be used to a lesser extent in scenarios where data changes are fairly infrequent. As discussed earlier, periodic data refreshing can boost performance by reducing DBMS round-trips. However, in a large-scale app with frequent updates you need more sophisticated locking support. Neither the SPM nor the ASP Application object provide a satisfactory solution.
      What are your options? You could create a custom C++ implementation similar to the SPM that includes the logic to use shared locking in addition to exclusive locking. However, that's a pretty expensive undertaking, and it might require expertise that your company doesn't possess. Alternatively, you could save yourself lots of time and money by taking advantage of the sophisticated locking behavior that's built into a DBMS such as SQL Server.
      You'd be surprised at how well SQL Server performs when it's installed on the same computer as a Web application. While accessing data in a local SQL Server database requires cross-process calls, it's not that much slower than accessing in-memory data with the SPM or the ASP Application object. You should try running your own benchmarks because the performance differences are far less than most developers expect. The internal data caching scheme in SQL Server is very fast, and its use of shared locks might improve your application's performance.

Enforcing the ACID Rules

      While the SPM and the ASP Application object provide isolation through exclusive locking, they are unlike a DBMS because they cannot enforce the ACID rules (atomicity, consistency, isolation, and durability). For a primer on these rules, see the sidebar "Transactions in Distributed Applications" in the article "Use MSMQ and MTS to Simplify the Building of Transactional Applications" (MSJ, July 1998).
      You've already seen that neither the SPM nor the ASP Application object provide durability. Now let's consider the issue of atomicity. Neither the SPM nor the ASP Application object provide any type of rollback facilities. For example, what happens if you modify data in the SPM from inside a COM+ transaction and then roll back the transaction? The changes you've made to SPM data are not automatically reversed. That means you have to provide custom code to detect error conditions and undo changes by hand. However, if you write your changes to a DBMS, reverting those updates is as easy as aborting the current transaction. This is a huge benefit because it eliminates the need to write code for undoing changes.
      You should also realize that the locks held by the SPM or the ASP Application object are never coordinated with any other data source. For example, if you make a change to a data item in the SPM and another change to a data item in an Oracle database from within a COM+ transaction, the locks on those two data items are not synchronized. The SPM typically releases its locks as soon as the current method call finishes before the distributed transaction coordinator (DTC) starts to run the two-phase commit protocol. This lack of synchronization can lead to violations of the ACID rules.
      How would moving data items from the SPM over to the local SQL Server database change the scenario that I've just described? Since SQL Server is the resource manager, you can use a COM+ transaction to synchronize locks held across multiple resource managers. For example, from within a COM+ transaction you could establish a local connection to the SQL Server database and another remote connection to the Oracle database. Since both connections are auto-enlisted in a single distributed transaction, the DTC synchronizes the locks held by both DBMSs to enforce the ACID rules.

Sharing Data Across Process and Machine Boundaries

      Neither the SPM nor the ASP Application object provide a way to share data across process boundaries. Two Visual Basic-based objects can only see the same SPM data when they run inside the same process. Two Visual Basic-based objects (or ASP pages) can only see the same ASP Application object data when they run inside the same process and the same ASP Application.
      How do you share data between Visual Basic-based objects that are running in different processes on the same computer? One possible approach is to use a low-level programming technique used in Win32® such as memory mapped files. Unfortunately, this approach isn't very practical to implement in Visual Basic. You'd be better off using another language that is more suited for the task, such as C++.
      Alternatively, you could share data between processes by writing it out to a file on disk. For example, Visual Basic-based objects running in different processes can share data by reading and writing to a common file. However, with this approach you'd have to deal with the issues of persistence and durability discussed earlier. You'd also more than likely have to incorporate a locking scheme to deal with concurrency issues. As you can imagine, it's tough to design and implement a locking scheme that allows multiple processes to concurrently access the same file without compromising data consistency.
      As you can see, sharing data across processes on the same computer is not an easy task to accomplish. However, sharing data across computer boundaries is even more difficult. Moreover, sharing data across computer boundaries is a very common application requirement. This is especially true when the code is designed for a Web farm, where client requests are arbitrarily redirected to different Web servers on a request-by-request basis. Your code must be able to access application data and session data from any one of several Web servers in the farm.
      While sharing data across process boundaries using simple files might work in a local scenario, it doesn't scale across the network. In an application designed to accommodate a medium to large-size user base, it doesn't make sense to share data using files on a remote network share. Anyone who's tried to scale an application based on Microsoft Access .MDB files can attest to this. There must be a dedicated process on the computer with the shared data that's conducting all the I/O locally with the file system.
      Let's walk through one possible solution for maintaining session state in a Web farm environment without using a DBMS. You could dedicate one computer (a COM+ application server) to hold all the session state for every user. When processing a client request on one of the Web servers in the farm, you can activate an object on the COM+ application server. Objects running on the COM+ application server can all run in the same process and can, therefore, read and write shared data using the SPM. When a second request from the same client is redirected to another Web server in the farm, it too could activate an object on the COM+ application server to access data written in the previous request. This approach provides a foundation for reading and writing cached data in a Web farm environment.
      Once you've gone this far, you probably want to add the logic to persist all updates and make them durable to protect against system failures. If concurrent access exposes your application data to inconsistency, you'll need to add a locking scheme. If you need rollback support, you'll have to devise an update buffer to build in support for automatic rollback. If your locks need to be synchronized with other data sources, you might even be required to add the support to make your application a resource manager that can interoperate with the DTC.
      What does this example demonstrate? It shows that many of today's multitier applications present complex problems that have already been solved by the DBMS. There is no reason to reinvent the wheel. To a large extent, it doesn't really matter whether you use SQL Server, Oracle, or DB2. All these DBMS vendors have already taken the time and effort to solve the same problems encountered over and over again. Their code has gone through many generations of design, implementation, testing, and debugging. When you purchase a license for a DBMS, you get value for your money.
      Many developers envision that they can improve performance by designing a scheme to cache and synchronize application and session data across computers in a Web farm. However, many companies that have tried to design and implement such a scheme have found that they cannot significantly outperform a much simpler design that uses a DBMS.

Ted Pattison is an instructor and researcher at DevelopMentor (http://www.develop.com), where he co-manages the Visual Basic-based curriculum. The second edition of Ted's book, Programming Distributed Applications with COM and Visual Basic 6.0, was recently published by Microsoft Press.

From the November 2000 issue of MSDN Magazine.

Page view tracker