Test the Design and Not Only the Implementation
The role of testing is sometimes incorrectly perceived as simply verifying whether the implementation conforms to the design specification. The problem with testing only the implementation is that one may miss the case where the design itself is incorrect in the first place. The design captures the customer scenario, and it is important to understand and validate it before looking at the implementation. The following bug, which we found and fixed while testing BizTalk Server 2006, reflects exactly this problem: it should have been caught much earlier, at the design table, rather than at a late stage.
Messages that are processed by BizTalk Server (BTS) can be saved for later retrieval by explicitly choosing to track them. The message bodies are initially present in the live database (MessageBoxDb) and are moved to a separate tracking database (TrackingDb) when processing on a message is completed. Since the TrackingDb grows continuously, we added archiving and purging functionality that periodically takes a backup and prunes the TrackingDb.
Purging of tracking data from the tracking database is done based on a time interval (referred to as the live window from now on) specified by the user. For example, if the user specifies a live window of 24 hours, then BTS first archives the data in the past 24-hour segment and then purges this data continuously over the next 24 hours, before it takes the next archive. The tracking data is moved from the MessageBoxDb to the TrackingDb by a SQL job. Since the job runs periodically at an interval of 1 minute, there is always some latency in this movement. This means that when an archive is taken, there may be some data in the MessageBoxDb that belonged to the 24-hour period but was not archived.
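To make the cadence concrete, here is a toy Python helper (purely illustrative; the product implements this as SQL jobs, and the function name here is hypothetical) that maps a record's completion time to the 24-hour segment, and hence the archive, it should belong to:

```python
from datetime import datetime, timedelta

LIVE_WINDOW = timedelta(hours=24)  # user-configured live window

def archive_segment(completed_at, first_archive):
    # Index of the live-window segment (and hence the archive) that a
    # record's completion time falls into, counting from the first
    # archive boundary. timedelta // timedelta yields an integer.
    return (completed_at - first_archive) // LIVE_WINDOW

first = datetime(2006, 1, 1, 0, 0)
print(archive_segment(datetime(2006, 1, 1, 12, 0), first))  # 0: first archive
print(archive_segment(datetime(2006, 1, 2, 3, 0), first))   # 1: second archive
```

The latency described above means that a record can *belong* to segment 0 by completion time but not yet be physically present in the TrackingDb when the segment-0 archive is taken.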
We were working on the final documentation of the feature when one of the UX team members asked, "What will happen to this data that has not yet been archived? Does the user need to do anything special to back up this data?" We initially thought the user would be fine, since this data would be archived in the next TrackingDb backup. We obviously could not delete any data before taking a backup, and we had taken this latency into account by adding a 10-minute buffer to our window calculations. How wrong we were...
We initially stored only the timestamp values recording when a message started and completed processing. In an effort to minimize database schema changes, we decided to use these timestamps to determine what to purge. This decision is what broke our purging story.
The purging logic compared a message's completion time with the current time to determine whether the record was older than the live window and should be deleted. In the basic case of a 24-hour window, we purge all data whose completion timestamp is older than 24 hours. Every 24 hours we take a backup, so that (supposedly) every record ends up in one of the archives (that is what the backup is: an archive).

The problem arises when there is a delay between when the data is tracked and when it is inserted into the tracking database. Because of this delay, we could conceivably delete data that we assumed was in the previous backup even though it was not, simply because it took longer to move. The 10-minute overlap between backups gave us some redundant data, and this overlap is why we had not caught the problem: as long as a record was moved within 10 minutes, we were fine. But if the backlog ever exceeded 10 minutes (say, the tracking server crashed and took more than 10 minutes to restore), we would delete tracking data that was not in any archive, and it would be lost for good.
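The failure mode can be sketched with a toy Python model (an illustration only, not the product's actual T-SQL; the function names and the way the 10-minute overlap is modeled are assumptions for the sketch):

```python
from datetime import datetime, timedelta

LIVE_WINDOW = timedelta(hours=24)  # user-configured live window
BUFFER = timedelta(minutes=10)     # overlap between successive backups

def is_archived(inserted_at, archive_taken_at):
    # A row makes it into the archive covering its window only if the
    # movement job inserted it into the TrackingDb within the 10-minute
    # overlap after the archive boundary (modeling assumption).
    return inserted_at <= archive_taken_at + BUFFER

def should_purge(completed_at, now):
    # Flawed logic: age a row by its *completion* time, with no regard
    # for whether it ever made it into any archive.
    return now - completed_at > LIVE_WINDOW

archive_time = datetime(2006, 1, 2, 0, 0)         # daily archive boundary
completed = archive_time - timedelta(minutes=5)   # finished inside the window
inserted = archive_time + timedelta(minutes=25)   # movement backlog: 30 minutes

now = archive_time + LIVE_WINDOW                  # roughly one live window later
lost = should_purge(completed, now) and not is_archived(inserted, archive_time)
print(lost)  # True: the row is purged but was never in any archive
```

A row that is moved within the 10-minute overlap is safe; in the scenario above the 30-minute backlog makes the row miss its archive, yet the completion-time check still deletes it.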
The fix was conceptually simple but disruptive, since we were almost at the end of the test pass and it involved database schema changes. We needed to add a new column, storing the timestamp at which a message was inserted into the TrackingDb, to every table the purge job operated on. The purging logic then decided whether to delete a message based on when it was inserted into the tracking database rather than when the message completed processing. Not only did this affect the normal test schedule, it also reset the upgrade testing and the performance and stress testing: since the tracking database would now be slightly larger, tests had to be re-run to ensure we still met the throughput criteria.
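Continuing the toy model (again an illustration, not the shipped T-SQL; names are hypothetical), the corrected check ages a row from its insertion into the TrackingDb, so a delayed row always survives long enough to be captured by an archive:

```python
from datetime import datetime, timedelta

LIVE_WINDOW = timedelta(hours=24)  # user-configured live window

def should_purge_fixed(inserted_at, now):
    # Fixed logic: age a row from the moment it was *inserted* into the
    # TrackingDb (the new column), not from when the message completed.
    # A row cannot become "too old" before it has been in the database
    # for a full live window, i.e. long enough to appear in an archive.
    return now - inserted_at > LIVE_WINDOW

archive_time = datetime(2006, 1, 2, 0, 0)
inserted = archive_time + timedelta(minutes=25)  # moved 25 minutes late
now = archive_time + LIVE_WINDOW                 # one live window later
print(should_purge_fixed(inserted, now))  # False: kept until the next archive runs
```

Under the old completion-time check the same row would already have been deleted; here it is retained for a further 25 minutes, past the next archive.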
We missed this scenario during our initial design discussions and review, perhaps blinded by the benefit of avoiding any database schema change. Paying for that oversight at the end of the product cycle was far more painful.
About the author:
Vishal Chowdhary is in his 3rd year at Microsoft. He started with the BizTalk Server team where he owned the new OperationsOM API testing along with parts of BizTalk Server tracking features. Presently, he is involved with designing the test framework and test planning for future versions of BizTalk Server.