Data Confidence over the Web
by Peter Hammond
Summary: The successful exchange of data is the most integral function of applications today, and yesterday, for that matter. This article highlights some experiences from the viewpoints of an end user, novice developer, professional consultant, and the development team lead over the course of the past 10 years. I've had some fantastic successes and disastrous failures, and gained many insights along the way. I detail how the architecture of the past combined with the technology of today has led to successful exchange of data over the Web. (10 printed pages)
In the fall of 1994, I was transferred to Fairchild Air Force Base in Washington State, after a four-year tour at the beautiful Aviano Air Base in Italy. One of my responsibilities as a security police desk sergeant—a cross between a police dispatch and 911 operator—was to document the activities during my shift. In Aviano, this consisted of a typed 12+ page report with 10 carbon copies, each one distributed to the command for daily review. If any part of the report had to be fixed, the entire report had to be retyped and carbon copies redistributed. We killed a lot of trees. You can imagine my excitement at Fairchild when I was introduced to the new Windows-based Security Police Automation System (SPAS) for police reports. I would later learn that acronyms can predict trouble down the line. This networked client/server application built on FoxPro was a wondrous thing, as you could type out all your reports, correct any issues found, and distribute to command digitally. No more trees would die in vain. So it seemed.
Unfortunately, as time passed and the amount of data grew, SPAS freaked out. After spending hours typing out the detailed notes on an incident and clicking Save, a cryptic "Index Corrupted" error would flash. Even though your work still appeared on the screen, in reality it was gone, although you might not know it until command pulled you back in after you had gone home because your report was missing.
The consequences were swift. Everyone started doubting the SPAS. Superstitions arose about what would cause the system to fail. Was it hitting Save more than once, having a large amount of text, or standing on one leg while you typed with one eye closed? No one knew for sure. In the end we started printing out the entire report again, just in case. Thus began my understanding of the importance of having confidence in data systems. Without confidence in solid data communications, the users (and trees) pay a terrible price.
I started to create my reports in Microsoft Word and then cut-and-paste them into SPAS. I was getting my work done faster since I didn't have to retype all of my entries, even if there was an "Index Corrupted" error, and I quickly realized I had free time. The value of having technology work for me rather than the reverse wasn't lost on me. I spent my newfound time learning more about Microsoft's Office products.
As luck would have it, I got the opportunity to start doing light development building computer-based training programs in Microsoft PowerPoint for the Air Base Operations course (ABO). I created presentations with embedded graphics, sound, and video which made the training a big hit. When we needed to track who had completed the training, instead of using a paper signature sheet, I wrote a testing program that would track who had attended the training. Microsoft Excel was my first choice because that seemed to be where everyone tracked data, from flu shots to training records to test scores, and so on. Using Excel, I was able to define the questions as one sheet and the answers in another, quickly build a testing form, and successfully demonstrate it to my supervisor.
All seemed well until some obvious issues arose. We posted the Excel test to a network share to enable multiple people to take the test only to find out it could only be opened by one person at a time for edit and read-only mode for everyone else. I used hidden sheets to work around the security issue of storing the questions and answers in the same file. These measures failed as people found it faster to use the paper test forms rather than to wait for the single file to become available and some users discovered how to unhide the answer sheet. Yes, more trees suffered.
Little did I know that my baptism by fire into the difficulties of using network access for multiuser tenancy and ensuring real data security was just the tip of the iceberg.
I found Microsoft Access which promised to solve these problems and more, with its support for multiple simultaneous users, user-based security, plus the ease of Excel in defining tables, the ability to have complex data schema, and the advanced form designer.
The new Access testing database was a great success and I was granted server space on the base's CD tower server in the network center so the entire base could complete the mandatory training and tests remotely. The successful results quickly poured in and the training appeared to be going very well—unfortunately, a little too well.
I happened to notice that everyone was getting a 100 percent on all the tests the first time. During the design, I had included a creation and modification timestamp on each user's answer. I knew something was amiss when I ran a report using the difference between the created and modified timestamps on the answers. Most users were finishing the 30 question test in less than 27 seconds. You can imagine my surprise, as I figured out they were cheating! As the test questions were the same sequence, someone had created a cheat template (on paper, I'm sure). Serving the questions randomly helped ensure the tests were being taken legitimately. However, now that people weren't cheating, I had a new problem: The tests were taking longer to complete, so more people were accessing the tests at the same time. The Access database brought me to the harsh realization of what (lack of) scalability meant.
Apparently, the CD server started throwing a "Record Lock Timeout Error." Each one resulted in a notification bell that got pretty annoying to the network administrators, who weren't very happy that I (a security policemen) was playing developer. So they shut off the bell. As the number of users grew and the number of record-locking timeout errors increased, the inevitable occurred: My little Access database dropped the CD tower server and lost multiple in-process tests, including those of several very unforgiving officers. This was particularly bad on its own, however, the impact was far-reaching as the base was in the middle of a huge deployment of both Office 95 and Windows 95 from the CD server. I solved the issue by limiting the number of simultaneous users; however, the lessons of multiuser concurrency, data integrity, network resources and scale were not wasted on me.
In the end, the ABO course was a big success and the training was well received. I went on to build several more database programs with Access, enjoying its rapid development features but always wary of its limitations.
After I completed my second enlistment, I decided to try the civilian sector as a professional consultant. I found a lot of work fixing and upgrading a myriad of Excel- and Access-based systems. It seemed the world ran on Excel, and not very well. I eventually migrated into Visual Basic which enabled me to use Access as a back end and code around most of its issues.
I was afforded the opportunity to work on the Work Order Remote Management System (WORMS). WORMS was based on Microsoft SQL Server 6.5 and was an incredible system. Its scale was huge as it serviced hundreds of users, processing thousands of records. But the system had gone to the worms.
Apparently, WORMS was losing records at random. The loss occurred somewhere between the local SQL database and an Oracle system of the parent company. As a stopgap measure, all the work orders were printed out on paper in the morning, then hand-entered into a separate Access database that was copied to corporate each day to reconcile lost records. You guessed it, no data confidence and more trees killed in a single day then I did in my entire military career.
My job was to baby-sit the system while its replacement, the Automatic Remote Routing Of Workorders (ARROW), was developed. As I tracked down the loss of the records, I was very impressed with the diagnostic capabilities in SQL—from tracing to event logs, I could track all the data from end to end. With this advanced level of diagnostics, I was able to determine the cause and fix it, subsequently finding the records and losing my job all in one fell swoop.
The fix worked so well that it drew into question whether the customer needed the replacement ARROW system and I was quickly pulled from the project to ensure I didn't "fix" anything else. Regardless of the outcome, I was hooked. SQL Server was the answer I had been looking for, and then the real questions started.
As I advanced with SQL Server, the successes followed. Client/server-based applications were solid, easy to build, fast to troubleshoot, and users loved them because when they hit Save, it saved. Confidence in data equaled successful applications.
Years later, I was contracted to work on a project for tracking training that Microsoft conducted as an intranet academy. Using Microsoft .NET and SQL Server 2000, we quickly built a client/server application that met the users' needs extremely well. Over time, however, the number of support issues increased markedly as people tried to use the application remotely. Since it had to load all the records from the server each time it was launched, the performance over a VPN connection was dreadful. Nevertheless, users were happy because they had confidence that their changes were being persisted immediately, and with all the records loaded in memory, the UI was very responsive. The application was used for years.
Eventually, the resource requirements, network connectivity issues, and the explosion of the Internet caused the popularity of client/server architecture to fade. N-Tier implementations, with Web services, were hailed to be the holy grail of networked applications for online data exchange.
We jumped on board without looking back. I say "we" as I had started CyberSavvy, a vending company providing high-end consulting services. I hired the best of the best, some that I had had the privilege of working with over the years or by referral only. We focused on creating best-of-breed solutions for Microsoft's toughest internal business problems.
Web applications were the buzz, but Web interfaces that lost all your data when the page errored or connection went down did not build data confidence. Instead, we focused on building smart client applications using Microsoft .NET. They provided the rich features needed for applications that addressed complex business problems, involving complex relational data from many sources, requiring instant results and the ability to work both on- and offline.
Over the last five years, we built 15 smart client applications for Microsoft and built a strong working relationship with Microsoft's ClickOnce, .NET, ADO, and SQL teams. Most of these applications focused on solving issues facing Microsoft's business and sales force.
One of the most interesting solutions we built for Microsoft helps keep the field updated with new product release information.
Since the sales force often works in disconnected environments, either in transit or at a customer's site, the necessity for offline ability prohibited a Web application or typical client/server solution.
The requirements to quickly search, sort, and filter through all Microsoft products for custom presentations necessitated having all the data locally, yet kept up-to-date. The added need for intense security as certain product information had to be filtered by granular roles required row-based permissions.
By using SQL Server, Web services, and a smart client distributed with ClickOnce, we achieved fantastic results in all sectors. (See Figure 1.)
Figure 1. Original architecture for the Enterprise Product Roadmap, as of .NET 2.0 Beta 1 (Click on the picture for a larger image)
The Enterprise Product Roadmap (EPR, not to be confused with ERP) was rapidly adopted by the field, scaled incredibly well, was secure, worked online/offline, and was highlighted as a Microsoft showcase.
One disadvantage was trying to create and maintain a Web service design that used multiple Web methods to select and post data to each individual table (EPR had 20+ tables). We standardized on automated Select, Insert, Update, and Delete stored procedures that were used to create a dynamic ADO.NET Dataset DiffGram and serialized to XML a Web service that had just two methods.
The select Web method queried all of the data from all of the EPR tables a user had access to and returned a server datetime that annotated when the data was queried. Subsequent use by the same user would result in a DiffGram dataset, containing XML that represented new or updated data, as well as deletes.
The post Web method took a DiffGram with pending changes. These changes were then sorted by their change type and foreign key relationships, so that they may be committed to the SQL database in the correct sequence. That way, we had Web methods that did not require changing if we modified our schema. Nevertheless, as the schema was revised, we still had to revise the data layer that these two Web methods relied upon. For each new column or table, stored procedures, dataSet schema, and the Web service's data-layer code needed to be revised to match; a process that we automated over time. This model allowed for rapid updates to the smart client's data cache by using ADO.NET to do all the hard work.
We had extraordinary success with this on many applications—that is, until we tried it with the Emerging Business Team Tracker (EBT).
With many successful implementations of this architecture behind us, we tackled EBT with great confidence. We had created automation processes to dynamically build out the data and Web service layers from our standardized stored procedures. Our security fully integrated with Active Directory with the ability to filter user data at the row level. We built robust client applications using WinForms with all the bells and whistles of highly polished applications that users raved about. Our implementations were considered Microsoft showcases and used by the field to help sell the power of Microsoft technologies. Regardless of our previous successes, EBT would soon make an important point painfully apparent. Looking back, it's clear that the majority of our projects were report and query applications, basically read-only. EBT was a horse of a different color, and it was about to remind me of an old lesson.
EBT was both a reporting and a data capture program, comprising three separate systems that we combined into a single solution. The promise of a single application providing offline access and advanced search was greatly anticipated by the users. We released the application to great fanfare, confident of the success of the solution. We were completely unaware of the horrors to come.
At first everything went very well. Most users were connected to the corporate infrastructure while using EBT and making data changes. The amount of changes in a single DiffGram posted to the Web service was very small. As things progressed, more users worked offline for longer periods of time, and soon the support issues started.
As with all of our other applications, EBT stored its data using in-memory ADO.NET datasets. This worked relatively well for small amounts of data. However, EBT's dataset of about 60+mb, when serialized to disk, caused a slow launch time, as we had to deserialize all the data into memory on load. The problems really started when we serialized it back out to disk on exit.
We utilized DiffGrams on the client in the same way as the server, to communicate pending changes back and forth. Since pending change DiffGrams were persisted on exit, if the operating system or applications caused a fault before it was able to finish, the saved changes were lost. Sometimes the crash happened after the UI closed, so the user wouldn't know there was even an issue until the next time they launched and noticed their changes missing. We tried to compensate for this by doing incremental saves while the application was running but it caused various other issues, such as lags in the performance. Regardless of how well we did it, it wouldn't handle 100 percent of the cases.
The next big issue was multiuser concurrency. As the DiffGrams were basically sequential Insert/Update/Delete actions, if any part of the steps failed when applied to the server, the entire DiffGram failed. This manifested itself when a user working offline for several days making huge amounts of changes would synchronize and one of their first changes would be rejected due to a concurrency violation because the row had been updated by someone else while they were offline.
That caused the subsequent actions to all fail back to the client. We tried to resolve the issue automatically but the client side was usually so out of sync with the server, the user's changes compounded in a series of conflicts which corrupted the local DiffGram. The user would then force a "Full Data Reset" option, deleting their cache and pulling down a new copy. Of course, this caused all their pending changes to be lost as well. We built more and more diagnostics into the calls and more and more failover retry logic, but users lost confidence that their work would really be saved.
The Worst Was Yet to Come
The applications usage and the amount of changes being applied by both the server and the client rose exponentially. The amount of Insert/Update/Delete changes that needed to be reconciled at the same time by ADO.NET started to have some all too familiar issues. You can imagine my horror when a user sent in a support issue with screen shot of a huge error dialog on EBT with the accursed "Index Corrupted". I couldn't believe that I was responsible for recreating SPAS.
For almost a year, I funded an effort to fix the flaws and make EBT work— from creating asynchronous dataset backups, to identifying several obtuse ADO.NET bugs with large scale DiffGrams, to building advanced logging diagnostics into almost every step of the Web service.
We started exploring memory maps and binary datasets. We wrote an administrative utility that could reload the various individual DiffGrams from their XML files and recreate the entire change set in memory. Then we would try and work through the dataset changes row by row sending them one by one to the Web service to find out which one actually violated a constraint, errored on a security issue, or simply did not work for some indeterminate reason. Once identified, the change(s) could be removed and then an attempt was made to resubmit the rest of the DiffGram. This would often fail as the changes were interdependent due to the complex structure of the data and security.
In the end, most attempts were futile; the hierarchical nature of the errors combined with the offline aspect of the changes made it next to impossible to diagnose the original cause of the issues without capturing the state of both the client and the server at the same time.
We later found the same issues were actually occurring with our previous applications as well, such as EPR. But since they used read-only data, when a concurrency issue occurred, a new dataset would automatically be applied over the one in error and the user was unaware that anything had happened. With EBT, the user experience was catastrophic as the new dataset would erase any pending changes they may have had. Support incidents were through the roof, and most of the time the issue was too far gone to even try to fix. Users were literally cursing us, and, worst of all, they started saving screen shots of their data edits and then using these to reenter their changes when they were lost. Not quite as bad as printing it to paper, but just as damaging to the success of the system.
The more things we fixed and workarounds we applied, the more obvious it was that we were creating custom processes to handle data to Web service synchronization, which we had chosen not to do in favor of using the out-of-the-box ADO.NET DiffGrams. We also couldn't focus on building the remaining application features required by the client, which hampered user adoption even further.
SQL Merge Replication
Searching for answers, we engaged the ADO.NET team and did extensive reviews of the architecture. By all rights we had done exactly as prescribed. The basic elements of the architecture worked for smaller implementations or those that were read-only. However, using ADO.NET with DiffGrams to communicate these changes was determined to be inadvisable due to the combination of the size and complexity of the data, the row level security requirements, and the offline model with data entry.
As we searched for new ways to resolve the issues encountered, it became apparent that we needed a client-side database engine to handle the data locally and would need to create a solid middle tier to ensure reliable communication.
We looked into SQL replication seeing that it was essentially what we were doing. Replication had always been represented to me as a great idea, but a terrible reality. We engaged with the SQL team and joined the Yukon TAP program. SQL Merge Replication to synchronize from the SQL Server to a client SQLExpress instance seemed optimal, but would it really work?
Over the course of two months, we completely removed the various tiers of data layer, data transformations, and Web services from the server and client application. We then rebuilt the EBT client to work directly against the local SQLExpress instance and it depended solely on SQL Server Replication for the data transport.
Client/Server Simplicity, Multiuser Tenancy, True Scale and Offline Ability
Success was immediate. The user's changes were saved directly to a local SQLExpress database bringing the same reliability to the client that the full version of SQL brings to the server. The reduction in multiple tiers of custom logic, and the addition of SQL's robust diagnostics and event logging, enabled us to provide quick resolution to almost every data support issue we received. As our users' confidence in the data's integrity increased so did their demands on our architecture. To our great relief, SQL has stepped up to meet every new challenge without compromise.
From EBT user on 12/23/2005:
"Sorry, but this app has little trust worthiness at this point, and wondering if I should dump everything I do to Word documents."
From same EBT user, five months later, on 5/1/2006, after the new release using SQL replication:
"So far, so good. I've done almost every combination of infrastructure usage that would typically create problems for replication or stability, and have not had one problem. Congratulations!"
We continue to find advantages with the new design. For example, replacing ADO.NET in-memory datasets with a true query engine allowed us to strip almost all the extremely complicated data processing code that was necessary to represent the various "views" of the data the customer wanted with standard SQL. Thus, we were able to leverage the DBA for client functionality and to remove a lot of hard to maintain custom code. In some cases, code size dropped by more than one-third.
As we progressed with SQL replication, securely transferring data over the intranet and Internet was a concern. One of the promises of Web services was the idea that the design could be used to communicate data over the Web. Again SQL replication provided the answer, as it came with an out-of-the box implementation of the Web replication component. By using the REPLISAPI component, schema, data and updates to both can be securely and reliably relayed over https.
We have had such great success with SQL replication that we are now providing a Software as a Service (SaaS) product line called DataPlace (see Figure 2). DataPlace exposes the tremendous power of SQL Server and replication for the end users, novice developers, professional consultants, and a development teams, without the complexities of having to set it up for themselves.
Figure 2. DataPlace architecture (Click on the picture for a larger image)
It provides a database factory to design, develop, deploy, and use the power of SQL databases dynamically in a fully hosted environment. It is the solution we now use to build all of our latest smart client applications.
This article recounts our experiences over the course of several years and how we've spent a considerable amount of time and effort trying to solve the complex problems relating to working with data over the web in a multiuser, secure, scalable, confident and repeatable way. These experiences are not uncommon among developers, and they illustrate the necessity for dependable products and the knowledge to use them. Microsoft SQL Server 2005 Replication has provided a solid answer to these complex technical problems. Now, my team and I can finally focus on what we started out to solve, our customers' business problems!
"Configuring and Maintaining Replication," MSDN Library
CyberSavvy.NET DataPlace—Information and free trial
About the author
Peter Hammond is president of CyberSavvy, a Microsoft Preferred Vendor. He served two enlistments in the U.S. Air Force, during which he became interested in programming. After duty, he gained real-world experience working in various roles from network support to development consulting at several companies including Microsoft. In 2002, Peter recruited a team of exceptional professionals focused on building showcase solutions with Microsoft's latest technologies. CyberSavvy has built more than a dozen enterprise applications, notably the Enterprise Product Roadmap, used by more than 10,000 Microsoft employees for more than five years. Peter can be contacted at firstname.lastname@example.org
This article was published in the Architecture Journal, a print and online publication produced by Microsoft. For more articles from this publication, please visit the Architecture Journal Web site.