Summary: This article discusses how the analysis of the transaction properties of atomicity, consistency, isolation, and durability (ACID) is crucial to integration design. (7 printed pages)
When Woodgrove Bank merged with another bank, it was clear that it would take years for the retail- and business-account applications to be merged. It would take a long time just to decide what to do—move the data from one application to the other, build a new application, or buy a banking package—let alone to implement the solution. What was needed in the interim was for the branches from Woodgrove Bank to be able to access the accounts in the other bank, and vice versa, so that a customer could do business at either branch.
A banking transaction is not just a credit or debit on an account. Data for the branch and the cashier are also updated, primarily so that the branch knows, at the end of the day, how much cash went in, how much went out, and, therefore, how much should remain. Branch staff then count the cash to check that none has gone missing. I will call the two subtransactions account update and cash update. Let's look at some alternative solutions.
Many people will look at this problem and immediately think of distributed transactions—in other words, having the two subtransactions synchronized by using the two-phase commit protocol. Figure 1 illustrates this solution.
Figure 1. Two-phase commit
I call these diagrams task/message diagrams. They are like UML sequence diagrams; but I also show the users and the main data updates. Their purpose is to analyze system issues—mainly, performance and recovery. The cash-update application in many banks is physically located in the branches; in Woodgrove Bank, however, it was located centrally in the bank's data center.
There are three practical problems with two-phase commit. Firstly, there is a performance hit. There are more network messages, and resources in the account-update application must remain locked until the commit phase. The account-update application was the more performance-critical server; so, I have put the "update cash total" before the "account update" to reduce the locking time. The other bank's staff was uneasy about using two-phase commit. But I could point to sites on which it works well; and the volumes would not be that great, as a distributed transaction would apply only when their customers went to the Woodgrove Bank branches.
The second practical problem is that a two-phase commit solution is only as reliable as its least reliable member and the network in between. Neither bank wanted to count on the other bank's IT system reliability. But, in this scenario, if the Woodgrove Bank system went down, the other bank's system was unaffected, and vice versa; there was no impact on the existing services.
Well, almost none. There is a small timing window. If the system that was running the cash-update application were to fail just before the commit, the account-update application would be left hanging, waiting for a commit message. While the application waited, it would be holding locks, and there would be a good chance that other programs would end up waiting on the same locks.
While the other bank was uneasy about the two-phase commit approach, what actually killed this solution was the third practical problem: Not all software supports two-phase commit. No interface that pretends to be a screen supports two-phase commit; and, at the time that this project was done, neither did Web services. Going down this route would mean introducing new middleware—with all of the consequential changes to the application programs.
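The two-phase commit flow discussed above, including the timing window in which a participant is left holding locks, can be sketched in a few lines. This is a minimal in-memory simulation, not real middleware: the `Participant` class, the `two_phase_commit` function, and the `coordinator_fails_before_commit` flag are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    """A hypothetical application taking part in a distributed transaction."""
    name: str
    can_prepare: bool = True
    state: str = "idle"       # idle -> prepared -> committed | aborted
    locks_held: bool = False

    def prepare(self) -> bool:
        # Working phase: do the update, keep the locks, and vote.
        if not self.can_prepare:
            self.state = "aborted"
            return False
        self.state = "prepared"
        self.locks_held = True
        return True

    def commit(self) -> None:
        self.state = "committed"
        self.locks_held = False

    def abort(self) -> None:
        self.state = "aborted"
        self.locks_held = False


def two_phase_commit(participants, coordinator_fails_before_commit=False):
    """Return 'committed', 'aborted', or 'in-doubt'."""
    prepared = []
    for p in participants:
        if p.prepare():
            prepared.append(p)
        else:
            # One working phase failed: tell everyone who prepared to abort.
            for q in prepared:
                q.abort()
            return "aborted"
    if coordinator_fails_before_commit:
        # The timing window: prepared participants wait, still holding locks.
        return "in-doubt"
    for p in participants:
        p.commit()
    return "committed"


cash = Participant("cash-update")
account = Participant("account-update")
outcome = two_phase_commit([cash, account], coordinator_fails_before_commit=True)
print(outcome, account.locks_held)
```

In the simulated failure, the account-update participant stays in the prepared state with its locks held, which is the situation that makes other programs queue up behind it.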
So, why not use message queuing? We can support that, said the other bank. We played around with some options and came up with the solution that Figure 2 illustrates. The big "D" in the box (on the message between the cash-update application and the account-update application) indicates that the message is "deferrable." It does not mean that the message will always be deferred; it just means that, if there is a glitch, and the account-update application is not ready to process the message, the message will hang in a queue until the application is ready.
Figure 2. Message queuing
In this solution, performance is no problem. Reliability needs a bit of care. Message-queuing software normally has the facility of synchronizing the write to the queue with the transaction commit. This is important here. We want to be absolutely sure that if the "update cash total" has been done, there will always be one (and only one) message sent to the account-update application that will process it. The same has to happen on the return message.
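One way to picture "synchronizing the write to the queue with the transaction commit" is to put the outgoing message in the same database transaction as the cash update. The sketch below assumes an SQLite table named `outbox` standing in for the queue; the table names, the `withdraw_and_enqueue` function, and the message format are illustrative assumptions, not the bank's design or any particular message-queuing product's API.

```python
import sqlite3


def open_cash_db():
    """Create a throwaway in-memory database with one till holding 500."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE till (id TEXT PRIMARY KEY, "
        "cash INTEGER NOT NULL CHECK (cash >= 0))"
    )
    # The outbox stands in for the message queue; msg_id prevents duplicates.
    db.execute("CREATE TABLE outbox (msg_id TEXT PRIMARY KEY, body TEXT NOT NULL)")
    db.execute("INSERT INTO till VALUES ('till-1', 500)")
    db.commit()
    return db


def withdraw_and_enqueue(db, msg_id, till_id, account, amount):
    """Update the cash total and enqueue the account-update message atomically."""
    try:
        with db:  # commits if the block succeeds, rolls back on an exception
            db.execute(
                "UPDATE till SET cash = cash - ? WHERE id = ?", (amount, till_id)
            )
            db.execute(
                "INSERT INTO outbox VALUES (?, ?)",
                (msg_id, f"debit {account} {amount}"),
            )
        return True
    except sqlite3.IntegrityError:
        # Either the till would go negative or msg_id was already queued;
        # in both cases neither the cash update nor the message survives.
        return False


db = open_cash_db()
print(withdraw_and_enqueue(db, "m1", "till-1", "acct-42", 200))
print(withdraw_and_enqueue(db, "m1", "till-1", "acct-42", 200))  # duplicate id
```

Because both writes share one commit, there is never a cash update without its message or a message without its cash update. The unique `msg_id` is one simple way to approach the "one (and only one)" requirement; the receiving application would need a similar duplicate check.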
Technically, a message-queuing solution differs from a two-phase commit solution in that it does not support the transaction characteristics of atomicity and isolation. Lack of atomicity is corrected by a reversal transaction, which is user-written undo code: if the account update fails, the cash update is reversed. Lack of isolation in the context of our example means that after the message is sent to the account-update application, another user is free to update the cash totals, and the result could be, "Oh, whoops! We've given the cash to somebody else, and my till is now empty." There must be a mechanism to stop this. The easiest one in this case is for the cashier to wait for the transaction to finish before giving out the cash.
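A reversal transaction can be sketched as follows. This is an illustration under stated assumptions: the `Till` class, the `AccountUpdateError` exception, and the `withdraw_with_reversal` function are hypothetical, and the account update is modeled as a plain function call instead of a queued message.

```python
class AccountUpdateError(Exception):
    """Raised by the (hypothetical) account-update call when it rejects a debit."""


class Till:
    def __init__(self, cash):
        self.cash = cash
        self.history = []  # (txn_id, delta) records of every cash update

    def update_cash(self, txn_id, delta):
        if self.cash + delta < 0:
            raise ValueError("not enough cash in the till")
        self.cash += delta
        self.history.append((txn_id, delta))

    def reverse(self, txn_id):
        """Reversal transaction: apply the opposite of the recorded update."""
        for recorded_id, delta in self.history:
            if recorded_id == txn_id:
                self.cash -= delta
                self.history.append((txn_id + "-reversal", -delta))
                return
        raise KeyError(txn_id)


def withdraw_with_reversal(till, txn_id, amount, account_update):
    till.update_cash(txn_id, -amount)      # cash update happens first
    try:
        account_update(txn_id, amount)     # then the account update
    except AccountUpdateError:
        till.reverse(txn_id)               # undo the cash update
        return "reversed"
    # Only now does the cashier hand over the cash, which covers isolation.
    return "pay out cash"


def rejecting_account_update(txn_id, amount):
    raise AccountUpdateError("insufficient funds")


till = Till(500)
print(withdraw_with_reversal(till, "t1", 100, rejecting_account_update), till.cash)
```

Note that the reversal is recorded as a new entry and the original entry is kept, so the till history still shows everything that happened.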
This looks like a good solution. "Oh, no, it isn't," said the IT staff of Woodgrove Bank. "We're happy with message queuing, but we're unhappy with the user end of things. The problems are:
· "Our cash-update application is not designed this way; it would need a rewrite.
· "Writing the code to receive the return message and send it back to the right workstation looks tricky.
· "What happens at the customer end, if there is a delay? How can the transaction be cancelled in midstream?"
In our application, they said, the bank teller is in charge; the teller asks the system what has happened. It looks like Figure 3.
Figure 3. The Woodgrove Bank solution
There are two key features of their proposal. Firstly, the workstation application polls the cash-update application to see what the status of the transaction is. Secondly, the cash-update application maintains a log, which keeps track of the status of the transaction. The vertical line in the drawing that connects the log-update objects indicates that it is the same log-object instance in each case.
The staff of Woodgrove Bank pointed out that they have to maintain a log in any case, because, at the end of the day, they must perform reconciliation. If anything is wrong, they want an exact record of everything that has happened at the till. Note, too, that the "update cash total" happens at the end, instead of at the beginning. This means that there is no need for a reversal. But it also means that if the account update went fine, they must ensure that it is possible to perform the "update cash total"; there must be money in the till.
Actually, there is a problem with this design. If the cash-update application fails after sending the message to the account-update application and before creating the log record, then, when it finally receives a response, it will have no log record to update. As a general principle, it is almost always best to write log or audit trails (or whatever you want to call them) before the operation that is being logged, because it is easier to work out what happened when you have a record of the event.
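The log-first ordering, together with the workstation polling for status, might look like the sketch below. It assumes an in-memory dictionary standing in for a durable log and a `send` function standing in for the message queue; `CashUpdate`, `Status`, and the method names are hypothetical and are not taken from the Woodgrove Bank application.

```python
from enum import Enum


class Status(Enum):
    PENDING = "pending"      # logged, message sent, no response yet
    FAILED = "failed"        # account update was rejected
    COMPLETE = "complete"    # account updated and cash total updated


class CashUpdate:
    def __init__(self, cash, send):
        self.cash = cash
        self.log = {}        # txn_id -> Status; stands in for a durable log
        self.send = send     # hypothetical function that queues a message

    def start(self, txn_id, account, amount):
        # Write the log record BEFORE sending, so that any later response
        # is guaranteed to find a record to update.
        self.log[txn_id] = Status.PENDING
        self.send({"txn_id": txn_id, "account": account, "amount": amount})

    def on_response(self, txn_id, ok, amount):
        if not ok:
            self.log[txn_id] = Status.FAILED
            return
        self.cash -= amount  # "update cash total" happens at the end
        self.log[txn_id] = Status.COMPLETE

    def poll(self, txn_id):
        """Called by the workstation to ask what has happened."""
        return self.log.get(txn_id)


queued = []
app = CashUpdate(cash=500, send=queued.append)
app.start("t1", "acct-42", 100)
print(app.poll("t1"))
app.on_response("t1", ok=True, amount=100)
print(app.poll("t1"), app.cash)
```

If the application crashes between the log write and the send, the log shows a pending transaction with no message, which is easier to diagnose at reconciliation than a response with no record.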
What this application also illustrates is the importance of considering the end user in all of this. For instance, what happens if someone asks to withdraw money, but then suddenly leaves the bank? It is possible with a log to reverse a transaction that has just been done, assuming that the account-update application also supports reversals.
In the case of Woodgrove Bank, the solution was implemented, and it worked well.
Units of work that are made up of many transactions are often called long transactions. Typically, a long transaction is a dialog with an end user that results in one or more databases being updated. When implementing long transactions, you must consider:
· Atomicity. If there is a failure, how is the work going to be undone? Solutions usually require reversals.
· Isolation. How do we ensure that two parallel long transactions do not step on each other's data? Solutions usually either rely on some feature of the outside world (such as the customer being dealt with by one person at one time) or on marking ownership on records in the database.
· End-user behavior. The end user might want to abort the long transaction halfway through or might disappear (for example, go to lunch, go home, or be taken ill).
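Marking ownership on records, the second isolation approach in the list above, can be sketched with a conditional update. The `account` table, its `owner` column, and the `claim` and `release` functions are assumptions made for illustration; a real system would also need a rule for stale ownership when the end user disappears, such as a timeout.

```python
import sqlite3


def open_account_db():
    """Create a throwaway in-memory database with one unowned account."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE account (id TEXT PRIMARY KEY, balance INTEGER, owner TEXT)"
    )
    db.execute("INSERT INTO account VALUES ('acct-42', 1000, NULL)")
    db.commit()
    return db


def claim(db, account_id, owner):
    """Mark the record as owned; succeeds only if nobody owns it yet."""
    with db:
        cur = db.execute(
            "UPDATE account SET owner = ? WHERE id = ? AND owner IS NULL",
            (owner, account_id),
        )
    return cur.rowcount == 1


def release(db, account_id, owner):
    """Clear the ownership mark, but only for the current owner."""
    with db:
        db.execute(
            "UPDATE account SET owner = NULL WHERE id = ? AND owner = ?",
            (account_id, owner),
        )


db = open_account_db()
print(claim(db, "acct-42", "teller-A"))  # first long transaction gets the record
print(claim(db, "acct-42", "teller-B"))  # second one is turned away
release(db, "acct-42", "teller-A")
print(claim(db, "acct-42", "teller-B"))
```

The ownership mark is ordinary committed data, so it survives across the many short transactions that make up the long transaction, which a database lock would not.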
Drawing task/message diagrams is a useful technique for discussing the issues with interested parties.
The key question to ask on every element of the task/message diagram is: What happens if it fails? Follow-up questions are:
· What is the impact of the failure on the database?
· What is the impact of the failure on the end user?
· How is the failure detected, and how are the users and application programs informed?
You should also ask: What happens if there is a delay, and the end user gets bored and goes away or tries to abort the transaction?
Any good database book will explain transactions. The following book explains long transactions in more detail:
· Britton, Chris, and Peter Bye. IT Architectures and Middleware: Strategies for Building Large, Integrated Systems. Second ed. Boston, MA: Addison-Wesley, 2004.
The chapters on resiliency and integration design are particularly relevant.
Abort—The operation to fail a transaction (see Atomicity).
ACID—An acronym for atomicity, consistency, isolation, and durability, which are the characteristics of a transaction.
Atomicity—If the transaction ends successfully, all of the work is completed (or can be guaranteed to be completed later). If the transaction fails (or is aborted), all of the work is undone. In other words, partial transactions are not allowed.
Commit—The operation for end-transaction processing for a successful transaction.
Consistency—Within a transaction, the resources are allowed to be in an inconsistent state; but, when the transaction is finished (or undone), the resources are consistent. By consistent, we mean that it obeys various constraints. These constraints can be anything that you (or the database software) want. Examples are that indexes are correctly updated, referential integrity is maintained, or all required attributes have values.
Durability—When a transaction is complete, it stays complete. This is essentially a requirement for the updates to be written to disk or other durable media.
Isolation—Updates to the database are not visible to other transactions that run at the same time, until the transaction is complete. This is to prevent other transactions looking at inconsistent data or at updated data that might later be rolled back because of a failure. Many database systems allow various weakened forms of isolation to improve the transaction performance in the situation in which you can live with some inconsistency creeping in (such as gathering internal usage statistics).
Message queuing—An asynchronous message-sending middleware.
Reversals (or Reversal transaction)—A user-written transaction that undoes the effects of an earlier transaction.
Serializable—Another way of looking at isolation. If isolation is enforced, processing transactions in parallel is logically equivalent to processing them in sequence, in the order in which the transactions were committed.
Subtransaction—A transaction that is part of an overarching distributed transaction. Often, the subtransactions are performed on different databases on different machines.
Two-phase commit—A way of coordinating subtransactions, so that the whole appears like one transaction. The two phases are the working phase and the commit phase. If all of the working phases are successful, the commit phase is executed. If one of the working phases is not successful, all of the subtransactions are told to abort.
About the author
Chris Britton does some independent architecture consultancy, but spends most of his time building the modeling tool Polyphony. Chris used to work in the architecture group at Unisys. Before that, he had a long background in database and mainframe technology.
This article was published in Skyscrapr, an online resource provided by Microsoft. To learn more about architecture and the architectural perspective, please visit skyscrapr.net.