Bill Barnes and Duke McMillin
November 2007
Revised February 2008
Summary: When they are developing systems, architects must keep operations in mind. (4 printed pages)
If history repeats itself, and the unexpected always happens, how incapable must Man be of learning from experience?
–George Bernard Shaw (1856–1950), Irish dramatist and socialist
Contents
Introduction
The Patriot Missile Incident
Conclusion
Lessons Learned
Critical-Thinking Questions
Sources
Introduction
You might find this to be a shock, but the operators of the
system that you just developed probably did not get a copy of the use cases that
you used to create your design. At worst, your use cases were constructed by
Bob in Marketing without any actual input from a customer. At best, they were
derived from extensive customer input, but still bear little resemblance to how
someone might actually operate your system. Is it your fault, as the software
architect, if failures are caused by operators doing things that you were told
they would never do?
From working with my father to watching Lt. Commander Montgomery
Scott in Star Trek V: The Final Frontier, I can
recall hearing, "How many times do I have to tell you? Use the right tool
for the right job!" As I tried to use a hammer to help remove a part from
a car—spraying shards of metal everywhere, and
making a huge mess of things—my father would
come to the rescue. Impatience gave way to taking the time to spray the stuck
part with WD-40 and wait a minute before attempting to remove it again. It
might have taken a little more time, but it took less effort to get the part
off; it was also safer, and it likely
saved money. I wondered if other types of work—oh,
for example, software development—could have
issues with using the right tool for the right job...
The Patriot Missile Incident
On February 25, 1991, during the Gulf War, an MIM-104 "Patriot" missile
battery at Dhahran, Saudi Arabia, failed to track and intercept an incoming Scud
missile. The Scud struck a U.S. Army barracks, killing 28 soldiers and injuring
roughly 100 others. After some analysis, the cause was traced to a
known bug.
It could be argued that this was actually not a bug at all,
because this "bug" was only manifested when the system was being used
in a way for which it was never intended. To use another auto-repair example: The
operators were in need of a hammer, but all that they had was a screwdriver. Well,
a screwdriver can be a passable hammer, and a mobile antiaircraft missile
system can be a passable stationary antiballistic missile system. However, the
use cases that were used to create a "screwdriver" or a mobile antiaircraft
missile system were the basis for any architectural decisions that were made during
development. So, what use case was not anticipated?
What was it like for those software architects, the people who
wrote the software that controlled such a complex system? I imagine that things
were looking good, at first. Based on their use cases, their design, and their
user docs, they had a great product—or so they
thought. Maybe they should have known better. Maybe they should have asked the
customer and the operations support group more about how they were planning to
use the system. Then again, how could anyone plan for all of the corner cases that
"could" come up? However, when a corner case becomes the standard mode of
operation, only one thing can occur: disaster!
In the case of the missile system, the Patriot tracked "time since last boot" as
a count of tenths of a second and converted that count to seconds by using
arithmetic that had only 24 bits of precision. Time, which is critical to tracking
and system accuracy, was computed from this number. The Patriot system uses a
100-millisecond time base, and 1/10 of a second cannot be represented exactly
in binary, so every tick carries a tiny truncation error. After about 8 hours of
operation, enough error accumulates (about 0.0275 seconds, which yields a
55-meter error) to degrade tracking accuracy. After 100 hours of operation, the
time error grows to about a third of a second, the equivalent
of 687 meters of targeting inaccuracy!
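The drift figures above can be reproduced with a few lines of arithmetic. The following Python sketch is an illustration only, not the Patriot's actual code. It assumes that 1/10 is truncated to 23 binary digits after the binary point (a layout that the article does not state, chosen here because it reproduces the article's numbers). It also assumes a closing speed of 2,000 meters per second, which is simply the ratio implied by the article's 55-meter and 0.0275-second figures. The names stored_tick, clock_error_seconds, and range_error_meters are hypothetical.

```python
from fractions import Fraction

TICK = Fraction(1, 10)        # the 100-millisecond time base from the article
FRACTION_BITS = 23            # assumption: binary digits kept after the point
CLOSING_SPEED_M_PER_S = 2000  # assumption: implied by 55 m / 0.0275 s


def stored_tick(bits: int = FRACTION_BITS) -> Fraction:
    """Return 1/10 truncated (chopped, not rounded) to `bits` binary digits."""
    scale = 2 ** bits
    return Fraction((TICK.numerator * scale) // TICK.denominator, scale)


def clock_error_seconds(hours: float, bits: int = FRACTION_BITS) -> float:
    """Accumulated clock error after `hours` of uptime without a reboot."""
    ticks = round(hours * 3600 * 10)          # number of 0.1-second ticks
    per_tick_error = TICK - stored_tick(bits)  # exact, thanks to Fraction
    return float(ticks * per_tick_error)


def range_error_meters(hours: float) -> float:
    """Distance that the target covers during the accumulated clock error."""
    return clock_error_seconds(hours) * CLOSING_SPEED_M_PER_S


if __name__ == "__main__":
    for hours in (8, 100):
        print(f"{hours:>3} h: {clock_error_seconds(hours):.4f} s, "
              f"{range_error_meters(hours):.0f} m")
    # prints:
    #   8 h: 0.0275 s, 55 m
    # 100 h: 0.3433 s, 687 m
```

Passing a larger value for bits shrinks the per-tick error but never removes it, because 1/10 has no finite binary expansion. Only a reboot resets the accumulated total to zero.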
It turns out that the original use case for this system was to be
mobile and to defend against aircraft that move much more slowly than ballistic
missiles. Because the system was intended to be mobile, it was expected that
the computer would be periodically rebooted. In this way, any clock-drift error
would not be propagated over extended periods and would not cause significant
errors in range calculation. Because the Patriot system was not intended to run
for extended times, it was probably never tested under those conditions, which
would explain why the problem was not discovered until the war was in progress.
The system's antiaircraft origins probably also allowed such a design flaw to go
unnoticed, because slower-moving airplanes are easier to track and, therefore,
less dependent upon a highly accurate clock value.
The system worked well, when it was used as designed; but the
customer used the system in a way that was not foreseen by the software
architect, and the result was a loss of life. The Patriot missile failure has
been a case study in how complex systems can fail in ways that nobody expected,
because of a series of seemingly unrelated events. However, it also shows that
operators of any complex system can be very "creative" and are likely
to do things that they just ought not to do. This is an extreme example of what
happens in any enterprise, every day; someone in operations is just trying to
get something done by using your system in a way that you probably did not even
know was possible.
Conclusion
So, what are you to do? You cannot design a system that works
under every conceivable use case. However, you can make a system that has very
well-defined limits of operation and fails in known (and easily understandable)
ways, when it operates outside of those limits. One way to help get there is to
include, at a very early stage, actual system operators in your design
meetings. They will, of course, help with the use cases that are supposed
to be supported, but they can also provide some interesting insight into how
the system might be used outside of those use cases. You could throw up your
hands and just say, "Don't do that"; or you could just understand the
reality of the operational environment, and try to make your system robust
enough to survive the unexpected—where "survive"
can mean failing in a known way. Your screwdriver is eventually going to be
used as a hammer.
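One way to read "well-defined limits of operation" is as an explicit guard that the system checks at run time. The following Python sketch assumes that the limit of interest is continuous uptime. The names OperatingEnvelope and OutsideOperatingLimits are hypothetical, and the 8-hour value is borrowed from the drift figures in the Patriot example purely for illustration; it is not a real specification.

```python
from dataclasses import dataclass


class OutsideOperatingLimits(RuntimeError):
    """Raised when the system is run beyond the envelope that was validated."""


@dataclass(frozen=True)
class OperatingEnvelope:
    # Hypothetical limit: the longest continuous run that testing covered.
    max_uptime_hours: float

    def check(self, uptime_hours: float) -> None:
        """Fail loudly, with an actionable message, outside the envelope."""
        if uptime_hours < 0:
            raise ValueError("uptime cannot be negative")
        if uptime_hours > self.max_uptime_hours:
            raise OutsideOperatingLimits(
                f"uptime of {uptime_hours:.1f} h exceeds the validated limit "
                f"of {self.max_uptime_hours:.1f} h; restart before continuing"
            )


if __name__ == "__main__":
    envelope = OperatingEnvelope(max_uptime_hours=8.0)  # illustrative value
    envelope.check(7.5)        # inside the envelope: returns silently
    try:
        envelope.check(100.0)  # outside the envelope: fails in a known way
    except OutsideOperatingLimits as error:
        print(f"Refusing to continue: {error}")
```

The design choice here is that the failure is deliberate and self-describing. An operator who pushes the system past its tested limits gets a clear message about what went wrong and what to do next, instead of silently degraded results.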
Lessons Learned
· Ask many people who have a variety of roles in the company to
review your use cases, to get a variety of perspectives and inputs.
· Review your plan with the operations support team, before you
start writing the production code.
· Understand and document how the system might behave under use-case
scenarios that are known not to be supported.
Critical-Thinking Questions
· How could this product be used in ways in which I never intended
it to be used?
· Under what conditions will the system fail? Consider all
conditions, not just those that show up in an official use case.
· Someone from the Patriot manufacturer must have known that the
customer had decided to use this system in a nonstandard way. How do you foster
a relationship with operations that would allow this situation to be
communicated back to the software architects?
Sources
· Hughes, David. "Tracking Software Error Likely Reason Patriot
Battery Failed to Engage Scud." Aviation Week and
Space Technology, June 10, 1991.
· Ganssle, Jack G. "Embedded Systems
Programming: Disaster!"
Embedded.com Web site. May 1998. (Accessed January 9, 2007.)
· Marshall, Eliot. "Fatal Error: How Patriot Overlooked a
Scud." Science, March 13, 1992.
· Toich, Shelley. "The Patriot Missile
Failure in Dhahran: Is Software to Blame?"
shelley.toich.net/projects Web site. February 9, 1998. (Accessed January 9, 2007.)
About the authors
Bill Barnes has been involved in WAN and enterprise network engineering and customer support since 1995. He has worked for companies such as NorthWestNet, Verio, Internap Network Services, Lexis Nexis, and Boeing.
Duke McMillin has been working in IP networking since 1995 in a variety of customer-facing network-engineering and support roles, including the management of engineering development and capacity-planning organizations.
This article was published in Skyscrapr, an online resource
provided by Microsoft. To learn more about architecture and the architectural
perspective, please visit skyscrapr.net.