Software Architecture in the New Economy
João P. Reginatto
Summary: This article discusses the increasing importance of reliability, availability, and scalability for enterprise software applications that automate core business processes.
It was the mid-1990s, and I was getting some experience in designing enterprise applications. Having so far worked with a reasonable number of great applications, I was quite confident with regard to my skills as a software designer (at the time, the term "software architect" was not yet being widely used).
Then the Internet came. It was a huge buzz. Companies started jumping onto the Web, one after the other. Several companies were created with a presence only in the virtual world. Many were doing business online; some were making money. I was already reading a lot about what would later be called "the New Economy," and, of course, my main focus was on the impact that this new platform would have on my job. "I have to surf that wave, too," I thought. I was counting the days until I would start playing with that new platform.
Unfortunately, I did not have as much time as I wanted to prepare myself. It was a quiet afternoon in February (usually a very calm month for business at the consulting firm where I worked at the time) when my manager came to my desk, speaking so quickly that I could hardly understand him.
"We know how to do the Internet stuff, right?" he asked.
"Well, I have already played a little bit with the main tools," I said, carefully.
"Well," he said, "We will have to make a decision, because I just got a call from Blue Yonder Airlines, and they want our help to go online."
That was my chance. I knew it was very risky to take over such a huge project without having the proper skills and experience. At the time, however, hardly anyone else had those skills. So, I accepted the challenge and said that, yes, we could help them.
Within a couple of weeks, we started having initial meetings in which we brainstormed on the important factors that were related to developing online applications for an airline company. Right at the beginning, I realized that it was a much bigger challenge than I had thought. I was especially concerned with the numbers that they were putting on paper. At the time, Blue Yonder Airlines was a regional airline with around 3,000 passengers boarding every day. They planned to expand their operations nationally that year, and to start flying to international destinations the year after that. Their growth estimates were way above market numbers—at around 40 percent, year over year. The company was approaching the Web as the platform for sustaining that expansion process in terms of information technology.
There were several high-level managers and even directors involved in the discussions, and they all had new ideas for the applications.
"From what I can see, we can use this new software platform to cut costs by offering customers direct access to buy tickets and check in," said one manager.
"That's our basic idea," responded another. "Also, we have to realize that our presence on the Web will represent our main 'store,' running 24/7 and allowing us to expand nationally in an easier way."
"Now that you mention it, could we actually ask our partner travel agencies to start using our Web systems, instead of the systems that we have today? I mean, that would give us a lot of savings and the benefit of standardization," one director stated.
I was getting apprehensive during those meetings, because the team was coming up with a lot of new and interesting ideas. The only problem was that each time someone thought of a new approach to a given business process, I realized that it would mean one more technical challenge, especially on that new platform.
After one of those days that were filled with meetings and draft designs, I went out for happy hour with my friends. We usually met every other Thursday for a couple of beers and a chat. It was usually entertaining, except when Larry started talking about his job.
Larry had been a friend of mine for a long time. We had gone to school together. He was now a competent engineer and the manager in charge of a new small factory in town that produced a particular part of a car's engine. He loved his job and talked about it a lot. Not everyone liked his stories that much; but, after all, he was a nice guy.
Scalability and System Capacity
That evening, Larry started talking about how he had managed to fulfill a huge order that the factory had received the previous week, which demanded five times the number of products that they were used to producing daily. "The secret is that resource consumption cannot grow at the same pace as production requests," he said. "If my factory has to produce five times more products today than it did yesterday, it certainly cannot consume five times more resources. That is the secret of scalability."
"Hmm, interesting," I thought. As Larry talked, I was actually still thinking about my project. But when he mentioned scalability, he somehow helped me understand some of the challenges that I faced. Suddenly, it was clear to me why this software was so hard to design: it was directly connected to the core business processes of Blue Yonder Airlines. That project was definitely not about rewriting applications for the Web platform. In fact, we were discussing how to set up a new business. And businesses must be able to scale.
"Hey, Larry, come and sit here," I told him. "Tell me how you managed to solve that sudden large order issue."
"My pleasure," he said. "It is all about resource-usage optimization."
Larry explained to me the system that his company used for preparing the production line for scalability. Basically, they had studied the production cycle over time and identified the optimal number of production cells to achieve a given maximum-production capacity.
"We know today that a skilled worker operating the WT2000 machine, which is a stage in the product line, can handle a queue with up to six work orders," he said. "More than that and the worker starts making mistakes, because of the number of parts waiting for assembly, and because of the complexity involved. So, what we do is estimate how many workers and WT2000 machines we want in parallel to achieve our maximum capacity when needed."
Larry then explained to me how the production stage before the WT2000 machine assessed the number of work orders in each cell, sending the work to the cells that had smaller queues.
"That is also a load-balancing system," I told him.
"Well, I never heard of it with that name," he answered. "But, if you say so." He did not know it at the time, but he was giving me some tips for addressing my scalability requirements.
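Larry's dispatching rule, sending each new work order to the cell with the shortest queue, translates directly into software. Here is a minimal sketch of that idea (the class and function names are invented for illustration; the six-order limit comes from Larry's description of the WT2000):

```python
from collections import deque

class Worker:
    """A processing cell with a bounded queue, like a WT2000 station."""
    def __init__(self, name, max_queue=6):
        self.name = name
        self.queue = deque()
        self.max_queue = max_queue

    def has_capacity(self):
        return len(self.queue) < self.max_queue

def dispatch(workers, request):
    """Send the request to the worker with the shortest queue."""
    candidates = [w for w in workers if w.has_capacity()]
    if not candidates:
        return None  # every queue is full; the caller must scale out
    target = min(candidates, key=lambda w: len(w.queue))
    target.queue.append(request)
    return target

workers = [Worker("cell-1"), Worker("cell-2")]
workers[0].queue.extend(["order-a", "order-b"])  # cell-1 already has 2 orders
chosen = dispatch(workers, "new-order")
print(chosen.name)  # cell-2, the shorter queue
```

The `None` return is the signal Larry's factory also acted on: when every queue is saturated, the only remaining move is to add capacity.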
At the time, the technology available for building Web applications used an approach that did not favor scalability. For each user who submitted a request, a process was spawned at the operating-system level to handle it. Now, imagine this kind of approach supporting thousands of ticket sales every day—possibly growing to twice that volume within a short period. It was certainly not an option.
What Larry told me that evening opened my mind to the fundamentals of scalability in software applications: system capacity and resource-usage optimization. It became clear to me that no matter what numbers the managers were drawing, every system has its maximum capacity—just like any business. Determining the scalability capabilities of an application is actually a matter of cost and benefit analysis.
For example, what we did at the time to address the scalability requirements was to apply Larry's ideas, creating a queue-based approach to resource usage. Using object orientation, we implemented a pool of "request handlers"—just like the set of WT2000 machines in Larry's factory. Each request handler was responsible for a queue of requests, and that queue had a limited size, which was determined through proof-of-concept tests (pretty much what Larry had also done). If the whole pool of request handlers was busy and their queues were full, our system simply started a new request handler.
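The grow-only-when-saturated rule can be sketched like this. This is a toy model, not the original implementation; the names and the six-slot limit are illustrative:

```python
class RequestHandler:
    MAX_QUEUE = 6  # limit found via proof-of-concept tests

    def __init__(self):
        self.queue = []

    def is_full(self):
        return len(self.queue) >= self.MAX_QUEUE

class HandlerPool:
    """Start a new handler only when every existing one is saturated."""
    def __init__(self):
        self.handlers = [RequestHandler()]

    def submit(self, request):
        open_handlers = [h for h in self.handlers if not h.is_full()]
        if open_handlers:
            target = min(open_handlers, key=lambda h: len(h.queue))
        else:
            target = RequestHandler()   # scale out: one new handler, not one per request
            self.handlers.append(target)
        target.queue.append(request)

pool = HandlerPool()
for i in range(13):          # 13 requests overflow two 6-slot handlers
    pool.submit(i)
print(len(pool.handlers))    # 3 handlers serve 13 requests, not 13 processes
```

The payoff is exactly the favorable curve described below: resources (handlers) grow in steps of one, while requests grow by the thousands.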
By using that approach, we already had a favorable resource-usage scenario. As the number of user requests grew, the resource consumption grew at a much smaller pace. I knew that we were on the right path. We then planned maximum-capacity assessment and hardware sizing for the near future, and presented the whole idea to Blue Yonder. They understood and approved of the solution. I remember the meeting in which we presented the solution for scalability as one of my favorites. The following weeks were much easier.
Reliability and Availability
But not for long. At some point, the discussion shifted from performance and scalability to other quality attributes of the application. The management team suddenly became concerned that their business would have to rely on software systems 100 percent. All of Blue Yonder's expansion plans were now based on a set of applications supporting ticket sales, the check-in process, and loyalty programs. Imagine the repercussions if one of those applications crashed for some reason. That would directly affect the company's financial health and image.
As I had already realized, that was the typical kind of problem for an application that directly automates core business processes. It was time for one of those chats with Larry. I wanted to know his opinion on reliability and availability.
Larry, as usual, was already at the pub when I got there. After some time, I called him over and said that I wanted to ask some questions about his factory. There was no need to say more. He came over instantly. "How do you approach reliability and availability in your production process?" I asked him.
"Well, it's funny that you ask that, because, although those are different topics, they usually come together," he answered. "And they usually have to do with scalability, also," he added.
Larry mentioned that, in his factory, they usually approached each production process as a black box to assess reliability. The number of defects, failures, or deviations that a given process presented would classify it as either reliable or not. By breaking those production processes into smaller and smaller ones, and improving the quality of each one, the overall reliability of the factory would improve as a consequence. "We use Six Sigma and other quality frameworks to increase our mean-time-between-failures (MTBF) and reduce our mean-time-to-repair (MTTR) metrics," he said.
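The two metrics Larry mentioned combine into the standard steady-state availability figure. A quick illustration, with invented numbers:

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# A system that fails every 500 hours on average and takes 2 hours to repair:
a = availability(500, 2)
print(f"{a:.4%}")  # 99.6016%
```

The formula makes Larry's point concrete: you can raise availability either by failing less often (longer MTBF) or by recovering faster (shorter MTTR), and quality frameworks attack both.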
"Wow," I thought. It was clear that I would have to study more than I already had.
"But there is one thing you can't forget," Larry mentioned. "If your business is not reliable, what else have you got?" I wanted to start a new subject; but, before that, Larry had one last lesson about reliability. "Remember, I told you this had to do with scalability," he added. "I wouldn't consider any business scalable if it's not consistently reliable over time," he said.
Larry was right. I had to ensure that, as load increased, the application would continue to be reliable; otherwise, it would be of no use in handling additional requests.
I also wanted to know Larry's opinion on availability. "Well," he said, "availability is basically about being there all the time. I run a factory that receives orders from all over the planet. I mean, my factory has to be up and running almost 100 percent of the time. How do I ensure that? Well, the answer has pretty much to do with redundancy."
Larry was right. When it comes to availability requirements, there is not much that a software architect can do beyond designing for redundancy. But Larry had one additional tip about that kind of concern. "One thing that I found very tricky when setting up our factory was how to determine the optimal investment in redundant machinery, so as not to have idle production stages just for the sake of availability. The best idea is probably to use your redundant machinery as part of your regular resources, using it to achieve better performance and to scale smoothly."
Again, Larry was right. It was pretty clear to me that Blue Yonder's software systems would have to run on top of redundant hardware to achieve the expected availability requirements. But he also made me think that the best investment would be a cluster-like structure, which would deliver availability as well as good performance and scalability. Today, this is a common approach for enterprise applications; but, at the time, it was not easy to design such a solution, from either a hardware or a software perspective.
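Redundancy pays off multiplicatively: a set of replicas is down only when all of them are down at once. A rough sketch of the arithmetic, assuming independent failures (which real clusters only approximate, since shared networks and power are common failure points):

```python
def cluster_availability(node_availability, nodes):
    """Probability that at least one of n independent nodes is up."""
    return 1 - (1 - node_availability) ** nodes

single = 0.99  # one server: roughly 3.7 days of downtime per year
print(round(cluster_availability(single, 1), 6))  # 0.99
print(round(cluster_availability(single, 2), 6))  # 0.9999
print(round(cluster_availability(single, 3), 6))  # 0.999999
```

Each added node multiplies the downtime by the single-node failure probability, which is why a modest cluster can meet availability targets that no single machine could.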
After that conversation with Larry, I studied highly available systems, network configurations, quality frameworks, and other similar topics. It all helped me wrap up a solution that could address all of the management team's expectations for the new software.
We had focused a lot on defining a software infrastructure that would allow applications to consume resources optimally, communicate between different deployment nodes, and benefit from a robust transaction-control framework. The final high-level design was proposed around three months after my manager had come to my desk and told me about the project. We suggested that Blue Yonder's applications run on an architecture based on the idea of a request-handling pool that was prepared for running in, and benefiting from, a clustered environment. We showed them our proof-of-concept results and stated that, based on the budget that they said was available for hardware, their systems would support the estimated company expansion for 2.5 years. It was another unforgettable meeting. Despite the usual problems, the project was one of the greatest successes of my career.
Today, Blue Yonder Airlines has a solid online presence, and its IT department is known for its software-architecture expertise. Its applications no longer run on top of the design that we originally defined, especially because the ideas that we implemented at the time have now become the basis for every good enterprise-application server on the market.
As for Larry, he is no longer the manager of that factory. Instead, he holds a director position at the same company and is responsible for global production-quality standards. He still loves his job and still talks a lot about it.
The thing is that, as time goes by, we are clearly building software applications that are more and more attached to our customers' core business processes—especially after the advent of the "New Economy." Sometimes, software architects find themselves helping to define business strategies, instead of just providing design guidance. In this kind of scenario, the way in which we address quality attributes such as reliability, availability, and scalability can represent strategic differentiators for a company. As Bjarne Stroustrup once said, "Our civilization runs on software." I believe that he is right, more than ever.
Take on the challenge. As a software architect, help prepare your company's business for the future. You will find it very gratifying.
Critical-Thinking Questions
· Does the last system that you designed scale?
· How would you prepare a system for scalability?
· What should you do to design a reliable system?
· How many times have you heard the excuse that "the system is out of operation"? What do you think is the problem with such a system?
· Would you take over the responsibility of designing an air-traffic-control system?
· What about a nuclear-plant control system?
· Are those the only systems that really must be reliable, available, and scalable?
Further Study
· Bass, Len, Paul Clements, and Rick Kazman. Software Architecture in Practice. Second ed. Boston, MA: Addison-Wesley Professional, 2003.
· Breyfogle, Forrest W. Implementing Six Sigma: Smarter Solutions Using Statistical Methods. Second ed. Hoboken, NJ: Wiley, 2003.
· Clements, Paul, Rick Kazman, and Mark Klein. Evaluating Software Architectures: Methods and Case Studies. Boston, MA: Addison-Wesley Professional, 2002.
Glossary
Availability—The degree to which a system is accessible when needed. By setting up an environment with redundant components, an individual component can fail while the service remains available.
Cluster—A set of computers (servers) that work together and, in many respects, can be viewed as if they were a single computer.
Load balancing—A technique that aims at spreading work among many servers (or resources in general), to have optimal resource utilization and improved computing time. It is usually based on an algorithm that determines how the balance should work.
MTBF (mean time between failures)—The average length of time that a system runs before a failure occurs.
MTTR (mean time to repair)—The average length of time needed to restore a system to operation after a failure has occurred.
Reliability—The ability of a system to operate correctly over time, ensuring integrity and consistency for all transactions.
Scalability—The ability that a system has for supporting the desired quality of service as load increases, without having to change the system. For a system to truly scale, it must be reliable.
Six Sigma—A quality-improvement methodology that originally was defined by Motorola to improve processes systematically by eliminating defects.
System capacity—Usually defined as the maximum number of processes or users that a system can handle and still maintain the quality of service.
About the author
João P. Reginatto is a software architect and lead developer who has worked with global, large-scale enterprise software applications for more than 10 years. He is interested in highly available and scalable systems. João is a fan of soccer, and he also loves teaching and traveling around the world. He can be reached at firstname.lastname@example.org.
This article was published in Skyscrapr, an online resource provided by Microsoft. To learn more about architecture and the architectural perspective, please visit skyscrapr.net.