Murphy's Law Manifests on Manic Monday
Summary: The Information Technology Infrastructure Library (ITIL) is an integrated series of best practices in IT service management that conform to the British Standards Institution standard (BS15000), with over 100,000 certified practitioners worldwide. ITIL's best practices can help plan and predict more efficient IT business, service, and resource capacities.
Miles to go before I sleep...
I nervously entered my cubicle at Datum Corporation. My first day as an IT infrastructure architect had turned nightmarish on this manic Monday morning. Murphy's Law had manifested at multiple points in Datum's aging IT Infrastructure.
Crawling computers—frustrations flying freely—slow systems—troubled tempers.
Datum, a midsized conglomerate operating in several industry verticals and geographies, was looking to me as a savior to fix its IT infrastructure, which was crumbling under the weight of rapid business growth and complexity. My predecessor, having abruptly quit the week before, had left me with neither a happy handholding nor even a hasty hand-over.
A troublesome network—slow hardware—erratic applications—still suffering from a Sunday hangover. Spirits were dropping like snowflakes on a cold winter's morning. The new marketing systems were supposed to go online last week. Customer calls and messages were flooding phones, PCs, and Blackberries as furiously as a hurricane.
A fuming CEO and irate users had confronted me. They were appalled at their IT inefficiencies. The peak holiday season was fast approaching. When would life return to normal? With a 90-day deadline to enhance and augment Datum's IT capacity, I was toying with several questions in my mind to get a grip on the gargantuan mess.
Where would I start? Where should I focus? Would I ever finish? Would I ever sleep?
The above questions troubled my unstill mind, till... till... till... It dawned on me.
Yes! The ITIL could help.
The Information Technology Infrastructure Library, or ITIL (pronounced eye-tee-eye-el by its British founders and eye-till by U.S. folks), is an integrated series of best practices in IT service management that conform to the British Standards Institution standard (BS15000), with over 100,000 certified practitioners worldwide.
Adherence to ITIL recommendations is becoming increasingly popular in the U.S. A global consumer company had shaved off over 100 million U.S. dollars in costs (10 percent of its annual IT budget) by using ITIL recommendations. A U.S. oil company had saved 6,000 person-days in their global PC consolidation project that involved 80,000 desktops. Consultants were promising to halve an organization's total cost of operations (or TCO) with ITIL.
Not very long ago, Datum had poured huge sums into a new IT setup. Being promised the moon, management naturally was dismayed at the dollar-draining IT. These drains had to be plugged rather quickly. In addition, the CEO told me, "We are planning to double sales over the next three years, and we might execute an acquisition or two. Keep that in mind, while you plan our new IT capacity."
What is capacity? Capacity is defined as a measure of the amount of work that a system can perform. What is capacity planning? One vendor describes it as a predictive process to determine future computing-hardware resources that are required to support estimated changes in workload by monitoring an existing system to spot its utilization trends.
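To make that definition concrete, here is a minimal sketch of a headroom check. It assumes a handful of hypothetical CPU-utilization readings and a 70 percent planning ceiling; neither comes from Datum's data, and the ceiling is an illustrative choice rather than an ITIL rule. It also assumes that utilization scales linearly with workload, which real systems only approximate.

```python
def growth_headroom(samples, ceiling=0.70):
    """Return the fractional workload growth a system can absorb.

    samples: utilization readings between 0 and 1 (hypothetical data).
    ceiling: planning threshold; 0.70 is an assumed value, not an ITIL rule.
    Assumes utilization scales linearly with workload.
    """
    if not samples:
        raise ValueError("need at least one utilization sample")
    peak = max(samples)
    if peak <= 0:
        raise ValueError("peak utilization must be positive")
    # A negative result means the peak already exceeds the ceiling.
    return ceiling / peak - 1.0


# Hypothetical hourly CPU readings for one server on a busy Monday.
monday = [0.22, 0.35, 0.50, 0.56, 0.48, 0.31]
print(f"Headroom: {growth_headroom(monday):.0%}")
```

With these made-up readings, the peak of 56 percent leaves room for roughly a quarter more workload before the assumed ceiling is reached.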
ITIL's service management identifies two broad areas: service support and service delivery. Within the latter, Capacity Management looks at management of business, service, and resource capacities as separate subprocesses. The creation and ongoing maintenance of the capacity plan is a major deliverable of this process. So, this plan was going to be our first milestone. Where would I start? ITIL's Capacity Management processes gave me four broad steps on which to work.
Step 1: Monitoring Performance and Throughput of Datum's Existing IT Infrastructure
We dissected our existing IT computing and networking infrastructure to understand and monitor performance and throughput of all of our IT services in a staged manner.
First, we considered the individual computers, along with their immediately attached peripherals and devices. Second, we looked at our internal physical network (within the walled boundaries of Datum's head office). Third, we considered our entire Enterprise-Wide Logical Network—our network's interfaces with all other geographic locations that connect Datum to the trusted entities outside itself.
Finally, we looked at the remaining physical and logical connection points of Datum's IT infrastructure to the rest of the world (including the distrusted Web).
Within each of the aforementioned computing and communication machines and devices, for all software assets—operating systems, scripts, databases, programming languages, standard software, productivity applications, and custom-built applications—we identified the versions, licenses, and dependencies, and we documented them by using ITIL templates.
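An inventory of this kind can be pictured as a small data structure. The sketch below is not an ITIL template; the `SoftwareAsset` fields, the `dependents_of` helper, and every entry in the sample inventory are hypothetical names chosen for illustration. It shows how recording dependencies lets you ask which assets are affected when one component changes.

```python
from dataclasses import dataclass, field


@dataclass
class SoftwareAsset:
    name: str
    version: str
    license: str
    host: str
    depends_on: list = field(default_factory=list)  # names of other assets


def dependents_of(assets, target):
    """Return names of assets that depend on target, directly or indirectly."""
    found = set()
    frontier = [target]
    while frontier:
        current = frontier.pop()
        for asset in assets:
            if current in asset.depends_on and asset.name not in found:
                found.add(asset.name)
                frontier.append(asset.name)
    return sorted(found)


# Hypothetical inventory entries; names and versions are illustrative only.
inventory = [
    SoftwareAsset("rdbms", "9.2", "per-CPU", "db01"),
    SoftwareAsset("billing", "3.1", "site", "app01", ["rdbms"]),
    SoftwareAsset("monthly-report", "1.4", "in-house", "app01", ["billing"]),
]
print(dependents_of(inventory, "rdbms"))  # ['billing', 'monthly-report']
```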
Step 2: Tuning Existing Resources to Make the Most Efficient Use of Them
Having documented our existing IT infrastructure in the first step, the next step was to enhance the same efficiently and productively. We spoke to our enterprise IT vendors about our studies and requirements. They were pleased with our systematic approach. It triggered a collaborative partnership with them. They willingly shared with us their internal proprietary tool sets, capacity-planning guidelines, and actual performance-metric extraction techniques for each of the IT assets that they supplied.
We realized that, in some places, the hardware had been configured incorrectly. We critically reexamined our CPU, memory, storage, and network infrastructure choices. Mismatched portions were identified. In a few cases, application changes and upgrades had resulted in an incorrect configuration, which led to subsequent performance degradation.
In one application, the number of users had increased dramatically, which resulted in peak-time system overloads. In a billing system, a monthly report was consuming over 14 hours to execute, in spite of running on the fastest server. When I scanned the application's source code, the algorithm seemed messed up. Our vendor reluctantly agreed to rewrite the code. Presto! The report time was slashed drastically to just a couple of hours. The users were overjoyed. Such moments of joy and improvements helped us gradually climb out of Datum's IT-problem pit.
For the first time in many weeks, I was beginning to feel confident.
Step 3: Understanding Current Demands and Future Forecasts for IT Resources
Having optimized the performance and throughput in Datum's current IT setup, we now had to anticipate the business demand, drivers, types, quantities, and timing of critical IT resource capacities that were necessary for meeting future forecasted workloads. User interactions gave us the business-application requirements for multiuser RDBMS, application, Web, and file/print servers; computer-aided drafting/manufacturing/engineering (CAD/CAM/CAE) workstations; and stand-alone computers and network devices.
We also looked at the application workload type: online transaction processing (OLTP), interactive, enterprise resource planning (ERP), customer-relations management (CRM), business-intelligence (BI) analytics, decision-support services (DSS), batch, Web access, e-mail, file/print services, workgroup activities, as well as their different types of user interfaces. We also estimated the number of users, their task profile/work rate, and the average time expended both during normal and peak workload periods through detailed studies of their usage. We detailed master entities, transaction-document outputs (queries and reports), and other system operations. We also simulated a few critical applications under various business scenarios, to arrive at a predictive performance analysis.
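As a rough illustration of how user counts and work rates translate into load, the sketch below applies the utilization law (utilization equals arrival rate times service time per task). The user count, task rates, and service time are hypothetical figures, not Datum's measurements, and the function name is an assumption made for this example.

```python
def server_utilization(users, tasks_per_user_per_hour, service_seconds):
    """Utilization law: U = arrival rate x service time per task."""
    arrivals_per_second = users * tasks_per_user_per_hour / 3600.0
    return arrivals_per_second * service_seconds


# Hypothetical figures for one OLTP application (not Datum's measurements).
normal = server_utilization(users=200, tasks_per_user_per_hour=30, service_seconds=0.25)
peak = server_utilization(users=200, tasks_per_user_per_hour=90, service_seconds=0.25)
print(f"normal: {normal:.0%}, peak: {peak:.0%}")
```

A result above 100 percent, as the peak figure is here, means that the server cannot keep up with demand during that period. That is the kind of peak-time overload that a normal-period average alone would hide.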
The projected annual business growth in the years ahead was extrapolated to arrive at the likely gaps in computing resources after scientifically de-risking Mr. Murphy's likely impact. Trending, linear regression, mathematical queuing, and modeling techniques were used to plan the expected annual-growth momentum. This gave us a solid foundational perspective for our future IT infrastructure capacity.
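Two of these techniques can be sketched in a few lines. The example below is a simplified illustration, not the model we built at Datum: it fits a least-squares trend line to hypothetical monthly peak-utilization figures, estimates when that line reaches an assumed 70 percent ceiling, and uses the textbook M/M/1 queuing formula to show how response time grows as utilization rises. All function names and numbers are assumptions made for this example.

```python
def fit_line(ys):
    """Ordinary least-squares fit of y = slope * x + intercept for x = 0, 1, 2, ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def months_until(ys, ceiling):
    """Months after the last observation until the trend line reaches the ceiling."""
    slope, intercept = fit_line(ys)
    if slope <= 0:
        return None  # no upward trend, so the line never reaches the ceiling
    crossing = (ceiling - intercept) / slope
    return max(0.0, crossing - (len(ys) - 1))


def mm1_response_time(service_seconds, utilization):
    """M/M/1 queue: mean response time R = S / (1 - U), valid only for U < 1."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_seconds / (1.0 - utilization)


# Hypothetical monthly peak utilization for one server.
history = [0.40, 0.42, 0.44, 0.46, 0.48, 0.50]
print(months_until(history, ceiling=0.70))
print(mm1_response_time(0.25, 0.50), mm1_response_time(0.25, 0.80))
```

With this made-up history, the trend reaches the ceiling about ten months out. The queuing formula shows why a ceiling well below 100 percent is prudent: the same 0.25-second task takes about 0.5 seconds at 50 percent utilization but about 1.25 seconds at 80 percent.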
Step 4: Preparing a Capacity Plan to Meet Quality Levels of Services, as per Defined Service-Level Agreements
Voilà! We were approaching our final goal. We had done our homework with all of the resources on hand to create a capacity plan that would enable us to plan our future IT resource capacities and give us increased efficiencies through cost savings, as well as reduce our risks considerably.
Using ITIL templates, we meticulously prepared the plan, along with financial budgets. With all of this groundwork in place, we confidently scheduled our meeting with our CEO to present our findings.
At the meeting, the air-conditioner was on full blast (perhaps to freeze our frayed nerves). There was a bespectacled gentleman sitting next to my CEO; he was introduced to me as Professor Kelly. "Call me Bob," he said. (Did that sound ominous or what?)
We started our presentation. We highlighted our reasons for selecting ITIL; presented our detailed findings regarding our existing IT infrastructure, and the causes of our previous problems (Isn't hindsight a great teacher?); and expressed solid optimism for our future IT capacity-management and capacity-planning efforts.
Bob suddenly drew me back to the present moment. He politely asked me, "Are you aware of the major airline-scheduling problem during Christmas 2004?" I muttered something. (I am sure you would have surfed the Web for the same. You will get the answers.) "Have you looked at Datum's future IT capacities to avoid a similar kind of disaster?"
From the arsenal of papers in my hidden-from-my-boss-until-specifically-asked-for folder, I triumphantly spoke of my research on scalability. I mentioned the various definitions and metrics of scalability, failure causes, and trade-offs. I radiated hope at gradually improving our future capacity-management sophistication levels.
Both my CEO and Bob appeared satisfied. I got the go-ahead for our plans. Datum's planned IT infrastructure expansion would (I hoped) propel my career into a distant orbit. Folks, time for a party!
Did Robert Frost have IT and IT infrastructure architects in mind? It seems that he must have. IT never sleeps. However, there is one way to deal with the madhouse of stress. ITIL's best practices can help plan and predict more efficient IT business, service, and resource capacities.
If you feel like a short-order cook when coming up with plans, think about using ITIL. ITIL can help cook delicious IT meals for your perpetually hungry business users, 24 hours a day and 365 days a year—not just on maddeningly manic Mondays. Do not let the stress get to you. ITIL can help you to go home and sleep in peace.
- Is IT infrastructure capacity planning an art or a science?
- How do the various technology "laws" (Amdahl, Little, Moore, Metcalfe, Gilder, and Coase) influence IT infrastructure capacity planning?
- Barbacci, Mario, Mark H. Klein, Thomas H. Longstaff, and Charles B. Weinstock. Quality Attributes (CMU/SEI-95-TR-021). Pittsburgh, PA: Software Engineering Institute, Carnegie Mellon University, 1995.
- Computer Measurement Group (See paper titled "Six Levels of Sophistication for Capacity Management," by George Thompson, IBM Global Services.)
- IBM eServer pSeries Sizing and Capacity Planning: A Practical Guide
- ISO 20000 (BS15000/BS 15000) ITSM Standard
- Internet-CPG (Search "IT Capacity Planning Guidelines" for links to different IT vendors' products.)
- ITIL Glossary
- ITIL Survival
- ITIL Web site
- Weinstock, Charles B., and John B. Goodenough. On System Scalability (CMU/SEI-2006-TN-012). Pittsburgh, PA: Software Engineering Institute, Carnegie Mellon University, 2006.
- Amdahl's law
- Article on capacity planning
- Brief overview of four technology laws (Moore, Metcalfe, Gilder, and Coase)
- IT World article
- Little's Law/Guerrilla Capacity Planning (GCaP) article
- Various links on computer capacity planning
Capacity planning—(See http://en.wikipedia.org/wiki/Capacity_planning for a broader definition, not just within the IT context.)
Scalability—(See http://en.wikipedia.org/wiki/Scalable, http://www.webopedia.com/TERM/s/scalable.html, and http://computing-dictionary.thefreedictionary.com/scalable for various definitions.)
About the author
Mahesh Khatri is the founder and Software and Consulting Director for Kaytek Computer Services Private Limited, a two-decades-young IT systems-integration company in Mumbai, India. His interests are reading, writing, walking, and yoga.
This article was published in Skyscrapr, an online resource provided by Microsoft. To learn more about architecture and the architectural perspective, please visit skyscrapr.net.