A Knowledge Framework for Enterprise Application Systems Management

Nagaraju Pappu
Satish Sukumar
November 2007

Summary: Management of enterprise application systems is here to stay. IT teams have to quickly embrace newer techniques that will enable such systems to be managed effectively. Central to these techniques is capturing, building, and making available libraries of reusable knowledge about the systems they manage. The framework of knowledge areas presented in this document is a starting point for IT teams to organize these libraries. IT teams should extend this framework to include areas such as security, disaster prevention and recovery, capacity planning, and service-specific reporting. Creating a body of knowledge that can be used for both manual and automated application system management, and providing the tools that allow IT teams to contribute to and collaboratively enhance this body of knowledge, are the keys to taming the complexity of enterprise application system management.

Contents

Introduction
Infrastructure Management vs. Service-Level Management
Complexity of Managing Application Ecosystems
IT’s Challenge
Organization of the Body of Knowledge
The Shape and Form of the Body of Knowledge
About the Authors

Introduction

Current-generation application systems are complex ecosystems—the various applications within the ecosystem have many interdependencies and are generally integrated at a platform level; they are not a collection of independent applications using application-level integration schemes. There are strong interdependencies between business systems, IT systems, software systems, platforms, and IT infrastructure. As a result, holistic management of the entire ecosystem is required to achieve consistently high quality of service—these systems cannot be managed as independent silos.

The complexity of these systems implies that IT teams tasked with their management need to have specialized interdisciplinary knowledge and techniques beyond what has been traditionally used to manage individual elements of infrastructure.

This paper introduces a framework to categorize the knowledge areas required to manage complex, mission-critical application systems, and presents ten different areas of knowledge.

The information in this article is useful to CIOs and operational managers engaged in the management of enterprise application systems.

Infrastructure Management vs. Service-Level Management

As enterprises have become increasingly dependent on services provided by software applications to function, managing the quality of these services in terms of parameters such as uptime, performance, turnaround time, security, and consistency has become ever more important. The business criticality of services is such that any disruption is no longer treated as just an IT problem; service disruptions are immediately brought to the attention of the CXOs of the enterprise. The CIO's mantra today is to reduce planned downtime for business-critical services to zero—unplanned downtime or service disruption is a cardinal sin.

IT management teams today have been forced to rethink their methods of managing and operating IT infrastructure. For a long time, IT managers delivered against infrastructure-level SLAs: if server XYZ had 99.9% availability, or if network segment DEF had an average latency of less than 30 ms, all was well. The standards for management, the operational processes, and the expertise built within IT management teams were all geared toward the management of individual components of infrastructure.

However, IT management teams today increasingly have to deal with SLAs at the level of application services and sometimes even business services. Today’s SLA statement is along the lines of “the hospital admission workflow process should have less than 1 hour of downtime a month” or “the average time taken to automatically adjudicate an insurance claim of a specific type from a specific provider should be no more than 20 seconds.” There is an enormous difference between the availability of an admissions process and the availability of a server. It is the difference between managing an application system and managing a piece of infrastructure that, in the context of an application system, is merely one of several parts. IT management teams are struggling to cope with this difference using their existing infrastructure management techniques.

Complexity of Managing Application Ecosystems

Twenty years ago, the applications that enabled business processes ran on mainframes or on client-server systems. From an IT manager’s perspective, the number of moving parts was small, and the interconnections between individual applications were limited and most often based on one-to-one data exchange mechanisms. Management of these systems could be performed on an individual basis. Recently, this scenario has changed significantly.

Today’s application systems are fundamentally different and significantly more complex. On one hand, it is fairly common for applications of today to have thousands of objects. On the other hand, applications seldom function in silos; instead, they are integrated using a variety of application and process integration mechanisms. Application integration architecture focuses on the creation of enterprise-level platforms or integrated fabrics that subsume individual applications and the infrastructure they run on. Instead of individual applications, one has an integrated platform that exposes a set of “business level services” used to compose business process workflows. This translates into a large system that has many moving parts and a complex web of dependencies between these parts. It is no longer feasible to manage individual applications without considering the connections between applications.

These integrated platforms operate within environments that greatly add to this complexity. For example, the environment almost always includes multiple standards, such as Basel II or ITIL, that the business processes, the applications, and the teams that manage them must adhere to. There is a large variety of direct users who interact with the applications, including stakeholders such as customers or suppliers and business managers; any outage or degradation of performance directly impacts them and, hence, the business. Outsourcing of business operations and IT systems management, global accessibility requirements, virtualization, and the need to scale on demand contribute their share to this complexity. The word “ecosystem,” which brings to mind organic complexity and the notion of several entities coexisting and interacting with one another, is singularly appropriate to describe the application system environments of today.

 

Application ecosystem

The complex web of closely integrated software applications, the infrastructure they operate on, and the processes and standards that govern their design and operation. Application ecosystems provide a set of “business level services” whose quality must be managed.

IT’s Challenge

To meet the service quality expectations, IT management teams must manage application ecosystems. They must be aware of the place and significance of entities such as servers, databases, applications, and network devices within the context of the ecosystem and the services it provides. IT teams need to transform and then apply the techniques and expertise gained from managing infrastructure to application ecosystem management.

It is also evident to IT management teams that manual management tasks no longer suffice; they must be replaced with automation—specifically in areas such as monitoring the health of the ecosystem and reacting rapidly to any events that seem to threaten this health.

However, in practice, IT teams are still mostly uncomfortable with managing application ecosystems in either an automated or a manual mode. This scenario persists in spite of the vast array of systems management tools available today.

A key reason for this discomfort is that most IT management teams have not been able to build and then use adequate knowledge about their application ecosystems, the kind of knowledge that enables automation and reduces dependency on individual people. This knowledge exists, but it is often locked up with the few experts who have interacted with the system during its life cycle from construction to operation. Making it available to both IT management staff and the applications that automate management tasks, as a set of libraries of reusable knowledge, is the first step toward holistic and automated system management.

Extracting and organizing this knowledge is not a trivial task. It spans a large number of areas and activities and needs to be consolidated from pieces of information captured from different perspectives. The rest of this document presents a framework to organize this information into a coherent and usable body of knowledge.

Organization of the Body of Knowledge

The following diagram illustrates the primary categories or knowledge areas we use:

[Figure: The five primary categories and ten subcategories of the knowledge framework]
We use a model with 5 primary categories and 10 subcategories. The 5 primary categories include knowledge that describes the following:

  • The topology of the system. This includes a complete description of the construction of the system in terms of its parts, the relationships and organization of these parts, the different versions of the parts, the interdependencies of the parts, and the configuration of the parts. Topology information in itself is created using a top-down modeling mechanism that is described later in this document.
  • The behavior of the system. This includes a description of the various application activities and usage patterns one could expect to see over a period of time. It is information that describes the “known” state of the system. This information serves as an index or benchmark of normal behavior.
  • Operational tasks. This includes a description of the various activities related to a system from installation of the system to deploying patches and releases or making changes to the system to routine administration tasks, such as performing data backups or rotating log files on a periodic basis.
  • Managing the health of the system. This includes knowledge that enables one to observe the different states of a system and then interpret these states in terms of the health of the system. The word “health” here is used as a metaphor to indicate whether the system is functioning as desired. If the system deteriorates, it is considered to be becoming unhealthy, a condition that could lead to a disruption of the committed service.
  • The environment the system operates in. This includes knowledge about the environment and its expectations, both in terms of the policies and rules that apply to the system and in terms of the SLA expectations placed on it.

These categories and subcategories are a means to organize the body of knowledge. The information within them is not kept in isolated compartments; it is interrelated and organized with the topology as the fulcrum.

The implementation of such a body of knowledge requires dealing with several types of data, from well-structured data in databases to free-form text that captures the subjective experience of application management teams. This data must then be made accessible through multidimensional queries. For example, users should be able to extract a configuration build file for a part of the system, or study the impact of a surge in utilization within one part of the system on the overall health of the system.
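As a purely illustrative sketch (the class names, fields, and query methods below are assumptions, not part of the framework), the following Python fragment shows what two such multidimensional queries might look like against a small in-memory knowledge base: extracting the configuration of one component, and tracing which services a surge in that component could affect.

```python
# Minimal sketch of multidimensional queries against a body of knowledge.
# All class and field names here are illustrative assumptions, not part of
# the framework described in this article.

from dataclasses import dataclass, field


@dataclass
class Component:
    name: str
    kind: str                      # e.g. "server", "database", "application"
    config: dict = field(default_factory=dict)
    depends_on: list = field(default_factory=list)


class KnowledgeBase:
    def __init__(self):
        self.components = {}

    def add(self, component):
        self.components[component.name] = component

    def build_file(self, name):
        """Extract the configuration for one part of the system (query 1)."""
        c = self.components[name]
        return {"component": c.name, "kind": c.kind, "config": c.config}

    def impacted_by(self, name):
        """Walk dependencies to see what a utilization surge in 'name'
        could affect (query 2)."""
        impacted, frontier = set(), [name]
        while frontier:
            current = frontier.pop()
            for c in self.components.values():
                if current in c.depends_on and c.name not in impacted:
                    impacted.add(c.name)
                    frontier.append(c.name)
        return sorted(impacted)


kb = KnowledgeBase()
kb.add(Component("db01", "database", {"max_connections": 200}))
kb.add(Component("claims-app", "application", depends_on=["db01"]))
kb.add(Component("claims-workflow", "business-service", depends_on=["claims-app"]))

print(kb.build_file("db01"))
print(kb.impacted_by("db01"))   # ['claims-app', 'claims-workflow']
```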

The body of knowledge is used across the spectrum of operational management tasks by both human operators and software systems. The following diagram illustrates this using ITIL as a reference:

[Figure: Use of the body of knowledge across ITIL operational management processes]

The Topology of the System

The topology is a model—a representation of the system’s design created from the perspective of managing the system. The topology brings out the construction of a system in terms of its important parts, the relationships and dependencies between these parts, and their structure and organization within the context of the system.

The information within the topology is the fulcrum around which the entire body of knowledge is organized. Therefore, it is vital to get the topology right. However, this is easier said than done for the following reasons:

  • The system that is being modeled is seldom static—it is dynamic. The parts of the system and, even more importantly, the relationship between the parts changes. The topology must reflect this dynamism.
  • While there are many tools to diagrammatically represent topologies, and even languages such as UML or, more significantly, SDM to describe and share topology models, it is difficult to find information that describes a technique for modeling an application ecosystem. Terms such as top-down modeling, object-oriented models, and bottom-up models are used, but none of them truly describe how to model. This lack of guidance leads to topologies that are developed in an ad hoc manner and leave out vital elements, or to what we call the “scenario of the eternal model,” where the model keeps getting refined and detailed and can never be considered complete.

The method we use for modeling enterprise application systems uses a four-dimensional model, as illustrated below:

[Figure: The four-dimensional topology model]
The method is top-down, with each dimension acting as a realization of elements from its higher dimension. Each dimension has its own set of objects that are relevant to that dimension. Within each dimension, we model the reliability, availability, supportability, and performance characteristics relevant to the objects at that dimension.

Dimension 1 is the dimension of the business processes. Objects that are modeled here are the workflows that realize the business processes and the stages and participants within these workflows. For example, if the application system we were modeling enabled the processing of claims, the claims adjudication workflow would be modeled with stages such as sorting, coding, automated adjudication, manual adjudication, and payment processing. Availability at this dimension is impacted by the availability of the participants at each stage; performance is impacted, for example, by the latency at each stage or by the number of reruns that occur because of errors at a particular stage of the workflow.

Dimension 2 is the dimension of information processors and the flow of information. Information processors in real-world applications often include software-based systems as well as human beings. The objects modeled here are information structures and information flows—the sequence of transformations of information within a stage of the workflow. Reliability at this dimension is in terms of the correctness and completeness of information and its processing. In our claims adjudication example, the objects modeled in this dimension include the information within the claim; supporting information such as business rules, lookup data, and references required to process the claim; and the rule and transformation engines that manipulate and transform this information.

Dimension 3 is the dimension of software systems that realize the information processors. This is the dimension of application function A, database query B, and the sequence in which these are called as part of an information processor. Performance at this level is in terms of number of requests processed and reliability is in terms of the number of software errors seen and their distribution.

Dimension 4 is the physical dimension of the actual computing and network resources used in the execution of the software. Objects within dimension 4 are software executables, operating system resources, network bandwidth, and so on. For example, at this level, we say that Database B runs on a specific instance of an RDBMS; that instance, in turn, runs on a 4-node cluster running operating system Y deployed over a fiber-based backbone; and each of these nodes is a quad-CPU server with 2 GB of RAM and network-attached storage.
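To make the four-dimensional model more concrete, here is a minimal Python sketch, using assumed object names from the claims example, of how objects in each dimension could be linked to the objects that realize them one dimension below and then resolved down to their physical footprint. It is a sketch of the idea, not a prescribed schema.

```python
# A minimal sketch (assumed names, not a prescribed schema) of the
# four-dimensional topology model: each object is linked to the objects
# that realize it one dimension below.

from dataclasses import dataclass, field
from typing import List


@dataclass
class TopologyObject:
    name: str
    dimension: int                                    # 1 = business process ... 4 = physical
    realized_by: List["TopologyObject"] = field(default_factory=list)


def physical_footprint(obj):
    """Recursively resolve a higher-dimension object down to its dimension-4 resources."""
    if obj.dimension == 4:
        return [obj.name]
    names = []
    for child in obj.realized_by:
        names.extend(physical_footprint(child))
    return names


# Claims adjudication example from the article, heavily abbreviated.
cluster = TopologyObject("4-node RDBMS cluster", 4)
rules_engine_host = TopologyObject("rules engine server", 4)
adjudication_query = TopologyObject("adjudication query B", 3, [cluster])
rules_engine = TopologyObject("rules engine", 2, [rules_engine_host])
claim_flow = TopologyObject("claim information flow", 2, [adjudication_query])
auto_adjudication = TopologyObject("automated adjudication stage", 1,
                                   [rules_engine, claim_flow])

print(physical_footprint(auto_adjudication))
# ['rules engine server', '4-node RDBMS cluster']
```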

Whatever model of topology definition one may use, the purpose of the topology is to identify the relevant objects within the application system, to model the relationships between them, and to model how these objects and their relationships impact the reliability, availability, supportability, and performance characteristics of the system as a whole. This information is especially critical to:

  • Diagnose root causes of service quality incidents.
  • Make a prognosis about the impact that an incident within the ecosystem has on the service quality.
  • Evaluate the impact of change in any part of the system.
  • Plan deployment, installation, and configuration exercises.

Topology information is built from several sources, including vendor-provided information. Transaction monitoring tools, such as Introscope from CA or Identify from BMC, provide information about the details of usage of application and physical resources in response to a specific software transaction request. Dependency analysis and discovery tools, such as Relicore or BMC Discovery, also provide information relevant to building topologies for common enterprise application systems.

Configuration Information

The model of the system’s topology results in a manifest of the components of the system and the relationships between them. Configuration knowledge consists of detailed information (metadata) about each component. This metadata is primarily of two types:

  • Metadata that describes the components of the manifest and their configuration, such as:
    • Version information and history of all software and hardware used in the application.
    • Temporal information that describes deployment dates, support windows, warranty periods, maintenance windows and schedules, and end-of-life dates.
    • Parametric information that describes configuration parameters that must be set for individual components.
    • Support information related to each component, such as Help, known errors, and deployment instructions.
  • Metadata that is used to discover, build, and manage configuration information. This metadata includes information about the structure and means to access software configuration information in registries within the application, techniques of discovery specific to the application, locations of configuration files, and so on. For example, if an application provides a MIB, one would store the MIB entry points, community strings, and the protocol-specific information as part of this repository.

Both elements of metadata have an important role to play. Precise and accurate configuration information is necessary for planning any changes to the application system, such as the deployment of patches or upgrades to parts of the system. Likewise, information about how to access configuration data from within an application system is essential to enable tool-driven configuration management and ensure that system changes are discovered by tools.
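The following fragment is an illustrative sketch, with assumed field names and values, of how the two kinds of configuration metadata described above might be recorded side by side so that both people and tools can consume them.

```python
# Illustrative sketch of the two kinds of configuration metadata described
# above, expressed as plain dictionaries; the field names and values are
# assumptions, not a prescribed schema.

component_metadata = {
    "component": "claims-db",
    "version": {"current": "10.2.0.3", "history": ["10.2.0.1", "10.2.0.2"]},
    "temporal": {
        "deployed_on": "2007-06-12",
        "maintenance_window": "Sun 02:00-04:00",
        "end_of_life": "2011-12-31",
    },
    "parameters": {"max_connections": 200, "shared_pool_mb": 512},
    "support": {"known_errors": ["memory errors under heavy load"],
                "deploy_doc": "install-guide-v10.pdf"},
}

discovery_metadata = {
    "component": "claims-db",
    "snmp": {"mib_entry_point": "1.3.6.1.4.1.99999", "community": "public"},
    "config_files": ["/opt/dbms/claims/config/init.ora"],
    "discovery_technique": "poll the listener status, then read instance parameters",
}


def validate(component_md):
    """A tool-driven check: flag records that lack the fields a change
    planner would need."""
    required = ("version", "temporal", "parameters")
    return [k for k in required if k not in component_md]


print(validate(component_metadata))   # [] -> record is usable for change planning
```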

The impact of incorrect configuration information is severe. Configuration mismatches in the system cause broken dependencies and make it hard to reproduce/diagnose bugs. A suspect configuration repository is a bomb waiting to explode. If the information inside the repository cannot be validated to be accurate, rebuilding the repository is often better than trying to patch it.

Application Activity

Detailed information that maps application activity and provides an explanation of application behavior at any point of time is critical to enable incident detection and diagnosis. Application activity is described in terms of several elements, including:

  • The patterns of application activity. This is the sequence of events that happen in response to user requests.
  • Temporal conditions, such as hours of operation.
  • Post-conditions and pre-conditions that are required for specific activities; for example, “all user activity must pause before a balancing of ledgers can take place.”
  • The known impact the activity has on the system described in terms of load on the system or any degradation of response that can be expected.
  • Scheduled and background tasks, such as audit or logging activity, asynchronous messaging, or end-of-day operation.
  • Activities triggered by other applications not in response to user requests, such as the production of a specific data extract for application interoperability.
  • Administrative tasks and activities, such as data backup, user management, pruning, and log rotation; and security-specific tasks, such as changing passwords.
  • Activities specific to the reconfiguration of the application system, such as adding or removing modules or shutting down portions of the system for upgrades.

Application activity provides a picture of the expected behavior of an application that is critical to any system of proactive management. The knowledge of application activity is built over a period of time, primarily by using measurement information driven by data from application monitoring. This knowledge is also augmented by benchmark-related information and testing-related information provided by vendors.
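As a hedged illustration of how a single expected activity and its pre-conditions might be captured in a machine-usable form (the activity, fields, and values below are assumptions), consider:

```python
# A sketch (assumed structure) of how one expected activity pattern and its
# pre-conditions could be recorded so that both operators and tools can use it.

ledger_balancing = {
    "activity": "ledger balancing",
    "schedule": "daily 23:30",
    "pre_conditions": ["all user activity paused"],
    "post_conditions": ["ledger totals reconciled"],
    "expected_events": ["pause-users", "extract-postings", "reconcile", "resume-users"],
    "expected_impact": {"db_cpu_percent": 60, "duration_minutes": 20},
}


def check_pre_conditions(activity, system_state):
    """Return the pre-conditions that are not currently satisfied."""
    return [c for c in activity["pre_conditions"] if not system_state.get(c, False)]


state = {"all user activity paused": False}
print(check_pre_conditions(ledger_balancing, state))
# ['all user activity paused'] -> the activity should not start yet
```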

Usage Patterns

Usage patterns are closely related to application activity. Usage patterns characterize the different kinds of users and their priorities, and they paint a picture of how and when users use the application system. Usage pattern information in itself does not describe how the application reacts to the user demands, but it is key to explaining application behavior.

There are several types of users of an application system; each type has its own usage patterns. For example, administrative users usually use the system in a sporadic fashion, while users who generate reports normally do so at fixed time periods. The knowledge of what users do, the load they put on the system, and their periods of usage is critical to capacity planning. Likewise, information about the priority and importance of users from the business’s perspective is also vital to activities of incident management and problem prioritization. The following diagram illustrates some of these perspectives:

[Figure: User categories and usage pattern perspectives]
Categorizing users into meaningful roles helps reduce the number of individual patterns that one needs to handle. Similarly, defining patterns of usage, such as known peak and idle periods or critical hours of use, aids in establishing a framework against which application usage can be evaluated. For example, claims adjudication systems like the one described earlier often have well-defined peak periods that coincide with the days after holidays and with the period toward the end of a processing cycle. Considering these factors provides more responsive and accurate management than using a simple average-based statistic as a threshold.

Usage patterns are built over a period of time using a system of measurement based on observations of user behavior. Usage and application activity patterns are essentially statistical models that provide a perspective on what could be described as normal or expected behavior at any particular point of time. They serve as baselines against which a management system or an operations manager can evaluate an observed behavior pattern and determine whether it is an incident of abnormal activity.
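A minimal sketch of such a statistical baseline, assuming request counts bucketed by weekday and hour and a simple three-sigma rule, might look like the following; real baselines would be richer, but the principle is the same.

```python
# Minimal sketch of a usage baseline: mean and standard deviation of request
# volume per (weekday, hour) bucket, used to judge whether an observation is
# abnormal. The thresholds and data are illustrative assumptions.

from collections import defaultdict
from statistics import mean, stdev

observations = defaultdict(list)   # (weekday, hour) -> historical request counts


def record(weekday, hour, requests):
    observations[(weekday, hour)].append(requests)


def is_abnormal(weekday, hour, requests, sigmas=3.0):
    history = observations[(weekday, hour)]
    if len(history) < 5:                      # not enough history to judge
        return False
    mu, sd = mean(history), stdev(history)
    return sd > 0 and abs(requests - mu) > sigmas * sd


# Build a baseline for Monday 09:00 (a post-holiday peak in the claims example).
for count in [900, 950, 980, 940, 960, 955]:
    record(0, 9, count)

print(is_abnormal(0, 9, 945))    # False -> within the expected pattern
print(is_abnormal(0, 9, 4000))   # True  -> candidate incident, not yet a diagnosis
```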

Installation and Implementing Changes

Application systems go through continuous change as parts of the system are changed, replaced, or reinstalled. There is a significant difference between a fresh installation, an update, and a reinstallation. Reinstallations are usually performed in the event of disaster recovery or when a major release or version upgrade needs to be performed. Updates, which include the application of patches, are more common and are often automated. Each type of installation requires specific information, often provided by the vendor of the application. The knowledge required for the installation of any release can be summarized in terms of:

  • Specific skill sets that are required to accomplish an installation.
  • Pre-installation checklists and required hardware and software environment.
  • A list of known issues that exist with the application system that is being installed.
  • A detailed installation process with checkpoints where the installation can be validated and either rolled back or continued with.
  • A list of recovery techniques and procedures to undo a deployment if the installation has to be aborted for any reason.
  • A list of validation test cases that can be either automatically or manually executed by the installers of the software to validate that it has been correctly and completely installed.
  • The exact process by which the configuration management system will be updated with post installation data.
  • Metrics related to installation.

Accurate and complete installation instructions provide all the information required for a person with the requisite skills to perform an error-free installation. These instructions enable software to be installed and upgraded repeatedly without expert-level intervention. This information also enables change planning, assessment of an installation's impact on the rest of the system, slotting installations into existing maintenance windows, and so on. The checkpointed, reversible process described in the list above can be sketched in code, as shown below.
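The following fragment is illustrative only; the step names and the rollback behavior are assumptions, not a prescribed installation process.

```python
# A sketch of an installation expressed as checkpointed steps, so that a
# failed step triggers the documented rollback of the steps already applied.
# Step names and actions are illustrative assumptions.

def run_installation(steps):
    """Each step is (name, apply_fn, rollback_fn). On failure, roll back the
    applied steps in reverse order and report where the installation stopped."""
    applied = []
    for name, apply_fn, rollback_fn in steps:
        try:
            apply_fn()
            applied.append((name, rollback_fn))
            print(f"checkpoint passed: {name}")
        except Exception as err:
            print(f"step failed: {name}: {err}")
            for done_name, undo in reversed(applied):
                undo()
                print(f"rolled back: {done_name}")
            return False
    return True


def failing_validation():
    raise RuntimeError("smoke test failed")


steps = [
    ("verify pre-installation checklist", lambda: None, lambda: None),
    ("deploy patch bundle", lambda: None, lambda: None),
    ("run validation test cases", failing_validation, lambda: None),
]

if not run_installation(steps):
    print("installation aborted; configuration repository left unchanged")
```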

Incorrect installation information in the case of a new installation often has a small impact, normally in terms of wasted time and effort. At times, however, incorrect installations leave latent defects within the system that create hard-to-diagnose errors at run time. Incorrect installations in the case of updates can have significantly more serious impacts, including permanent loss of data and lengthy system outages. Installation information, especially in the case of custom-developed software, can be incomplete or can make unstated assumptions about conditions such as environment variables. A best practice adopted by many operational management teams to guard against this is to rehearse new installations, updates, and reinstallations on test environments that closely mimic production in order to validate the correctness and consistency of the installation information.

Expert management teams gather and use vital information from installation-related metrics, such as the time taken to implement a change or a new installation, the volume of change across different parts of the system, or the number of issues faced during installation as part of their change implementation and downtime planning process.

Operational Tasks and Administrative Activities

Certain tasks must be performed as part of the routine operation of the application system, and the information required to accomplish them is essential. Typically, these tasks are oriented toward preventive maintenance and include activities such as log rotation, space management, proactive installation of patches, and service-level reporting. Many of these tasks have specific preconditions and constraints under which they must be executed. For example, many backup-oriented tasks require a suspension of all activity that could modify data for the duration of the backup. Knowledge of the extent of operational tasks and administrative activities is also essential to estimate the cost of managing a particular system, as sketched below.
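As a small, assumption-laden illustration of using this knowledge for cost estimation, the catalog of routine tasks can be reduced to an effort figure; the task names, durations, and frequencies are invented for the example.

```python
# Illustrative only: estimating the monthly operational effort for a system
# from its catalog of routine tasks. Task names, durations, and frequencies
# are assumptions.

tasks = [
    {"name": "log rotation",         "minutes": 10, "runs_per_month": 30},
    {"name": "space management",     "minutes": 20, "runs_per_month": 4},
    {"name": "patch installation",   "minutes": 90, "runs_per_month": 1},
    {"name": "service-level report", "minutes": 45, "runs_per_month": 4},
]

total_hours = sum(t["minutes"] * t["runs_per_month"] for t in tasks) / 60
print(f"estimated operational effort: {total_hours:.1f} hours/month")
# estimated operational effort: 10.8 hours/month
```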

The Monitoring Model

Monitoring is the activity of making observations about the state of an application system. The objective of monitoring is to detect changes to the state of the system and make this information available to systems management. Monitoring in itself is not management, and monitoring information must be used along with the knowledge of application activity and usage behavior to make management decisions.

The knowledge required for monitoring an application system starts with identifying the monitorable aspects of the system from its model. For each monitorable part of the application system (and the system as a whole), you need the following information:

a) What needs to be monitored:

  • The activities that need to be observed; for example, the target of monitoring could be the start and completion of a transaction, the outcome (success or failure) of an activity, or the number of requests processed over a period of time.
  • Application monitoring also often includes following chains or sequences of activities that happen in response to a user request. In a typical Web-based application, these chains could go from a controller object to a piece of business logic to the database abstraction to the database itself as a query, and then follow the same or a similar path back. Operational teams often earmark specific activity chains as the best targets for monitoring.

b) The technique of monitoring, or what must be done to gather the required data:

  • Details of APIs provided or third-party systems that gather the required information.
  • Details of any logs, database records, events that are emitted, network packets that must be observed, and so on.
  • Most of these techniques are available today in the form of prepackaged adaptors to commonly used systems that only have to be configured; it is seldom necessary to write new adaptors.

c) The policy and schedules of monitoring that determine when and how monitoring needs to be performed:

  • Information about the granularity of monitoring. Does every state transition and observed activity need to be recorded, or can specific filters and aggregation techniques be employed to pick up only activities of interest?
  • Monitoring schedules that determine both when the system is under monitoring and when monitoring is turned off. This information is critical to formulating business rules specific to incident detection and service uptime calculations.

Knowledge specific to monitoring comes from several sources, including vendor-provided MIBs and third-party monitoring and instrumentation solutions. This is often augmented by operational teams who tune monitoring to focus on specific areas that are important to them.

The essential aspect of monitoring is determining its scope and granularity: what must be monitored and in how much detail. Granularity that is too fine makes a monitoring system pick up more data than can be handled; granularity that is too coarse allows critical events to go undetected. Monitoring any system also has an impact on the performance of that system; the finer the grain, the greater the impact. Expert teams often use an adaptive system of monitoring in which the grain of monitoring is relaxed during normal operations but intensified at the first sign of trouble.
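A minimal sketch of this adaptive approach, with assumed thresholds and polling intervals, is shown below; the point is simply that the monitoring policy itself reacts to what is being observed.

```python
# A sketch of adaptive monitoring granularity: sample coarsely during normal
# operation and tighten the interval when an observation looks suspicious.
# Thresholds and intervals are assumptions for illustration.

NORMAL_INTERVAL_S = 300     # relaxed polling during normal operation
ALERT_INTERVAL_S = 30       # intensified polling at the first sign of trouble
CPU_WATCH_THRESHOLD = 85    # percent


def next_poll_interval(cpu_percent, consecutive_high):
    """Return (interval, consecutive_high) for the next monitoring cycle."""
    if cpu_percent >= CPU_WATCH_THRESHOLD:
        consecutive_high += 1
        return ALERT_INTERVAL_S, consecutive_high
    return NORMAL_INTERVAL_S, 0


interval, streak = NORMAL_INTERVAL_S, 0
for reading in [40, 55, 90, 92, 88, 60]:
    interval, streak = next_poll_interval(reading, streak)
    print(f"cpu={reading}% -> poll again in {interval}s (high readings in a row: {streak})")
```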

There is a plethora of tools available for application, infrastructure, and system monitoring, ranging from transaction monitoring tools, such as the Introscope tool referred to earlier, to more traditional SNMP-based management tools.

Incidents, Problems, Impact, and Response

Incidents are events that cause a disruption in service at any level. In the course of monitoring, several activities are detected that may or may not be incidents. Every observation of possibly abnormal activity or a threshold being crossed is not necessarily an incident.

The first element of knowledge in this area is information that enables a manager or a management system to detect an incident from a set of monitored observations. Incident detection mechanisms are based on sophisticated and adaptive rules that compare patterns of current behavior against known or expected behavior for a point in time. The rules used for incident detection vary based on the part of the system being observed and the states being observed. For example, if the target of monitoring is a server and its state, a transition from an up state to a down state is a very good candidate to be considered an incident that must be evaluated for impact and investigated for possible cause. If one is looking at CPU utilization, on the other hand, not every instance of crossing a threshold needs to be an incident; but if the utilization crosses the threshold consistently, an incident may be in the making.
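The two kinds of rules described in this paragraph can be expressed directly. The following sketch uses assumed thresholds and window sizes and is illustrative rather than prescriptive.

```python
# A sketch of the two detection rules mentioned above. Window sizes and
# thresholds are illustrative assumptions.

def server_state_incident(previous_state, current_state):
    """A transition from up to down is always a candidate incident."""
    return previous_state == "up" and current_state == "down"


def sustained_threshold_incident(samples, threshold=90, window=5):
    """A single spike is ignored; 'window' consecutive samples above the
    threshold are treated as an incident in the making."""
    if len(samples) < window:
        return False
    return all(s > threshold for s in samples[-window:])


print(server_state_incident("up", "down"))                   # True
print(sustained_threshold_incident([40, 95, 50, 60, 55]))    # False: a lone spike
print(sustained_threshold_incident([92, 95, 93, 97, 94]))    # True: consistent crossing
```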

The second element of knowledge is descriptive information about incidents, including their classification, their source, their history, and statistics, such as frequency of occurrence, conditions in which they were observed, impact on the rest of the system, and so on. This information is key to identifying patterns of incidents; identifying patterns of incidents often helps diagnose their cause. This information also helps in prioritization of incidents based on their impact and in planning proactive steps to prevent the incident from occurring.

The third element is a description of the process you use to handle incidents, from detection to evaluation of impact to communication to stakeholders to the application of a resolution.

The final element of this knowledge area is what is commonly known as a “known error database”: a list of previous incidents and problems and the mechanisms used to resolve them. In today’s environment, tools such as wikis make ideal environments for creating such information bases. They allow the users of the system—in this case, the operational management teams—to collaborate on and enhance the knowledge.

Information specific to incidents and the means of dealing with them is obtained from several sources, including events observed from log files and from user-reported problems. This information almost always needs to be augmented with human interpretation.

Knowledge of Evaluating Application Health

Knowledge in this area deals with knowing how to transform a set of observed behavior in the context of application system activity and usage patterns into a statement about the health of the application system.

The health of a system can be determined by carefully evaluating the state of the system observed from monitoring in the context of known information about its topology and its known behavior. Certain patterns of activity or behavior are sure indicators of a problem in the system; others are less obvious indicators. The connection between an observed behavior and its possible impact on the system is a heuristic derived primarily from the experience of managing the system over time. This information is part of almost every operations management setup, and it is what often enables good operations management teams to react quickly to potential issues, even before they begin to impact the system.

However, this information often resides in the memories and past experiences of operators rather than in a formally documented form. When modeled appropriately and stored within a repository in a standard form, it becomes the basis for proactive incident detection and diagnostic systems. Even in the absence of automated diagnostic systems, documenting observed activities, including context information about what the application and its users were doing at the point of observation and an explanation of what happened in the past when similar events were observed, gives you the basic information required for a manual diagnosis/prognosis model.
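One hedged way to move such heuristics out of operators' memories and into a repository is to record each observed pattern together with its documented interpretation, as in the sketch below; the rules shown are invented for illustration.

```python
# A sketch of documenting health heuristics in an executable form. Each rule
# pairs an observed pattern with the interpretation operators have recorded
# for it; the rules themselves are illustrative assumptions.

health_rules = [
    {
        "observation": lambda o: o.get("queue_depth", 0) > 1000 and o.get("workers_busy", 1.0) < 0.2,
        "interpretation": "work is arriving but not being processed; likely a hung worker pool",
        "severity": "degrading",
    },
    {
        "observation": lambda o: o.get("error_rate", 0) > 0.05,
        "interpretation": "error rate above 5%; past occurrences preceded a service outage",
        "severity": "unhealthy",
    },
]


def evaluate_health(observed):
    findings = [r for r in health_rules if r["observation"](observed)]
    if not findings:
        return "healthy", []
    worst = "unhealthy" if any(f["severity"] == "unhealthy" for f in findings) else "degrading"
    return worst, [f["interpretation"] for f in findings]


state, notes = evaluate_health({"queue_depth": 2500, "workers_busy": 0.1, "error_rate": 0.01})
print(state)   # degrading
print(notes)
```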

Environmental Policies and Standards

Policy information includes documented requirements for procedures, security, confidentiality, and so on, that must be followed in every aspect of application management. Policy information is driven by several factors, including the legal and regulatory environment the business operates in (for example, SOX or HIPAA compliance), enterprise security and IT policy, and governance and statutory requirements. Policy information is required to formulate operational processes and has a significant impact on the service levels that can be committed; it also has an impact on the design of the application system. For example, if there is a mandatory requirement for client financial information to be retained unchanged for a period of five years, the service infrastructure must include the necessary systems for encryption and write-once/read-many storage, along with audit logs and so on.

Service Level Commitments

The knowledge of service commitments addresses the following areas:

  1. The SLAs and QoS commitments that the application system must guarantee
  2. The impact of not meeting these commitments
  3. The management requirements in terms of reporting, resolution of issues, and improvement of service quality

SLAs are statements that describe the quality of service required, expressed in clearly measurable terms. The measures within SLAs are often functions of the reliability, availability, supportability, and performance of the system. This knowledge is often directly used by SLA monitoring tools to set range-based thresholds for early warning against potential SLA breaches.
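As an illustration of turning an SLA figure into a range-based early-warning threshold, the following sketch assumes a 99.9% monthly availability commitment and a warning once 80% of the downtime budget has been consumed; both figures are assumptions for the example.

```python
# Illustrative sketch: evaluating a monthly availability commitment and a
# range-based early-warning threshold. The figures are assumptions.

SLA_AVAILABILITY = 0.999          # committed: 99.9% availability per month
WARNING_MARGIN = 0.2              # warn once 80% of the downtime budget is used


def downtime_budget_minutes(minutes_in_month=30 * 24 * 60):
    return minutes_in_month * (1 - SLA_AVAILABILITY)


def sla_status(downtime_so_far_minutes):
    budget = downtime_budget_minutes()
    if downtime_so_far_minutes > budget:
        return "breached"
    if downtime_so_far_minutes >= (1 - WARNING_MARGIN) * budget:
        return "early warning"
    return "within commitment"


print(round(downtime_budget_minutes(), 1))   # 43.2 minutes of allowed downtime
print(sla_status(10))                        # within commitment
print(sla_status(38))                        # early warning
print(sla_status(50))                        # breached
```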

The impact of missing SLAs is usually defined in terms of criticality of service, impact created by service-affecting incidents in terms of financial loss, users affected, and so on, in addition to the number and frequency of such incidents. The result of a missed SLA is often a financial liability to the service provider, either in terms of direct penalties or missed rewards.

A critical part of the information required to manage service commitments deals with the escalation and notification procedures to follow when incidents occur. This information describes who is informed, when, and in what detail; the process of escalation; and the notification to be sent when normal service is restored.

The last aspect of knowledge required to manage service commitments is information about how and when reports of performance against promised service levels must be provided to the owner of the service, along with all supporting details, including the format of reports, the amount of detail to be provided, and the frequency of reporting.

The Shape and Form of the Body of Knowledge

The body of knowledge is best implemented as a software model with a well-defined interface that allows it to be queried and updated by other software applications that automate systems management tasks. There are several benefits to doing this, including the following:

  • It serves as a single source of all configuration information, including:
    • Prescriptive configurations.
    • Best practices.
  • It allows for the capture and tracking of configuration states and changes.
  • It provides a standard platform for:
    • Operational management.
    • Coordination and consistency.
    • Design and development.
  • It aids change management via:
    • Model-based validation.
    • Dependency and impact analysis.
  • Configuration management and updates are dynamically and automatically performed, ensuring accuracy.

The information in the knowledge base is of different data types. Portions of it, such as the objects in the topology and their attributes, lend themselves to be managed as records in a database; other parts of it, such as descriptive information about processes, are text, while certain aspects, such as the creation of a configuration file, may be expressed in programming terms.

The knowledge base has to grow and evolve over a period of time and capture the enterprise’s experience of managing its systems. The knowledge base also has to adapt and change as the systems in the enterprise change. The realization that a large part of the information in the knowledge base has to be collaboratively created by its users is key to its growth and evolution.

The organization and the extensibility aspects of the knowledge base are well addressed by the adoption of Web 2.0 techniques of collaboration and semantic organization. Internally, the knowledge base is often implemented using several tools, including Wiki-like tools for collaborative documentation, workflow tools for incident management, and data structures implemented using relational/XML/object databases for topology representation, automated incident detection, and so on. Today’s CMDB tools do not yet provide all the structures and interfaces required to represent and use the knowledge base. They form an essential part of the implementation and capture details, such as the system topology and the configuration information, but the information in them must still be augmented.

Standards in this space are still emerging with the System Definition Model and Dynamic Systems Initiative providing possibly the closest answer to a widely accepted standard.

About the Authors

Nagaraju Pappu has more than 15 years of experience in building large-scale software systems. He has worked for Oracle, Fujitsu, and several technology startups. He holds several patents in real-time computing, enterprise performance management, and natural language processing. He is a visiting professor at the Indian Institute of Technology, Kanpur, and the International Institute of Information Technology, Hyderabad, where he teaches courses on software engineering and software architecture.

His areas of interest are enterprise systems architecture, enterprise applications management, and knowledge engineering.

Satish Sukumar has more than 13 years of experience in software architecture, design, development, implementation, infrastructure management, and customer support. He has held various positions at Microland and Planetasia over a ten-year span. He spent three years with Veloz Global Solutions’ R&D center in Bangalore as their Vice President of Engineering.

Satish specializes in enterprise software architecture. His research interests include knowledge representation, performance measurements and management, real-time data analytics, and decision support and workflow/agent–based distributed computing.

Nagaraju and Satish are currently independent technology consultants, working out of Bangalore, India. They can be contacted via their company Web site: www.canopusconsulting.com. They also maintain a blog—www.canopusconsulting.com/canopusarchives—where they regularly write about topics related to design theory, software architecture, and technology.

 
