"It's simply not fair," says the executive to his IT manager, "every time this happens, it's their fault and not ours, right? I know it's our responsibility to plan around their popularity, but this is getting ridiculous."
"Well, most of the time," responds the IT manager, "Sometimes it really is our fault. This is a complicated system, one that we've been incrementally upgrading and expanding over the years. I mean, by this version, some of that code has got to be spaghettied together so poorly that it'll be impossible to figure it out."
The executive continues, "But isn't there anything we can do to predict this sort of excessive activity, to plan for it?"
"No. At least not with what we've got today. I'll argue that the site is technically still up and running. We're not 'technically' down."
"I find that hard to believe…"
Frustrated, the executive sits back in his chair and swings to look out the window as the IT manager leaves his office. It is just this sort of conversation that irritates him about the technology focus of this business. But what can he do about it?
Dan Bishop is COO of TicketsRus.com, a national retailer of tickets for concerts and sporting events. TicketsRus.com is one of the largest online retailers of tickets, selling tens of thousands each and every day for everything from the smallest backwater rock band to the hottest pro basketball teams during their end‐season games. Although TicketsRus.com sells a portion of its tickets through its phone operations facility, a massive phone bank located in Waco, Texas, the vast majority of tickets today are sold through its Web site, currently hosted in Denver, Colorado.
That Web site is the single greatest piece of TicketsRus.com's intellectual property, and a primary reason behind their "convenience fees," which serve as the profit for the business. The TicketsRus.com Web site is a massive, custom‐developed online service that tracks incoming requests, sells tickets, and enables "gatekeeper" functions to prevent overload during periods of high traffic. It even prints and mails the tickets once customers check out. Because the system is almost fully automated, you could argue that TicketsRus.com's Web site is TicketsRus.com.
And, today, that Web site is having a problem.
You've experienced situations like this before. Dan's problem with his company's primary mechanism for making money is a situation felt all across the IT landscape. In today's online e‐commerce climate, more and more companies are leveraging the Internet as a primary—if not singular—location for hosting their wares. With the Internet, inventory and labor costs are dramatically reduced, as are the costs of the brick‐and‐mortar storefronts that now no longer must be leased and maintained. With the automation brought about by a computing infrastructure, far more productive work can be done with far less manual intervention.
Yet moving one's operations to an online facility incurs its own set of risks. In the case of brick‐and‐mortar, the loss of a store due to a power outage, a massive snowstorm, or a run on available products means that customers can still go elsewhere for their needs. In the case of online, the Internet presence is singular. It must operate at all times, with an acceptable degree of performance, and in such a way that it gives confidence to its customers that they're getting value out of the experience.
This is exactly today's problem with TicketsRus.com, and it's not their fault. In today's problem, an extremely popular artist has come out of retirement for a new tour, and that artist's adoring fans have completely overloaded the system to look for tickets. The simultaneous inrush of new business has effectively shut down the site, turning what should have been a profit success into Dan's current operational nightmare.
And he's not sure if his IT Manager really understands the gravity of the problem…
There's a central problem intrinsic to many IT organizations. This problem relates to IT's ability to consider itself an integral part of the business, and ultimately the profitability of that business. The problem isn't necessarily sourced from IT itself. In its relatively short history, "the people who fix the computers" have long served as a secondary function of business. For a very long period of time, the only time IT professionals were needed—or even seen—on the business floor was when something broke. Having a problem with your computer today? Call the IT Help desk line (see Figure 2.1) and someone will magically appear at your desk in a few hours.
Figure 2.1: IT is seen as "the people who fix," a common sight in many businesses.
When the business didn't need IT, these groups of people usually found themselves shuffled away to other parts of the building. Taking over closets and storage rooms behind locked doors, there this group awaited the next problem to be fixed.
Over time, this break/fix mentality becomes deeply ingrained in both the members of IT and the rest of the business who rely on them for services. When IT operates in a break/fix mode, they usually find themselves reacting to problems. A critical server is down today? Here come the IT "white knights", riding in to work through the night and ultimately save the day.
But at the same time, the break/fix mentality's "hero effect" actually becomes a liability to the business. IT organizations that see themselves as the heroes to be called when problems occur probably aren't spending the right amount of time preventing those problems from occurring in the first place. If that critical server was actually reporting a problem for weeks before it finally crashed, IT is no hero in getting it running again; they're actually the problem.
Why this disconnect between IT and the business? Other than a historical position inside the company's locked storage closets, what are the causes behind IT's reactive mindset? Differing responsibilities and mismatched priorities with the rest of the business, a lack of common vocabulary, and a missing vision into the business' dollars and cents are all common factors.
In the story at the beginning of this chapter, the business of TicketsRus.com is brokering tickets between artists and sports teams and their end consumers. TicketsRus.com makes its business by providing a convenient service to its customers, making it easy for them to find and purchase the tickets they want for the events that interest them.
To this end, TicketsRus.com likely has a massive marketing department. The job of that team is to make potential customers aware that their service exists. They probably have a sales department who find new events, artists, and teams to sell on their Web site. Their executive management team's primary responsibility is to ensure that the company runs optimally with good profit and expected return. Each of these groups has a primary mission that aligns with creating and maintaining the flow of TicketsRus.com's business.
In contrast, TicketsRus.com's IT department has a different goal entirely. Their stated goals are quite distinctive in scope. TicketsRus.com's IT department is responsible for and charged with maintaining the operations of the computer systems for the rest of the company. That charge includes the massive online presence where the company makes most of its profit. When TicketsRus.com makes a profit, the IT department continues to keep the computers running. When TicketsRus.com doesn't make a profit, the IT department continues to keep the computers running.
It is this mismatch of responsibilities where many problems with prioritization can occur between IT and the business. When break/fix trumps profitability, the business ultimately loses in the long run. In its current state, the priority of TicketsRus.com's IT department is to ensure that their computing infrastructure is up and operational. As a major part of that infrastructure, maintenance of the online presence is a primary responsibility as well.
However, there's a problem when the metrics associated with what is considered "up and running" are not well defined. Whereas the IT Manager sees the current situation as a temporary hiccup in the otherwise smooth running of the online system, this individual likely isn't aware that this short hiccup could become the source of major revenue loss for the business. Because he hasn't planned for such a contingency, he truly isn't aware of the gravity of the situation.
Figure 2.2: At a high level, APM can measure when user load negatively impacts overall system performance.
This isn't necessarily to say that the IT Manager's lack of planning is entirely his fault. When the IT Manager hasn't been handed down the correct kinds of metrics to use in measuring success, he won't be looking in the correct places to find it. As you'll find in this guide, one of the tenets of Application Performance Management (APM) is to provide a mechanism for defining just those metrics. Lacking a system in place that can look at system performance as the sum of its parts (see Figure 2.2), it is difficult or impossible to accurately measure the success of that system. APM and the solutions that enable it provide just those measurements.
IT also suffers from its high level of technical vocabulary that isolates it from other members of their business. The graph in Figure 2.2 makes sense when it is defined within the scope of metrics that make sense to IT: % Processor Use, Transactions/Second, Java JRE Method Timeout, and so on. Metrics such as these, however, are useless when attempting to provide information to the non‐technical members of the business. This breakdown in communication further illustrates the chasm between IT and the business because business leaders cannot relate their desired goals to IT in subjective terms that translate to objective metrics.
This lack of a common vocabulary isn't necessarily limited to technical versus non‐technical members of the business. Intrinsic to the IT organization itself are various disciplines, each of which has its own vocabulary for describing the system as they see it:
A major problem with this stratification of IT personnel is that no one group can alone comprehensively describe the behaviors across every component of a system. If a system problem spans multiple domains, teams must work particularly hard towards finding a resolution.
Figure 2.3: APM provides a type of Rosetta Stone, aligning each IT discipline's focus under a unified solution.
An APM solution assists with this language problem by providing what could be considered a Rosetta Stone between each IT discipline, their individual focus, and their own vocabulary. Although individual integrations within an APM solution are likely to be managed by their responsible discipline—network integrations by network teams, code optimization integrations by developers, and so on—the unified system provides a central gathering point for all metrics. This centralization provides a single location where an application can be measured across each of its IT disciplines at once. Such an analysis can be further correlated across all disciplines as well.
The end result is that a fully‐realized APM solution enables IT to operate as a single unit, with system problems being quickly directed to the teams that have the greatest capability to fix the problem.
Technologists are called thus because of their focus on technology. And while a technology focus and a budgetary focus needn't necessarily be mutually exclusive, they often tend to become so as an individual's depth in technology increases. Unlike virtually every other function of business, the individual members of IT are often not privy to the specifics that make up the business' or their department's budget.
In the long run, this lack of financial information removes IT's empowerment to solve problems based on their budgetary impact. When IT is incentivized towards resolving broken system components, they'll fill their day with accomplishing just that task. Those repair operations, however, might not be the best thing for the system over the long haul:
Figure 2.4: APM and its integrations provide the raw data that feed Business Service Management's financial view of the system.
The relation of a service's quality to the IT and business budget technically falls within the purview of Business Service Management (BSM), a topic that will be discussed in Chapter 9. However, there is a very important relation between BSM and APM in that BSM requires the metrics gained from an APM solution to populate its business‐centric view of the system. You'll find that although APM provides the technology metrics, its combination with business financial logic is what powers BSM's view of the world. Figure 2.4 shows an example of the linkage between these key components.
Lastly, and most importantly, a reactive approach to maintaining a system is ultimately a drain on the business' ability to get the job done. When IT looks at problems and solutions from the limited perspective of up/down or functioning/non‐functioning, they're not looking into the deeper issues. These deeper issues may not necessarily manifest themselves into actual visible outages but are a drain on the application's ability to complete its mission.
Consider again the problem situation first explained in Chapter 1. There, a slow response time between the mainframe and the application server eventually grows to impact the system as a whole. By nature, these kinds of system events often occur over a period of time, growing in scope—and delay—until a minor situation becomes a major problem.
Figure 2.5: An APM solution's high‐level client‐network‐server view can illustrate where areas of delay may soon cause a problem.
Figure 2.5 shows an APM system's view of aggregate transaction performance between the application server and the mainframe. With this view and others in place, it is possible to draw a trend line towards a future failure before the failure actually occurs. This capacity for defining possible pre‐failure states enables IT to resolve problems before they actually happen and before users notice. As you'll learn in the next section, this proactive approach to operations is representative of an IT organization at a high level of maturity, one that drives value back to the business rather than reacting to it.
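The idea of drawing a trend line towards a future failure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the daily response-time values, the 200 ms failure threshold, and the function names are all hypothetical, and a simple least-squares line stands in for whatever trending method a real APM product uses.

```python
from typing import Optional, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_threshold(daily_ms: Sequence[float],
                         threshold_ms: float) -> Optional[float]:
    """Estimate days until the trend line crosses threshold_ms.

    Returns None when response times are flat or improving, and 0.0 when
    the trend line has already crossed the threshold.
    """
    days = list(range(len(daily_ms)))
    slope, intercept = fit_line(days, daily_ms)
    if slope <= 0:
        return None
    crossing_day = (threshold_ms - intercept) / slope
    return max(0.0, crossing_day - days[-1])


# Hypothetical daily average response times (ms) between two tiers.
observed = [100, 110, 120, 130, 140]
remaining = days_until_threshold(observed, threshold_ms=200)
```

With these made-up numbers the delay grows by about 10 ms per day, so the projection leaves roughly six days to act before users would notice. A straight line is a deliberately simple choice; real degradation is often nonlinear.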
For an organization to efficiently make use of the kinds of information that an APM solution can provide, it must operate with a measure of process maturity. IT organizations that lack configuration control over their infrastructure don't have the basic capability to maintain an environment baseline. Without a baseline to your applications, the quality of the information you gather out of your monitoring solution will be poor at best and wrong at worst.
But how does an IT organization know when they've got that right level of process in place to best use such a solution? Or, alternatively, if an organization recognizes that they don't have the right level, how can an APM solution help them get there?
One way to evaluate and measure the "maturity" of IT is through a model that was developed in 2003 as part of a Gartner analysis titled Transforming IT Operations into IT Service Management. This groundbreaking white paper defined IT across a spectrum of capabilities, each relating to the way in which IT actually goes about accomplishing its assigned tasks. An IT culture with a higher level of process maturity will have the infrastructure frameworks in place to make better use of technology solutions, solve problems faster, plan better for expansions, and ultimately align better with the needs and wants of the business they serve.
Process maturity within an organization is defined as quite a bit more than simply having the ability to solve problems. Within Gartner's maturity model, the capacity of IT to solve— and prevent—ever more complex problems was defined largely by its level of process maturity.
It is perhaps best to explain this concept of immaturity through the use of an example. Consider an organization that completely lacks any documentation of its internal systems. Such an organization is also likely to lack formal change control processes by which others are notified about changes to those systems. In such an organization, a system can be configured and later reconfigured at the whim of a single administrator. If an administrator or developer finds a problem on the system, they resolve the problem as they see it, notify no one, and continue about their day.
At first blush, the rate at which problems can be identified and resolved in such an environment can seem extremely beneficial. Administrators or developers who find issues can quickly resolve them as they see them, without the need for complex and time‐consuming paperwork, workflow, approvals, and documentation. Such an organization can run exceptionally "lean and mean" with their infrastructure, as the overhead associated with the process itself is nonexistent.
However, such an organization also lacks accountability. It lacks cross‐communication between members. It also lacks the basic infrastructure necessary to validate the configurations on each component in the IT environment. If one administrator is working on a problem and a second finds the same problem, time is wasted as the two individuals enact simultaneous change.
Often, the lack of cross‐communication causes further problems down the road. Perhaps the problem condition was actually necessary for the troubleshooting of a completely separate problem. Perhaps the problem wasn't a problem at all, but a symptom of a much larger problem. In the worst of cases, the lack of configuration control inhibits IT's power in seeing the signs of problems before they cause impact to the user population.
In short, although an immature IT organization might be more agile in actual problem resolution (for example, clicking the right button), they'll achieve those gains at the cost of dramatically less agility in preventing the problem in the first place.
Gartner defines five stages in which an IT organization can exist: Chaotic, Reactive, Proactive, Service, and Value. At each stage (see Figure 2.6), the IT organization is defined by a set of characteristics. These characteristics illustrate the types of activities and behaviors that are seen in the organization's culture. Though these characteristics are not necessarily an objective checklist, it will likely be obvious which stage your own organization fits into.
As organizations move from one stage to the next, they will find more documentation of processes with less replication of work, greater and more advanced levels of configuration control, different incentives for determining what is considered success, greater maturity in monitoring, and the implementation of toolsets that enable richer planning and more effective budgeting. With Figure 2.6 in mind, let's take a more detailed look at the phases, how organizations behave, and what benefits they get from each.
Figure 2.6: Gartner's five‐stage IT maturity model with relevant characteristics at each stage.
The Chaotic stage is arguably the best defined by its name alone. Organizations as well as IT infrastructures in the chaotic stage experience a level of (barely) controlled chaos. Servers, desktops, and network infrastructures are all individually managed with no documentation of their configuration or areas in which changes can be announced to others inside and outside the organization. Although Chaotic‐stage environments are relatively common in smaller businesses, they are by no means defined by size.
In the Chaotic stage, IT organizations tend to focus on the use of native or freeware tools for managing their infrastructure. They are constantly putting out fires within technology they don't understand. Monitoring and management elements are not in place, which generally means that IT is notified about problems when users call to complain. IT organizations in this phase tend to lack the basic understanding of the systems they manage, let alone the deep understanding necessary to do well with an APM solution. Due to the break/fix approach to problems, the rare APM implementation here often goes unused once implemented as no time exists to actually employ its capabilities.
At some point, IT organizations and the processes that bind them eventually begin to grow the very basics of structure. In the Reactive stage, organizations begin to actually implement tools for assisting them with the management workload. Problem resolution in this phase is still accomplished through a break/fix mentality; however, the level of consideration for environment‐wide solutions begins to grow beyond zero.
In the Reactive stage, simplistic problem management solutions such as work order tracking systems may be incorporated. Yet in this stage, the specifics of their use are often not enforced through an agreed‐upon set of rules. Work order tracking systems here are used for the individual technician workflow, not necessarily for the tracking of configuration changes. In the Reactive stage, monitoring may be implemented, but in this stage, that monitoring is limited to the core availability of the device itself.
In this stage, APM solutions will not necessarily drive a direct benefit to application performance, as performance is not yet valued over simple availability. Environments here are still focused on managing the inflow of problems as they come in, and as such, don't have the time to actually focus on analytics and problem prevention. Smart organizations can leverage the implementation of more basic APM integrations during this period as a mechanism to quickly drive the organization to a higher level of maturity. Such an implementation in this stage will require the corresponding process and workflow necessary to turn APM data into useful product.
Once an IT organization's culture makes the conscious decision to move away from firefighting as a way of life, it can be considered on the path to the Proactive stage. It can be argued that most IT organizations today exist somewhere between the Reactive and the Proactive stages, with varying levels of process and workflow in place.
A major determinant between these two stages is related to the number of individuals who have successfully removed themselves from the direct resolution of problems. The time for these individuals is freed towards looking at rational, automated, and environment‐wide solutions for preventing problems before they impact the user population. Here, the proper levels of monitoring are likely in place to validate more than simple up/down availability. Usage trends are monitored and analyzed, with thresholds for alerts in place to notify responsible individuals. Automation tools are additionally used to enable repeatable actions to occur on systems when conditions occur. Automated remediation capabilities may be introduced in this stage as well. Found also in this stage are mature processes for problem management as well as asset, change, and configuration control.
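As a rough illustration of the alert thresholds described above, the following Python sketch flags a metric only after it stays above its limit for several consecutive samples. The metric names, limits, and the `Threshold` and `check_alerts` names are hypothetical, chosen for this example rather than taken from any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str        # hypothetical metric name, e.g. "cpu_percent"
    limit: float       # a sample above this value counts as a breach
    consecutive: int   # breaches in a row required to alert (assumed >= 1)


def is_breached(samples: list[float], threshold: Threshold) -> bool:
    """True when the most recent `consecutive` samples all exceed the limit."""
    recent = samples[-threshold.consecutive:]
    return (len(recent) == threshold.consecutive
            and all(value > threshold.limit for value in recent))


def check_alerts(history: dict[str, list[float]],
                 thresholds: list[Threshold]) -> list[str]:
    """Return the names of metrics that should notify a responsible individual."""
    return [t.metric for t in thresholds
            if is_breached(history.get(t.metric, []), t)]


# Hypothetical usage data collected by monitoring.
history = {
    "cpu_percent": [50, 92, 95, 97],
    "tx_per_sec": [300, 310, 305],
}
thresholds = [
    Threshold("cpu_percent", limit=90, consecutive=3),
    Threshold("tx_per_sec", limit=500, consecutive=2),
]
alerts = check_alerts(history, thresholds)
```

Requiring consecutive breaches is one simple way to avoid paging someone for a single momentary spike. The returned list could just as easily feed an automated remediation step as a notification.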
For organizations here, the implementation of an APM solution can arguably have the greatest benefit to their business. Once fully in this stage, the IT organization understands the wholesale changeover from the "break/fix" to the "keep it running" mentality. Lacking in this stage are the real linkages between individual devices and components of the greater system. As such, a system‐wide view of applications and business services is still lacking in maturity. Implementing APM here can quickly move an organization to the next level of maturity.
A major determinant between the Proactive stage and the Service stage is related to the organization's primary focus. When an organization continues its focus on individual technologies as opposed to how those technologies integrate into a deliverable whole, that organization remains in the Proactive stage. There, they are proactively resolving problems, but they are still focusing on the problems and problem prevention. When that organization leaps towards managing the services they deliver to the business in whole, they have successfully arrived in the Service stage. In this phase, you'll often see IT delivering their own customized services to the business with unique names and focuses rather than merely referring to product names they acquire from vendors.
Getting to the Service stage can be critically important to today's businesses, especially those who have a large stake in e‐commerce operations. Service‐oriented thinking and the service‐oriented approach to monitoring and management means that the loss of individual elements has less of an impact. Redundancy and compensating mechanisms are usually in place to ensure service reliability in the case of a single failure. In this stage, IT also finally begins to understand not only their costs but also their quantitative benefits back to the business. They can value their services appropriately and back up those valuations with analytic data arriving from their own monitoring solutions. Although Service Level Agreements (SLAs) are often seen in previous stages, it is only here where their quantitative fulfillment can be truly recognized, often in real time.
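To make the notion of quantitative SLA fulfillment concrete, here is a minimal Python sketch. It assumes an SLA phrased as "at least some percentage of requests complete within a target response time." The sample timings, the 500 ms target, and the 95% commitment are hypothetical values chosen only for illustration.

```python
from typing import Sequence


def sla_compliance(response_times_ms: Sequence[float], target_ms: float) -> float:
    """Percentage of requests that completed within the target time."""
    if not response_times_ms:
        raise ValueError("no requests were measured")
    within = sum(1 for t in response_times_ms if t <= target_ms)
    return 100.0 * within / len(response_times_ms)


def meets_sla(response_times_ms: Sequence[float],
              target_ms: float,
              required_percent: float) -> bool:
    """True when measured compliance satisfies the agreed percentage."""
    return sla_compliance(response_times_ms, target_ms) >= required_percent


# Hypothetical response times (ms) gathered by monitoring.
measured = [120, 250, 900, 300, 180, 2100, 400, 350, 220, 310]
compliance = sla_compliance(measured, target_ms=500)
fulfilled = meets_sla(measured, target_ms=500, required_percent=95.0)
```

In this made-up sample, eight of ten requests finish within the target, so compliance is 80% and a 95% commitment is not met. Running the same calculation over a sliding window of recent requests is one way such a figure could be tracked in near real time.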
In this stage, an APM solution, or one that functionally resembles it, is likely in place. Solutions like APM are necessary in order to gain the situational awareness IT needs to best manage its environment as an overarching system. At the same time, IT finds itself using that system with the goal of recurring improvement, looking for and resolving non‐optimized areas before users are impacted.
Once IT fully loses its identity as a separate function of business, they can be considered a partner with that business as opposed to merely servicing its interests. In the Value stage, IT metrics are business metrics, as is the reverse. The role of IT is as enabler for business processes, and as such, those processes are not considered without IT as a primary stakeholder. IT is also a co‐equal in business planning, as new endeavors invariably include a technology component.
Most organizations never achieve the Value stage of IT, as recognition is required from both IT and the surrounding business for elevation to this stage to occur. However, those organizations that make it to this stage find that their solutions (in this case, both APM and BSM) provide them with real‐time validation of success or failure. In the Value stage, IT can be considered no longer just the "utility" but the business itself.
All things considered, an organization at higher levels of maturity will tend to have a greater capacity for understanding and using its APM solution data. Thus, an APM solution can be useful for both validating an organization's existing maturity as well as assisting in the rapid movement from one stage to the next.
For an example of this, take a look at Figure 2.7. There you'll see an example visualization from an APM solution. The information in the figure displays the expected response time for a specific Web service call, broken down among the amount of time consumed by the client, network, and server components of the request.
Figure 2.7: An APM solution's Response Time Predictor visualization.
Completing a request of this type will require an amount of time from each of these three elements of the IT environment:
Intrinsic to this request are a number of variables that require an IT organization with a high level of maturity if the information is to be of value. To gain the greatest amount of value from this information, such an organization must have:
The chapters of this guide that follow will discuss how these visualizations can be used in greater detail. For now, know that an APM solution provides a data‐driven benefit to the business in two ways. First, an APM solution provides the necessary level of monitoring to enable IT to better facilitate the needs of the business. This reason is what this chapter is all about. By implementing an APM solution, you very quickly gain the ability to drill deep into the individual components of your business applications towards fixing problems or finding areas of improvement.
Secondly, and arguably more importantly, smart organizations can leverage an APM solution itself to rapidly develop process maturity in an otherwise immature organization. By reorganizing your IT operations around a data‐driven approach with comprehensive monitoring integrations, you will find that you quickly begin making IT decisions based on their impact to your business' applications. You will better plan for augmentations based on actual data rather than the contrived anticipation of need. You will better budget your available resources based on actual responses you get out of your existing systems.
In the end, leveraging an APM solution for your business services and applications will make you a better IT organization.
With IT's movement from one stage to the next, the entire culture of the organization changes as well. IT at higher levels of maturity has the capacity to accomplish bigger and better projects. But IT at higher levels of maturity also thinks entirely differently about the tasks that are required. Figure 2.6 does a good job of explaining how that thought process evolves.
As the maturity of IT's tools grows, so does the predictive capacity of those tools. It was discussed in Chapter 1 that solution platforms such as those that fulfill APM's goals extend their monitoring integrations throughout the technology infrastructure of a business. Because APM's reach is so far into each of a business application's components, it grows more capable than point solutions for finding the real root cause behind problems or reductions in performance.
Consider again the situation in Figure 2.7. A complex performance issue in a business application can occur across client, server, and network components in the environment. The client can experience delay due to other processing or issues with the underlying client infrastructure. The network can be oversubscribed with traffic, or client network requirements can be greater than existing network components can handle. Servers can be non‐optimized in their processing or simply be overloaded.
Independent point solutions generally monitor only one of these three components at a time, making the consolidation of data across separate systems with separate databases, consoles, and formats extremely difficult. The resulting graphic in Figure 2.7 that breaks down such an issue by its impact in each problem domain presents a way to quickly identify the location of the problem.
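The breakdown by problem domain can be illustrated with a short Python sketch. It assumes the three timings for a single request have already been collected into one place; the millisecond values and function names below are hypothetical and are not taken from Figure 2.7.

```python
def share_of_total(timings_ms: dict[str, float]) -> dict[str, float]:
    """Express each component's time as a percentage of the whole request."""
    total = sum(timings_ms.values())
    if total <= 0:
        raise ValueError("total response time must be positive")
    return {name: 100.0 * value / total for name, value in timings_ms.items()}


def slowest_component(timings_ms: dict[str, float]) -> str:
    """Name of the component that consumed the most time."""
    return max(timings_ms, key=timings_ms.get)


# Hypothetical timings (ms) for one Web service call.
request = {"client": 400, "network": 2600, "server": 1000}
shares = share_of_total(request)
culprit = slowest_component(request)
```

Here the network accounts for 65% of the request, which points the investigation at the network team first. The calculation is trivial once the data sits in one place; the hard part that a unified solution addresses is gathering comparable timings from all three domains.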
Focusing further into that graphic presents the new picture that is Figure 2.8. This image shows how such a graphic might be constructed through monitoring integrations installed to network devices, servers, and even to the clients themselves. The result is a holistic picture of transaction time itself, broken into its disparate elements. For more information, drill‐down capabilities intrinsic to the interface provide a way to discover more details about each portion as necessary to resolve the situation.
Figure 2.8: Transaction timing occurs across client, network, and server components.
When IT and the business are aligned in their goals and expectations, everyone benefits. IT finds itself assisting the business in actually creating and maintaining business rather than simply focusing on the health of systems. The business gains because a set of highly‐skilled technologists now participates equally in identifying business opportunities, optimizing processes, and sharing in the goals of everyone else. An aligned IT organization thinks less about which device to purchase or fix and more about how that device integrates with the rest of the business infrastructure.
This alignment happens along a number of technology axes. Alignment enables IT to better scope projects for greater success and to de‐emphasize projects or technologies that enhance IT but stand in the way of business workflow. This business impact to IT projects ensures that those projects are visible to business leaders. Such visibility enables those leaders to be a greater stakeholder in IT projects, further ensuring that their incorporation makes sense for the future. Lastly, alignment provides a way to convert a reactive IT organization to a proactive one.
It has been said that 70% of the average IT budget is earmarked for existing projects, leaving only 30% for new projects on an annual basis. For the new projects, roughly 60% fail to meet their original goals or schedule. Primary reasons for failure include cost overruns, missing schedule goals, and end solutions that are "riddled with defects and don't accomplish the business goals for which they were designed."
A major source of the problem occurs when IT isn't capable of scoping projects in a way that makes sense for the rest of the business. This scoping problem can relate to:
APM solutions provide a metrics‐based approach for identifying the before‐ and end‐state of entire systems. Individual database transactions can be traced back to specific requests. The performance of newly‐inserted lines of code can be compared with those from previous versions to ascertain their efficacy. Mainframe and server processing can be measured over extended timeframes to validate improvements. The aggregation of individual monitoring integrations provides a platform for validating the success of IT projects and preventing their impact on others.
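A before-and-after comparison of this kind might look like the following Python sketch. It assumes response-time samples were captured for the same transaction under the previous and the new code version; the sample values are hypothetical, and the median is used here only as one reasonable summary statistic.

```python
from statistics import median
from typing import Sequence


def percent_change(before_ms: Sequence[float], after_ms: Sequence[float]) -> float:
    """Percent change in median response time; negative means faster."""
    baseline = median(before_ms)
    if baseline == 0:
        raise ValueError("baseline median must be non-zero")
    return 100.0 * (median(after_ms) - baseline) / baseline


# Hypothetical samples (ms) for one transaction, before and after a release.
previous_version = [200, 220, 210, 230, 205]
new_version = [150, 160, 155, 170, 158]
change = percent_change(previous_version, new_version)
```

With these made-up samples the median drops from 210 ms to 158 ms, a reduction of about 25%. The median is less sensitive to a single slow outlier than the mean, although a real validation would also look at the tail of the distribution.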
It is. This chapter has focused heavily on the maturity gains that can be achieved through the implementation of an APM solution. Yet if you look through the various APM solutions on the market today, you'll quickly see a heavy technology focus. As future chapters will discuss, APM's technology focus is its bread and butter. You'll find yourself using an APM solution for determining transaction rates and isolating network performance issues, among others.
At the same time, you'll find that APM's expanded situational awareness enables the smart IT organization to become more business‐aligned. Nowhere is this more pronounced than in businesses that rely heavily or exclusively on their technology infrastructures. E‐commerce businesses are particularly impacted. This is the case because in businesses like e‐commerce, the technology is the business. As such, having that enhanced vision into your business' technology underpinnings means knowing your customers, and your storefront, that much better.
Chapter 3 will continue this introduction with a more technical look at the underpinnings that make up APM's monitoring integrations themselves. You'll understand the history of monitoring as well as how monitoring has evolved over time to become what it is today. Chapter 3 will discuss how multiple levels and types of monitoring are necessary to gain that holistic awareness you want out of an APM solution.