Anyone working in any kind of business is familiar with this scenario: Users complaining that their applications have slowed down, managers getting tense that their employees aren't working, and the IT team running around trying to pinpoint the cause of the problem. In this Essentials Series, I'll share some of my own experiences in troubleshooting slow applications in a major enterprise, and try to unearth some of the reasons that troubleshooting seems to always be so slow and inefficient.
But let's be clear about something: I'm not discussing application performance management. APM is a huge IT discipline in and of itself, typically involving large, complex tool sets and very specialized skills. APM often involves measuring performance in somewhat unusual ways, such as running an agent on end‐user systems as well as on backend systems to measure the response times between the two. APM is certainly useful, but what's equally useful is being able to do something about applications that are experiencing some kind of performance setback. And perhaps more importantly, being able to do something quickly.
"Quickly" is incredibly important. In a 2008 study entitled Service-Level Management and Application Performance Management, Forrester Consulting found that only two-thirds of application performance issues were resolved in what the business considered an acceptable time period.
Let's start by setting the stage, and looking at the first ways in which the troubleshooting process begins to bend toward massive inefficiency. Normally, you'll hear about a slow application directly from its users, often through your company's Help desk. If you're a fan of IT Service Management (ITSM) practices, this notification becomes an incident; in more generic terms, you might just call it a trouble ticket.
But these incidents aren't as cut-and-dried as "my computer blue screened" or "the printer isn't working"; users are seldom able to be more specific than "my application is suddenly running really slowly" or "my application stopped responding."
Ideally, your first‐tier support folks will try to verify the problem, and may be able to perform preliminary troubleshooting steps: Does the problem affect just a single user or are multiple users impacted? Does rebooting the user's client computer or closing and reopening the application help resolve the problem? Are other applications on the same computer affected? Is the problem limited to a single remote office or to a single department within the company?
Sadly, that's usually where the first‐tier team runs out of options. Unless an application or computer restart solves the problem, all the first tier can usually do is verify that the problem is occurring and perhaps try to define the scope of the problem. Because applications tend to be both complex and, in most cases, not specifically designed for performance troubleshooting, the first tier often has no choice but to toss the problem up the chain of command, and that's where the first major inefficiency kicks in: Who, exactly, should get notified of the problem?
Your first‐tier support might initially notify a software developer, but the problem almost always escalates from there. Think about all the technology disciplines involved in a modern application: client application developers, component developers, database administrators, database developers, and more. In addition, applications rely on many different shared infrastructure elements, such as the network, domain name services, routers and switches, and more. So before long, an entire army of high‐end technical experts are examining the problem: developers, database administrators, network administrators, infrastructure administrators, and more.
I recall one troublesome application we had when I was working for a large telecommunications company in the US. One day, everyone seemed to think the application was running a bit slow, but we couldn't really pin down anything. By the end of the next week, the application was positively crawling, and managers throughout the company were screaming for support. Our Help desk was at a complete loss: They'd actually gone above and beyond their normal responsibilities and scheduled server reboots, router reboots, and more, but nothing they did seemed to be helping. So they finally kicked the problem upstairs.
We could have thrown a pretty good party with the number of people who became involved at that point: I was in charge of the network infrastructure, although I had specialists in charge of much of the infrastructure equipment, like our routers and switches. One of the software developers started looking at the application itself, and another started looking at its various subcomponents. Our Unix admin started examining the Unix‐based firewall and DNS services. A database administrator started poring over the SQL Server, while another developer started poking around the server‐side code within the database. One of our desktop support specialists started looking at the client computers to see if maybe there was some kind of resource conflict there. Our managers, of course, hovered over everything and got in the way. Nine or ten people, taken away from extremely important projects, all to fight a single fire.
Everyone on the team starts looking at the slow application with their own particular tools:

- The software developers run code profilers against the application's own code
- The database administrators fire up database profilers and query analyzers
- The network administrators break out network and protocol analyzers
- The infrastructure administrators watch server performance counters, DNS, and firewall logs
- The desktop support specialists examine the client computers for resource conflicts
Pay special attention to that list because you're looking at the exact reason why troubleshooting a slow application is so inefficient. I'll get to the specifics in a bit, but let's stick with the chronology and talk about what happens next: Someone takes a guess.
Maybe the software developer has a tool, like the one shown in Figure 1, that breaks down application performance within the application's code.
Figure 1: Analyzing application code performance.
The developer says, "aha, this frmOrderInquiry method is taking a long time and generating a lot of network activity—this must be it." The developer makes some tweaks to the code, runs it through his performance tool again, and sees an improvement. "I've found the problem!" he announces to the team. Everyone sighs in relief, the developer compiles a patch, and it's deployed to a few client computers (usually the ones with the users who've been complaining the loudest).
And then the bad news comes in: It didn't fix it. In my experience, the first guess at why the application is running slowly is wrong about 40% of the time. So the team groans, drops what they're doing, and jumps back on the problem. They drag out their tools and start looking at their own little domains until someone comes up with a better guess.
The root cause of this troubleshooting inefficiency is that modern applications are really, really complicated. Even seemingly-simple applications can, in reality, be incredibly complex. Think about the average custom, in-house application:

- A client application running on users' computers
- Server-side components implementing the business logic
- A back-end database, often with its own server-side code such as stored procedures
- Shared infrastructure: the network itself, routers and switches, domain name services, firewalls, and more
And aside from all these individual components, you've got the single complex entity called an application, which from a holistic viewpoint may exhibit what I call synergistic performance problems. In other words, sometimes a few well‐oiled components can conspire to create poor overall performance. You can't point to any one component as the cause of slow performance, but taken together, the various components do, in fact, run slowly. Many of these components may only run slowly some of the time, which makes troubleshooting even more difficult—problems that can't be repeated consistently are much more difficult to solve.
When applications are running slowly, users aren't productive. In one of my past jobs, we estimated that each hour of application downtime cost us $45,000 in lost productivity, diverted technical expertise, and more. On average, "my application is running slowly" would take 8 hours to finally resolve, costing the company $360,000. That's just an average, though—I remember one problem that took us over a month to finally sort out—a theoretical cost of millions of dollars. Ouch.
In their study, Forrester found that 43% of businesses estimated downtime as costing at least $10,000 per hour, and 10% cited more than $1 million. Research from Enterprise Management Associates suggests that an industry average of $45,000 per hour is accurate, although they note that it runs higher in transaction-intensive verticals such as financial services, where the cost can climb to millions of dollars per minute.
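The arithmetic behind these downtime estimates is simple: cost equals duration times an hourly rate. A minimal sketch, using the industry-average rate cited above (the incident durations are illustrative):

```python
def downtime_cost(hours, rate_per_hour=45_000):
    """Estimated business cost of an application slowdown or outage."""
    return hours * rate_per_hour

# An average 8-hour incident at the industry-average hourly rate:
print(downtime_cost(8))        # 360000
# A month-long problem (~20 business days x 8 hours):
print(downtime_cost(20 * 8))   # 7200000 -- "millions of dollars"
```

Crude as it is, this is exactly the model most organizations use when they quote a per-hour downtime figure to management.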
The point is that time is money, and you can't afford to keep doing the "look, guess, fix, repeat" cycle. You need to be able to look and to fix—definitively, without the guessing and repeating. So why is it so difficult?
The problem is that first bullet list I showed you: All those IT experts using their individual domain tools. None of them was troubleshooting the application: They were troubleshooting individual bits of it. Sure, eventually one of them will stumble across the answer, but it's horribly inefficient.
When you hear a loud, annoying sound in your house, you don't immediately call in a plumber, HVAC technician, structural engineer, and an exterminator. You take a few minutes to try to narrow down the problem. You listen to the house as a whole, turn the taps on and off, shut the HVAC down, and so forth. When you've narrowed the problem a bit, you call the one specialist who can handle that problem. That's how you should be able to deal with a slow application: The Help desk—your first tier of support for the incident—should be able to direct the problem to the one technology expert who can actually fix it.
So how do we resolve application performance problems more quickly? How can we ensure that "my application is running slowly" isn't met with a barrage of technical expertise that doesn't actually yield results very quickly?
The answer is not to buy more domain‐specific tools. Most organizations already have plenty of tools within each technical "silo"—network analyzers, database profilers, and so forth. They're not helping because each tool only looks at one potential part of the problem. Also, those "silo tools" don't come into play until after your first tier of technical support has tossed the problem up the chain of command. What we really want is a way to make the first tier of technical support more effective because that will also make subsequent tiers more effective and efficient.
And I know what you're thinking: "Our Help desk folks are smart but not that experienced—that's why they're on the Help desk!" We often think that making the Help desk more effective means giving them more tools and teaching them how to use them. That's an education approach, and it's often not practical. First, your first tier of support may well lack the background experience to use those tools. Second, you'd just be giving them the same silo-specific tools you were already using. Moving the tools to a lower tier of support isn't the answer, either. So what is the answer?
The answer is to enable the Help desk, or whatever you call your first tier of support: not to educate them, and not to try to make them as smart as your second- or third-tier experts, but to enable them. Specifically, enable the Help desk to do a better job of verifying slow applications and routing the incident to the correct portion of your second or third tier.
What we need is a tool that can pinpoint exactly which bit of a complex application is causing or contributing to a slowdown. That lets them refer the problem to the experts responsible for that bit: No more looking, guessing, fixing, and repeating. The tool needs to look at the entire application, including all of its dependencies, back‐end components, and other elements. That tool needs to understand the shape of the application and how the various components fit together, and it needs to be able to view their performance characteristics in real time so that it can spot those temporary slowdowns that bother users and frustrate technical experts.
We IT experts have to stop thinking of our applications as a collection of building blocks, and instead think of them as a single entity composed of many different contributing elements all working in tandem. A building engineer doesn't think of girders, concrete, and windows; he thinks about the building. Sure, girders are a part of it, but once welded and riveted together, those girders behave differently than they did as individual units. If the building's exterior starts to show cracks, the engineer has to consider not only the girders but also how they've been attached, how they're affected by the load of the building's walls and contents, and so forth, all as a single, interconnected set of elements. He—that one engineer—has to consider all the elements that make up the building, even if he isn't personally an expert in, say, windows.
That's where we need to take our application troubleshooting. Everyone, from the Help desk on up, needs to be able to see the entire application, not just their individual silos. Everyone who troubleshoots a slow application needs to be aware of every element of that application, how those elements interact, how they fit together, and how they might—as a unit—be exhibiting performance problems that aren't immediately apparent in any individual component taken by itself.
Here's another non‐computer example: Suppose you have a car that's getting really poor gas mileage. You disassemble the car and start testing each component—pretty much exactly how we troubleshoot slow applications today. The engine seems to be running well, the drive train is in good shape, and the tires look to be properly inflated. So where's the problem? Well, maybe you take a guess and rebuild the engine. You reassemble the car and the problem is still there.
Looking at individual elements of a system will almost never reveal the cause of a performance problem because those elements work differently on their own than they do as part of the system. To find the problem with the car, you need to examine those components when assembled as a car, not individually. You need to put little sensors and stuff on each component, and monitor them as the entire car runs. You might find out that the car's chassis is too heavy and that places stress on the drive train, which makes the engine work harder—something you'd never see if you examined individual components by themselves.
To start troubleshooting our applications as systems, rather than as individual components, we need to focus less on our silo‐specific tools such as database profilers and network analyzers. Instead, we need to find tools that can look at the entire application—every element, every component—and all in real time.
The idea of looking at an entire application's performance is hardly novel. One of my first IT jobs was as an AS/400 system operator, and we had a tool that could, in real time, analyze each aspect of a running process' performance. It would tell us if a slow process was slow because of memory, processor time, database responsiveness, and so on. Of course, an AS/400 is a completely self‐contained system, so our troubleshooting tool wasn't terribly complicated.
Modern tools exist that extend the same concept to today's complex, multi‐tier, distributed applications. They understand—or can be configured to understand—that the application may start with a single process running on someone's client computer, but that the "application" also consists of an underlying network, a back‐end database, numerous other processes and components, and so forth. These tools can check the performance of each element in real time and can do so continuously, actually alerting you in advance when an application element's performance strays outside normal parameters.
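The "strays outside normal parameters" idea boils down to learning a baseline for each element and flagging samples that deviate too far from it. Here's a minimal sketch of that concept; the element's response times and the three-standard-deviation threshold are hypothetical, not any particular product's algorithm:

```python
import statistics

def out_of_bounds(baseline, sample, k=3.0):
    """Flag a sample that strays more than k standard deviations
    above an element's learned baseline response time."""
    mean = statistics.mean(baseline)
    stdev = statistics.pstdev(baseline)
    return sample > mean + k * stdev

# Hypothetical response times (ms) observed for a back-end database element:
normal = [42, 45, 40, 44, 43, 41, 46, 44]
print(out_of_bounds(normal, 45))   # False -- within normal parameters
print(out_of_bounds(normal, 120))  # True  -- alert-worthy deviation
```

A real monitoring tool would keep a rolling baseline per element and re-evaluate it continuously, but the alerting decision is essentially this comparison.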
These tools enable a holistic, system‐wide look at the application. They can make it easy to see which bit of the application isn't performing properly, and they do that while looking at the entire application.
There's a problem with some whole‐application monitoring and troubleshooting tools, though: installation and configuration. If the tool requires you to manually explain to it how the various application components fit together, you're less likely to use the tool. You're less likely to keep it updated as the application changes. You're more likely to make mistakes or miss components, giving the tool a lopsided, inaccurate view of the entire system. So with that in mind, I'll propose a sort of "wish list" for what a good, system‐wide application troubleshooting tool might be able to do.
First on the list is automatic. Ideally, a tool should take a few minutes to install, and you should be able to point it at a running application. It should then take over, figuring out on its own what the application looks like and what dependencies that application has. The result is an application map, and it might offer views like the one in Figure 2.
Figure 2: Looking at the entire application.
This view offers a list of applications on the left, listed by name. When reports of slow applications start coming in, the Help desk can select that application from the list, and look at the performance of each component that makes up that application.
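Under the hood, the application map such a tool discovers is essentially a dependency graph built from observed connections. A rough sketch of the idea follows; the component names are hypothetical, and a real tool would discover the edges by watching live traffic rather than being handed a list:

```python
# Hypothetical observed connections: (source component, destination component)
observed = [
    ("client app", "web server"),
    ("web server", "business components"),
    ("business components", "database"),
    ("web server", "dns"),
]

def build_app_map(connections):
    """Fold observed connections into a dependency map:
    component -> set of components it depends on."""
    app_map = {}
    for source, destination in connections:
        app_map.setdefault(source, set()).add(destination)
    return app_map

app_map = build_app_map(observed)
print(sorted(app_map["web server"]))  # ['business components', 'dns']
```

Once that map exists, the tool can attach live performance data to each node, which is what makes the per-component view useful to the Help desk.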
That leads to the second item on the list: intuitive. This tool has to be usable by the Help desk, preferably with as little education as possible. It should call the Help desk's attention directly to slow application components so that the Help desk can simply notify the higher-tier expert responsible for those slow components.
Third on the list is real time. The tool is useless if it can't display the current state of the application's components. When the Help desk opens this tool, it needs to be up to date and ready to go. Those little performance charts should be moving and changing in real time so that the Help desk can easily spot system elements that are performing outside normal parameters.
The result? We always used to feel it took us 80% of our time to find the problem, and only 20% to fix it. A tool like this can cut deeply into that 80% because it eliminates the "look, guess, fix, repeat" cycle. You can focus on what is actually causing the application to slow down, and spend your time fixing it, not guessing.
The fourth item for my wish list is smarts. If a tool can do the first three things, there's no reason it can't proactively alert someone when a system element's performance begins to show signs of a problem.
That takes us from reactive application troubleshooting to more proactive application management—a wonderful thing that any organization will benefit from.
If you can find an application performance troubleshooting tool that automatically discovers and maps your application's various elements, monitors their performance in real time, and presents simple, intuitive views that help highlight performance problems for faster resolution—well, then you're ready to move beyond mere reactive troubleshooting. You're ready to start adding real value to your business, and ready to start preventing slow applications.
Let's face it: Most companies regard IT as overhead. Technology is expensive, the people that support it are expensive, and it seems to need so much support sometimes. No matter how much you spend on IT, something always breaks—which is why so many companies have estimates of how much money an hour of downtime costs them.
But when IT can start anticipating problems, solving them before they become noticeable, and preventing downtime—well, then IT has just become valuable to the business. And with the right toolset, you can do exactly that for slow applications.
The idea is simple: If you can have your toolset notify you when an application starts to exhibit out‐of‐bounds performance, you can immediately jump on the problem. You can identify the specific system elements that are causing the slowdown, fix or mitigate the problem, and keep the application's performance from ever slowing to the point where it significantly impacts your production components.
You may think that you have, or have seen, enterprise monitoring applications that do exactly what I'm talking about: They monitor the health of network services such as email, and they know how to also monitor the dependent elements that support the service. All true, but they are typically neither automatic nor capable of handling a sufficient level of complexity. Let me explain.
The idea of monitoring multiple elements of a system is definitely nothing new; enterprise monitoring tools have been doing so for years. Figure 3, for example, is the type of display many enterprise monitoring consoles might offer, showing various network services—such as an email server—and their dependencies. When a given element exhibits out-of-bounds performance, the system can flag it—and flag anything that depends upon that element. It helps trace the root cause of the problem, and helps direct efforts to the right location to solve the problem more quickly.
Figure 3: Enterprise monitoring tool.
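Flagging "anything that depends upon that element" is a transitive walk over the dependency map: flag every service that directly or indirectly relies on the failed piece. A small sketch of that propagation, using a hypothetical email-service dependency chain like the one just described:

```python
# Hypothetical dependency map: service -> services it depends on
depends_on = {
    "email": ["mail server"],
    "mail server": ["network"],
    "network": ["router"],
}

def affected_by(failed, deps):
    """Return every service that directly or indirectly depends on
    the failed element, so the console can flag it too."""
    flagged = set()
    changed = True
    while changed:
        changed = False
        for service, needs in deps.items():
            if service not in flagged and any(
                n == failed or n in flagged for n in needs
            ):
                flagged.add(service)
                changed = True
    return flagged

print(sorted(affected_by("router", depends_on)))  # ['email', 'mail server', 'network']
```

When the router goes out-of-bounds, everything upstream lights up, which is precisely what points the troubleshooter at the root cause instead of the symptoms.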
But there are a couple of things that these enterprise monitoring systems often lack: First is automation. If you're going to be monitoring an application, rather than just a network server or a network service such as email, you don't want to have to dig through the various application elements and dependencies. That should be done automatically by observing the application to ensure that you're getting all the application's elements.
Second is complexity. An email service may depend on a server, which may depend on a network, which may incorporate a router. Fine—that's within the realm of an enterprise monitoring tool, which can watch each of those elements. But when you dive into an individual application, things get a lot more complex. You've got database calls, multiple sub‐components, and other minutiae. These elements span multiple disciplines—network, software development, database, infrastructure, and more. A special kind of tool is called for, one that is specifically designed to deal with complex applications.
But that special kind of tool can provide some of the same advantages that an enterprise monitoring system would provide: It can understand performance thresholds so that it knows what levels of performance are considered "good," "borderline," and "bad." When real‐time performance monitoring shows elements' performance moving beyond the border of "good," it can send out email or pager alerts. In fact, such an application monitoring tool might even be able to raise alerts into an enterprise monitoring tool, helping keep the entire IT organization informed of a problem.
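The "good," "borderline," and "bad" levels described above amount to a simple per-element classification, with an alert raised on any transition past "good." A minimal sketch; the threshold values, element names, and alert mechanism are hypothetical stand-ins for whatever a real tool would use:

```python
def classify(response_ms, good_below=100, bad_above=500):
    """Classify an element's response time against performance thresholds."""
    if response_ms < good_below:
        return "good"
    if response_ms > bad_above:
        return "bad"
    return "borderline"

def check_element(name, response_ms, alert):
    """Check one element; raise an alert for anything past 'good'."""
    state = classify(response_ms)
    if state != "good":
        # A real tool would send email/pager alerts here, or raise the
        # event into an enterprise monitoring console.
        alert(f"{name} is {state}: {response_ms} ms")
    return state

alerts = []
check_element("database", 80, alerts.append)   # good -- no alert raised
check_element("web tier", 250, alerts.append)  # borderline -- alert raised
print(alerts)  # ['web tier is borderline: 250 ms']
```

The same three-state model is what lets the tool warn you while performance is merely "borderline," before users start calling.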
I'll add one item to my wish list from the previous article: reporting. If your application performance monitor is already collecting performance data in real time, there's no reason it can't store that data and produce management reports—much as an enterprise monitoring solution might produce reports for overall network performance over time. Figure 4 shows what such reports might look like.
Figure 4: Application performance reports.
Again, these help move you from the realm of reactive troubleshooting and into the better world of proactive application management.
Another valuable item for our wish list: load curves. These are designed to show your application's overall performance as workload increases. Your developers may have produced something like this during application development, but in reality, those are just estimates and best guesses. Truly usable load curves come from monitoring application performance, down to the system element level, in a real, production environment. A properly‐done load curve will show you how much workload your application is capable of handling while still exhibiting acceptable performance, and how much workload it can handle before the application grinds to a complete halt. Those curves can help you project application growth, plan for application expansion, and stay ahead of the curve—pun intended—so that the application is always ready to handle whatever the business needs to throw at it.
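A load curve built from production monitoring boils down to response time as a function of workload; from it you can read off the largest workload the application handles while still meeting its response-time target. A minimal sketch with hypothetical measurements:

```python
# Hypothetical production measurements: (requests/sec, avg response time in ms)
load_curve = [(10, 120), (50, 140), (100, 180), (200, 310), (400, 900), (600, 4500)]

def max_acceptable_load(curve, target_ms):
    """Largest measured workload whose response time still meets the target."""
    acceptable = [load for load, ms in curve if ms <= target_ms]
    return max(acceptable) if acceptable else 0

print(max_acceptable_load(load_curve, target_ms=500))   # 200 requests/sec
print(max_acceptable_load(load_curve, target_ms=1000))  # 400 requests/sec
```

Notice how the hypothetical curve flattens out and then spikes: that spike is the "grinds to a complete halt" point, and the gap between your current workload and the acceptable maximum is your growth headroom.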
Most organizations I've worked with are still firmly in what I call the "silo world": When an application slows down, everyone tackles their own independent elements, trying to guess where the problem might be. That approach is largely driven by the silo-centric tools that IT experts rely upon; change the tools, however, and you change the world. With tools that focus on the entire application, rather than on specific elements, the root cause of performance problems jumps right out at you.
That means the first tier of technical support can not only verify a problem but also identify the specific silo needed to fix the problem. That silo can be engaged to solve the problem—letting everyone else in the IT team continue working on their own valuable projects. There's no guessing; the right experts can move on to fixing the problem.
With that kind of capability in place, you not only significantly reduce application downtime but also enable a more proactive style of application management—one that offers real additional value to the business as a whole. Start seeing problems before they result in downtime. Start fixing those problems before anyone notices. Start managing the application over the long term with performance statistics, reports, and load curves—staying in front of the business, rather than running behind, trying to catch up.
It all comes from having the right tools: Ones that look at the entire application, automatically discovering the individual elements of a system and intuitively displaying performance in real time. That's efficient troubleshooting.