Developing and Building APM Visualizations

It's 10:22p and TicketsRus.com COO Dan Bishop finds himself at yet another hotel bar, celebrating the end of a successful day at the National Ticket Sellers Conference. At a conference full of business executives like himself, Dan has been spending the week learning about new technologies and tactics for servicing his customers.

One of those new tactics he now cradles in his hand as he orders a second drink. Through his PDA's Web browser, Dan is showing a fellow conference attendee some of the visualizations from his new APM solution.

"So, here's our internal Web site," Dan explains, "On this site, I can take a look at the rate of incoming orders. I can see which events are tracking to expectations and which ones might need a little extra help in getting sold. Over here, I can track revenues on a per-month basis, per-day, or even down to the individual hour."

The other attendee is Lee Mitchell, CEO of Dan's closest competitor and a long-time personal friend. Lee leans in close to peer at the PDA's not-entirely-tiny screen. On that screen are metrics he's used to seeing in his own systems. But something is different, and he can't put his finger on it.

Lee counters, "That's great, Dan. But we've been able to pull these metrics for years. I've got something fairly similar back at the office, although I'll admit that pulling it up on the PDA nets you extra 'gee-whiz' points. My guys have built something like this that pulls reports right out of our accounting system to get me the same kinds of information."

Dan smiles because this is exactly the road he's been leading Lee down for the past 20 minutes. His old system was a lot like Lee's in that he could pull metrics. But those metrics always needed to be pulled. Once a report was generated, it was only a static representation from a single system. With his APM solution, everything is real time and integrates not only with the accounting system but also with his entire IT infrastructure.

"A-ha!" proclaims Dan with a grin, "I know what you're talking about, and that's exactly where we were about 12 months ago. Now, take a look at this…"

Dan clicks on a few places on the screen and switches the view to a geographic representation of his multiple data centers. Each data center shows a stoplight chart—green, red, or yellow—displaying each data center's representation of overall health. Dan continues, "In this view, I can see which of my data centers are meeting which parts of their SLAs, and which aren't servicing my customers."

He clicks to drill the view into the metrics for his data center in Rochester, New York, which curiously shows a yellow condition. Seeking more details, he discovers that his Rochester ISP is experiencing a problem that impacts his bandwidth to the Internet.

"So, here you can see that we've got an issue in Rochester," Dan continues. "Some Internet device is probably having a problem, which means that fewer people are connecting through that point of presence."

Lee scratches his head, "I'm still not seeing the a-ha moment here, Dan. I've got this kind of data as well. My crew would be looking at this in the NOC right now if we were having a problem, which, come to think of it, we might in Rochester if you are too."

Dan chuckles, "Here's the a-ha. All of this data is being gathered from all of my systems, crunched through some service quality as well as business logic, and presented to me all at once. Want to see something really impressive?" Dan clicks a few more links on the page, "This screen aggregates my revenue impact data with that system performance data. It tells me exactly how many users are impacted by the Rochester situation, how much money I'm losing, and where my data center manager needs to send teams to fix the problem. The entire infrastructure is completely visible, right here through Web pages I can access on my PDA.

"It gets better. Even my developers use it to trace specific lines of code that aren't working correctly. Everyone from the techies to my aging brain gets the visualizations they need," Dan stops as the formerly-yellow light turns green, "Hey, looks like they've fixed the problem!"

Lee's eyes widen as he realizes the complete vision such a system brings, "Alright, you win. Drinks tonight are on me. Now, tell me more about this system."

Visualizations Are the Core of APM

If pictures tell you more than a thousand words, this chapter is one not to miss. This guide's growing explanation of APM has introduced each new topic with an end goal in mind. That end goal, both for this guide and for APM in general, is to gather the data needed to create a set of visualizations that are useful to the business.

It is the word "useful" that is most important in the previous sentence. "Useful" in this context means that the visualization is providing the right data to the right person. "Useful" also means providing that data in a way that makes sense for and provides value to its consumer.

The concept of digestibility was first introduced in this book's companion, The Definitive Guide to Business Service Management. In both guides, the digestibility of data relates to the ways in which it can be usefully presented to various classes of users. For example, data that may be valuable to a developer is not likely to have the same value for Dan the COO. Dan's role might care less about the failure of an individual network component compared with how that component impacts the system's customers. Each person in the business has a role to fill, and as such, different views of data are necessary.

Yet what's interesting is that each of these different types of data must still be gathered to be useful. Lee's solution gathers data from only a single location, namely his accounting system. As a result, his reports can only be as good as the data available from that system. If, for example, a network problem occurs in a data center, Lee's accounting system can't factor the problem into its reports. Consequently, Lee's data isn't fully representative of the actual conditions "on the ground" across his entire customer services infrastructure. You'll find that this integration of business metrics into traditional monitoring represents a key way in which APM impacts business decisions.

Even when dollars and cents calculations aren't part of an APM Web page, visualizations assist the business in other ways: They provide a mechanism for finding faults in the environment. They enable traceability from the initial discovery of the fault down to its actual root cause. They also enable an otherwise‐impossible glimpse at the medium‐ and long‐term health of the system, displaying hard‐number metrics that report on the quality of service being delivered to customers.

In this chapter, you'll step through a series of mock‐up visualizations that illuminate these situations and others. The goal here is to show you smart ways in which visualizations can be generated out of the monitoring integrations we've laid into place in previous chapters.

This chapter's pictures show how Web‐based dashboards and their visualizations bring value to raw data. Being Web‐based, these dashboards are designed to show a large amount of data at a single glance. As such, they can appear very small in print. This is done intentionally to illustrate how much data can be consolidated into a traditional browser‐based view.

For our purposes, the design of the visualization and the ways in which it represents its data is more valuable than the actual data.

Useful Visualizations for Every Data Consumer

No visualization is effective unless it is created first with its consumer in mind. If that consumer can't digest what's being presented to them, the information being displayed is valueless. Think about the types of consumers who in your business today might benefit from the data an APM solution can gather:

  • Service desk employees and administrators gain troubleshooting assistance and an improved view into systems health.
  • IT managers are assisted in positioning troubleshooting resources to the most crucial problems as well as in planning for expansion based on identified problem domains.
  • Business executives gain a financial perspective and better quality data that is formatted specifically for their needs.
  • Developers are able to dig into specific areas where code is non‐optimized or requires updating.
  • End users are proactively notified when problems occur, maintaining their satisfaction with your services.

Each of these individuals benefits in some way from higher-quality information. The next section will look at each of these data consumers in detail with an eye towards the types of visualizations that are digestible to each. The first and most obvious groups are service desk employees and administrators, as they represent a class of data consumer who needs to know first when applications or application components break.

Service Desk Employees

Figure 7.1 shows one example of data that can be useful to this class of consumer. Here, a stoplight visualization has been created that shows a number of top‐level applications. These top‐level systems are represented both by period of time as well as by end user. In this visualization, the value for the system changes from green to yellow or red when the application is not meeting its expected levels of service.

Figure 7.1: A top-level stoplight chart that alerts when systems violate established metrics for service quality.

A visualization like this is useful for service desk employees as well as administrators because it answers the top‐level question "Are you functioning?" If all the lights are green in this graphic, administrators and service desk employees can be assured that the system is and has been functioning to expectations.

In contrast, when any of the visualization's cells changes, it can be assumed that some change has also occurred in the application. Both "when" and "where" questions are answered at the same time, with the top representation showing the location and count of affected users and the bottom showing how long the problem has occurred.

In this example, the second line, associated with the online banking system, began experiencing a yellow condition at roughly 4:00a, which escalated to a red condition at around 6:00a. Not all users are impacted equally, with EMEA showing a greater number of impacted users than other regions.
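To make the color logic concrete, the following is a minimal sketch of how a single cell in such a chart might be colored. The function name `stoplight`, the use of response time as the metric, and both threshold values are illustrative assumptions; an actual APM product will define service quality in its own way.

```python
def stoplight(response_ms: float, warn_ms: float = 2000.0, fail_ms: float = 5000.0) -> str:
    """Map a measured response time to a stoplight color.

    warn_ms and fail_ms are hypothetical service-level thresholds.
    """
    if response_ms >= fail_ms:
        return "red"
    if response_ms >= warn_ms:
        return "yellow"
    return "green"


# Hypothetical hourly samples for one application, in milliseconds.
samples = {"03:00": 850.0, "04:00": 2600.0, "06:00": 7200.0}

# One row of the chart: a color for each period of time.
row = {hour: stoplight(ms) for hour, ms in samples.items()}
print(row)
```

With these made-up samples, the row moves from green to yellow to red over time, which is the same progression described for the online banking system.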

The Data Needs Are Enormous

You might be wondering why this guide's APM discussion has waited until this chapter to start its conversation on visualizations. With the visualizations being APM's true value to the environment, delaying this discussion until Chapter 7 appears outwardly counterproductive.

Yet, consider the amount of data that is required in order to even generate a visualization like the one shown in Figure 7.1. Health metrics for each and every server, network component, user experience element, and application analytic must be identified, monitored, configured, and tailored before a top-level visualization like this could ever be created. The tasks to get to this point require no small amount of effort, with the data gathering and calculating requirements equally comprehensive.

APM solutions are uniquely capable of creating what appear to be simple graphics because of the sheer magnitude of their instrumentation. Simply put, underneath this simple graphic are dozens, if not hundreds, of individual calculations that occur in real time to determine when a green light turns red.

The information here is only the start of an enlightened service desk's triaging process. You'll see in Figure 7.1 that the problem relates in some way to the online banking system. Although that information is useful for knowing that a problem exists, it does nothing to help troubleshooting administrators track down where it exists. The service desk needs more detail that it can use in reporting the issue to those teams.

That information comes through an APM solution's drill‐down visualizations. In Figure 7.2, the visualization for the online banking system has been drilled down to view a few of the different technology elements that enable its functionality. Here, servers, databases, the network infrastructure, and software elements are all shown with a slightly‐greater level of granularity over the very simple graphic shown in Figure 7.1. The additional bits of information provided here help the service desk identify that the problem is likely due to a server fault, helping them identify which group of individuals may be best suited to resolve the situation.

Figure 7.2: Service quality details associated with technology elements.

Yet this level of data is still not something that is useful for a troubleshooting administrator. At this point, the presence and domain of the problem have been identified, but its location within that domain remains unrecognized. In order to determine that information, even deeper monitoring integrations are required.

Remember that an APM solution gathers its metrics from multiple sources. Those sources can be the instrumentation within the applications themselves, they can come from various network components and probe devices, or they can come from the actual server metrics themselves.

Figure 7.3 shows how the information in Figure 7.2 can be expanded further to view the actual server monitor that initially tripped the alarm. Here, red lights are seen for the online banking system, as originally seen in Figure 7.1, and for that system's infrastructure elements. One of those elements is the Web server at 10.4.224.42. Drilling down into the details of that element reveals that the server is experiencing a CPU overuse condition.

Figure 7.3: Tracing a problem condition through a service tree.

With this information in hand, the service desk can now transfer ownership of the problem to the correct set of administrators for its resolution.
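The drill-down just described can be thought of as a walk through a service tree in which each parent shows the worst status found beneath it. The sketch below is one hedged way to model that idea. The `Node` class, the worst-of roll-up rule, and every element name other than the online banking system and the Web server at 10.4.224.42 are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List

# Higher number means worse status.
SEVERITY = {"green": 0, "yellow": 1, "red": 2}


@dataclass
class Node:
    name: str
    status: str = "green"  # status measured directly on this element
    children: List["Node"] = field(default_factory=list)

    def rolled_up(self) -> str:
        """Worst status found on this node or anywhere beneath it."""
        statuses = [self.status] + [c.rolled_up() for c in self.children]
        return max(statuses, key=SEVERITY.__getitem__)


def trace(node: Node) -> List[str]:
    """Follow the worst-status branch from the root down to a leaf.

    The path is only meaningful when the root's rolled-up status is not green.
    """
    path = [node.name]
    while node.children:
        node = max(node.children, key=lambda c: SEVERITY[c.rolled_up()])
        path.append(node.name)
    return path


# Hypothetical service tree for the example in the text.
tree = Node("Online banking", children=[
    Node("Network"),
    Node("Databases"),
    Node("Servers", children=[
        Node("Web server 10.4.224.42", children=[
            Node("Memory"),
            Node("CPU utilization", status="red"),
        ]),
    ]),
])

print(tree.rolled_up())
print(" > ".join(trace(tree)))
```

A single red leaf is enough to turn the top-level light red, and the trace leads from the application down to the monitor that tripped the alarm.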

Administrators

Top-level visualizations like the ones previously shown are useful for an IT environment's first responders. Once an alert associated with a problem system has been raised, administrators can be notified to track that problem to its root cause. This step is where APM most clearly streamlines the triaging process.

But triage and resolution are two different things. Recognizing that a CPU overuse condition has occurred on a server does nothing to assist in bringing that issue to resolution. That task lies with the business' administrators, who must first identify its root cause, a process that can also be assisted through APM visualizations.

First, consider a fully-unmonitored environment. In such an environment, the root cause identification activity tends to consume the largest part of the troubleshooting process. This is the case because a fault in a system often doesn't manifest directly into something that is observable by an administrator. Tracing the recognized issue to the actual problem requires skill and experience, and often a bit of luck or trial and error. Alternatively, it can be accomplished through a data-driven approach.

Figure 7.4 shows an example of an administrator drill‐down that analyzes multiple perspectives of the business system at the same time. In the top perspective is data relating to the system's front‐end performance, with Web server and application server metrics being displayed in the middle and lower sections.

Figure 7.4: Using visualizations to trace a fault across multiple tiers.

For each system component in this graphic, multiple types of data are presented. The front-end server's count and rate of transactions are specified, along with the C-N-S Spread for those transactions. Web server and application server transaction details are also aligned, providing, as before, a single glimpse of system health across each element.

For this example, assume that metrics were laid into place prior to the fault. These metrics quantified the acceptable and unacceptable behaviors across each monitored system component. For example, the metric for HTTP server errors might have been configured to alert when the count of errors grew beyond zero. As a result, a troubleshooting administrator can quickly identify by the red-colored columns that a greater-than-acceptable quantity of HTTP errors is occurring.

Further, the amount of C‐N‐S "Server time" spent by both front‐end and Web servers is greater than expected. The combination of these two pieces of information helps the administrator further track down the possible CPU overuse condition.
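The preconfigured metrics described above amount to a set of limits that observed values are checked against. Here is a minimal sketch of that check. The metric names, the 400 ms server-time limit, and the observed values are all hypothetical; only the "alert when HTTP errors grow beyond zero" rule comes from the example in the text.

```python
# Hypothetical upper limits; a value above its limit is flagged.
LIMITS = {
    "http_server_errors": 0,  # alert when the error count grows beyond zero
    "server_time_ms": 400,    # assumed acceptable C-N-S "Server time"
}


def violations(observed: dict) -> dict:
    """Return the metrics whose observed value exceeds its configured limit."""
    return {
        name: value
        for name, value in observed.items()
        if name in LIMITS and value > LIMITS[name]
    }


# Made-up readings for one Web server; "transactions" has no limit configured.
web_server = {"http_server_errors": 3, "server_time_ms": 950, "transactions": 1200}
print(violations(web_server))
```

Metrics returned by `violations` are the ones a dashboard would color red, which is how two separate readings can point an administrator toward the same underlying condition.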

It is important to recognize that any APM solution is equipped with a dashboard designer. This designer enables these visualizations to be modified as necessary to suit the needs of the consumer. Thus, if your business needs a view like the one in Figure 7.2 but with different data, it is possible to create a slightly different view.

IT Management

Directing these IT personnel in a cohesive manner is another activity that can be an ad hoc exercise without the right data. Consider the situation when multiple problems occur at the same time in an IT infrastructure, a situation that isn't terribly uncommon. When multiple parts of a large and complex system experience problems at once, directing personnel to resolutions that have the greatest impact is exceptionally important.

You could, for example, send out a team of administrators to fix a database problem when that team's time might be better spent fixing a simultaneous email server problem. As systems grow in complexity, determining the right way to provision your human resources can be as problematic as running the system itself.

An alternative way to handle the provisioning of resources to problems uses the same data-driven approach as the previous example. Rather than making educated guesses on which problems impact which users, APM can build this information out of the data it gathers from your system components. Figure 7.5 shows how this information might look in a resulting visualization.

Figure 7.5: Affected users by system component.

The module shown here is one piece of a larger visualization, with the availability and service quality information for each application not displayed. Figure 7.5 does, however, show a list of applications that may or may not be in a degraded state. For each, the count of affected and total users is displayed for those which are experiencing a problem.

Graphics like this one quickly assist IT management with directing troubleshooting resources to the applications with the largest impact on operations. Here, the online banking system's outage impacts more than 50% of its total users, making it a greater priority than the email system (or any other outage) for resolution.
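The prioritization described here can be sketched as a simple ranking by share of affected users. All user counts below are invented for illustration, as is the `impact` helper; ranking by percentage rather than by absolute user count is also an assumption, chosen because the text compares outages by the fraction of users affected.

```python
# Hypothetical affected and total user counts per application.
apps = [
    {"name": "Online banking", "affected": 5200, "total": 9000},
    {"name": "Email", "affected": 300, "total": 4000},
    {"name": "Trading", "affected": 0, "total": 2500},
]


def impact(app: dict) -> float:
    """Fraction of an application's users currently affected."""
    return app["affected"] / app["total"]


# Only degraded applications need a team; worst impact comes first.
degraded = [a for a in apps if a["affected"] > 0]
ranked = sorted(degraded, key=impact, reverse=True)

for app in ranked:
    print(f"{app['name']}: {impact(app):.0%} of users affected")
```

A business with very differently sized user populations might prefer to rank by absolute count, or by revenue per user, instead.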

Visualizations also help IT management with the planning and budgeting aspects of their job. In this case, historical data can be used to create visualizations that document where IT is spending the majority of its time. In Figure 7.6, a Pareto chart has been created to document the number of outages over a period of time for a set of business services. Pareto charts are used to highlight the most important issues among a set of potential issues. The bar for each business service documents the number of issues for that service, while the line graph shows the cumulative frequency of occurrence.

Figure 7.6: A Pareto chart shows a historical breakdown of outages.

In this case, a historical Pareto chart gives IT management the data it needs to identify where the majority of issues occur within an environment. In this example, Trading, Citrix, and Credit Services are the three services with the most issues over the measured period of time. Because these three services are experiencing the highest count of issues, they make excellent low-hanging fruit for expansion or re-architecture activities.
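The data behind a Pareto chart is straightforward to compute: sort the services by issue count, then accumulate a running percentage. The sketch below assumes made-up outage counts, and the Email and Payroll services are hypothetical additions; only the Trading, Citrix, and Credit Services names come from the example.

```python
# Hypothetical outage counts over the measured period.
outages = {"Citrix": 11, "Email": 4, "Trading": 14, "Payroll": 2, "Credit Services": 9}


def pareto(counts: dict) -> list:
    """Return (service, count, cumulative percent) rows, largest count first."""
    total = sum(counts.values())
    running = 0
    rows = []
    for name, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        running += n
        rows.append((name, n, round(100 * running / total, 1)))
    return rows


rows = pareto(outages)
for name, n, cumulative in rows:
    print(f"{name:<16}{n:>3}  {cumulative:>5}%")
```

The count column corresponds to the bars and the cumulative column to the line. With these invented numbers, the first three services account for 85% of all outages.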

Business Executives

Some situations can arise that are not technical in nature and as such are the purview of business executives. Perhaps an Internet connection from a particular service provider experiences a problem that is caused by the actions of the service provider itself. There is no technical problem with the connection; it is merely not meeting its Service Level Agreement (SLA) obligations.

Traditional monitoring solutions might overlook these types of situations due to their heavy focus on technical metrics. Yet an APM solution's widespread reach across systems, networks, and even external connections can identify when executive‐level support is necessary for solving what ends up being a contractual problem. Figure 7.7 shows a sample dashboard module that merges business contract logic with availability information to alert when SLA conditions are in breach of contract.

Figure 7.7: SLA fulfillment that is measured with actual data from monitoring instrumentation.
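Merging contract logic with availability information can be as simple as comparing measured availability against the contracted target. This is a minimal sketch under assumed terms: a 30-day measurement window and a 99.9% availability target, neither of which comes from the text. Real SLAs often carry exclusions, such as planned maintenance, that are ignored here.

```python
def availability_pct(total_minutes: int, downtime_minutes: int) -> float:
    """Percentage of the measurement window during which the service was up."""
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


def sla_breached(total_minutes: int, downtime_minutes: int, target_pct: float) -> bool:
    """True when measured availability falls below the contracted target."""
    return availability_pct(total_minutes, downtime_minutes) < target_pct


MONTH_MINUTES = 30 * 24 * 60  # assumed 30-day measurement window
TARGET_PCT = 99.9             # hypothetical contracted availability

for downtime in (30, 90):
    pct = availability_pct(MONTH_MINUTES, downtime)
    status = "BREACH" if sla_breached(MONTH_MINUTES, downtime, TARGET_PCT) else "ok"
    print(f"{downtime} min down: {pct:.3f}% available ({status})")
```

At a 99.9% target, a 30-day window tolerates roughly 43 minutes of downtime, so the 30-minute case passes and the 90-minute case is flagged.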

Even during nominal activities, business executives struggle with the need for information that they often have no capability to understand. In our chapter example, Dan the COO might not understand what a network router is when it fails, but he absolutely needs to be notified when that router failure causes an impact to his business operations.

It is this dissonance between the data that executives can get and the data they need that is a primary motivator for APM incorporation. APM solutions, most especially when they are installed as part of a much larger BSM solution, enable executives to view information that they can digest and that they truly care about.

Figure 7.8 shows what could be the most simplistic visualization for a business executive, presenting the instantaneous performance of the IT environment in a dial format. It shows very simply an aggregate percentage of how well that business executive's services are meeting the needs of their customers. The availability of the overarching service itself (via an internal perspective) as well as the availability of the service to its users (via an external user perspective) are noted in these twin graphics.

Figure 7.8: A simple dial module that represents availability metrics.

Yet these two graphics only show the performance as it occurs at the moment it is read. To get an executive‐level view of service history over a medium‐term period of time, another module is commonly used similar to that shown in Figure 7.9. This graphic extends the instantaneous representation of availability over a configurable period of time. Notice how the easy‐to‐read format draws the eyes to situations that most business executives want to prevent.

Figure 7.9: Another module that displays service quality over a period of time.

Dial and bar chart modules like these are commonly used as components of much larger dashboards. In Figure 7.10, they and a number of other high‐level modules are consolidated to create a single‐glimpse view that is useful for the business executive. This dashboard includes visualizations that show service and user availability alongside service quality and impact metrics for the business system's various components. Notice how each module presents its information in a slightly different way.

Figure 7.10: An executive-level dashboard that contains multiple visualization modules.

Dashboards like these provide information that gives business executives the confidence that their systems are meeting the needs of their customers. With information that is updated in real time, the business executive can reduce their need for operational status reports from each component owner or manager. The result is that executives can spend more time on value‐added activities while reducing the level of attention necessary to daily operations.

Executive-Level Dashboards Needn't Be Detail Free

It is important at this point to mention that executive‐level dashboards shouldn't necessarily obscure the details. Dashboards and their visualizations are by definition designed to be drill‐down capable. This means that the top‐level view within an executive‐level dashboard can include the basic stoplight‐style charts seen earlier. At the same time, more detailed information can be enabled through clickable elements on that dashboard.

This ability to reposition the executive's perspective at every layer helps to create a more‐educated executive while at the same time assuring good quality data in visualizations.

Code Developers

With business executives requiring a very high‐level view of the environment, their polar opposite is your group of code developers. This group requires an extremely detailed view of the individual functions being run within a system, broken down into detailed explanations of inter‐device conversations and transaction data. You can argue that the data needs of this group go even further than what is needed by systems administrators, because code developers actually create and manage the code that creates your business system.

To that end, businesses that consider an APM solution must look carefully at the capabilities that such a system can provide for this group of individuals. For example, traditional monitoring solutions tend to suffer from the "shrink-wrap support" phenomenon. Here, a monitoring solution very openly offers support for many common technology products and platforms, such as specific databases, middleware applications, or network devices. But your business service is likely composed of as much custom code as off-the-shelf applications. Thus, the ability to drill into the specifics behind the inter-application communication is as important as the applications themselves.

For example, consider our previous situation where a Web server was experiencing a CPU problem. Knowing that that Web server was experiencing an increase in CPU utilization is less valuable than recognizing the exact Web page or code method that hasn't been processor‐optimized. An effective APM solution should have the capability to peer directly into database transactions to find such optimizations and present them via visualizations to your developers.

Such a deconstruction is shown in Figure 7.11. Here, a very simple table has been generated that contains details about the performance of the front‐end Web site. The specifics here relate to a series of end user operations and their effective performance. Transactions associated with login pages, search and view policies, search processes, and logout operations are aligned with their effective rate of performance.

Figure 7.11: User experience monitoring for a front-end application.

As before, the acceptable transaction rates for each of these activities have been preconfigured within the APM solution's logic. As a result, in this image, a developer can quickly see that each of the measured activities is performing to desired expectations.

Charts, graphs, and tables are never enough with this group of data consumers, as their role is to always look for areas in which to improve the application. As such, even when an application is performing to desired specifications, there are always places in which database queries can be further optimized, Web pages can be accelerated, and applications can be given more power to accomplish their jobs. Code developers are also charged with continuing this process even as changes are requested to their applications, whether those changes be updates to Web sites or deep‐level code updates to support new lines of business.

One central issue with this dynamic rate of change with many business services is in measuring their effective performance over time. Today's slowdown in performance might be related to a run on a new product with hundreds or thousands of new customers coming in as new business. It could also be related to a bug fix that was implemented only to discover that the fix caused more damage than improvement to the system. To that end, multi‐view visualizations like that shown in Figure 7.12 provide yet another way in which multiple APM monitoring integrations can be tied together in a time‐oriented way to track down performance issues.

Figure 7.12: A multi-view of application performance.

In Figure 7.12, three different views of a business service are gathered on a single pane of glass. The top view shows the rate or volume of pages that are being rendered over a 24-hour period of time. The load time of those pages over that same period is shown alongside, along with an overall representation of application performance.

By aggregating each of these views over an equivalent period of time, a code developer can quickly identify where correlations occur between different system activities. In this example, it is easy to see that between the hours of 8:00a and 12:00p and again between 1:00p and 4:00p there is a substantial spike in the volume of Web site pages being requested by clients. This volume changes from nearly zero to over a hundred Web pages being rendered per minute by the Web server. At the same time, however, the load time for these Web pages remains relatively steady. The application performance index of that Web server also remains consistent over the monitored period.

With these three graphs aligned, a code developer can quickly determine that there appears to be no correlation between the volume of rendered pages and their effective load time (for the volume of pages that were monitored). Such a developer can then be reasonably assured that the volume of pages being rendered is not impacting CPU performance, and that neither code optimization nor a hardware expansion is needed on that account.
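The visual comparison of aligned graphs can also be backed by a number. One common choice is the Pearson correlation coefficient, sketched below with invented samples of page volume and load time. The text does not say that any APM product computes this statistic; it is offered only as one way to quantify what the eye sees.

```python
from math import sqrt


def pearson(xs: list, ys: list) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical samples taken at the same moments across the day.
pages_per_minute = [2, 5, 110, 120, 115, 10, 105, 118, 4]
load_time_s = [1.2, 1.3, 1.2, 1.3, 1.1, 1.2, 1.3, 1.2, 1.2]

r = pearson(pages_per_minute, load_time_s)
print(f"correlation between volume and load time: {r:+.2f}")
```

A coefficient near zero suggests no linear relationship within the sampled range. It says nothing about volumes higher than those observed, which is the same caveat the text attaches to the monitored volume of pages.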

If, however, the developer does find that some issue with the actual code is causing the problem, alternate visualizations can be brought to bear to break down the processing of that Web page into its disparate elements. As was first discussed in Chapter 3, any Web page is rendered as the sum of a large number of individual parts. Those parts can gather their data from internal databases or file structures, or can rely on external sources for data. As such, when one of those parts or external sources is not performing to the level needed by the Web server, the result is a reduction in performance.

Breaking down those transactions can be accomplished through a chart that looks similar to Figure 7.13. In this chart, the individual parts that make up a Web page—graphics, HTML code, scripts, and so on—are deconstructed by filename. Each file is rendered as part of the Web page in a particular order, with some components overlapping. With the quantity of elapsed time shown on the bottom, it becomes very easy for a Web developer to see where delays in page rendering are impacting the overall perception of Web server performance.

Figure 7.13: A transaction breakdown chart for rendering a Web page.
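A transaction breakdown like this one is built from a start and end time for each file. The sketch below assumes hypothetical file names and timings in milliseconds, and renders a rough text version of the chart so that overlapping components and the longest delay stand out.

```python
# Hypothetical (file name, start ms, end ms) timings for one page load.
components = [
    ("index.html", 0, 180),
    ("style.css", 180, 260),
    ("app.js", 180, 420),
    ("banner.png", 200, 1450),
    ("tracker.js", 420, 610),
]

# Total page time runs from the earliest start to the latest end.
page_time = max(end for _, _, end in components) - min(start for _, start, _ in components)

# The component with the longest duration is the first place to look.
slowest = max(components, key=lambda c: c[2] - c[1])


def waterfall(parts: list, ms_per_char: int = 50) -> list:
    """Render each component as an offset bar, one text line per file."""
    lines = []
    for name, start, end in parts:
        offset = " " * (start // ms_per_char)
        bar = "#" * max(1, (end - start) // ms_per_char)
        lines.append(f"{name:<12}{offset}{bar}")
    return lines


print("\n".join(waterfall(components)))
print(f"page time: {page_time} ms, slowest: {slowest[0]}")
```

In this made-up case, a single image accounts for most of the page's load time even though every other file finishes quickly.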

Remember that each individual Web page is the sum of its parts. With each one requiring different parts for its complete execution, some pages can perform well while others experience unacceptable load times. The problem in the unmonitored environment is in tracking down which are the problem pages in and among those that are performing well. To the unaided eye, tracking down one problem Web page can be an extremely time-consuming process.

An APM solution can leverage its end user experience monitoring to keep records of page performance on all pages at the same time. Aligning those pages to an index of performance creates a table similar to Figure 7.14. Here, dozens of pages or more can be ranked by their performance against each other. Pages that experience the highest levels of performance are given a rating of one, with all other pages given a decimal rating below that number. Pages with the lowest ratings are experiencing the worst performance, and as such, require the greatest amount of attention by developers.

Figure 7.14: Measuring the Application Performance Index across multiple Web pages at once.
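The text does not state how its index is calculated. One widely used formula that produces a score between zero and one is Apdex, which counts responses at or under a target time as satisfied, responses up to four times the target as tolerating, and scores the page as (satisfied + tolerating / 2) / total. The sketch below assumes that formula, a two-second target, and invented page names and load times.

```python
def apdex(load_times_s: list, target_s: float = 2.0) -> float:
    """Apdex score: 1.0 when every sample meets the target, lower otherwise."""
    satisfied = sum(1 for t in load_times_s if t <= target_s)
    tolerating = sum(1 for t in load_times_s if target_s < t <= 4 * target_s)
    return (satisfied + tolerating / 2) / len(load_times_s)


# Hypothetical load-time samples, in seconds, for three pages.
pages = {
    "/login": [0.8, 1.1, 0.9, 1.4],
    "/search": [1.9, 2.5, 3.0, 9.5],
    "/checkout": [6.0, 7.5, 9.0, 12.0],
}

scores = {page: apdex(times) for page, times in pages.items()}

# Lowest-scoring pages need the most developer attention, so list them first.
worst_first = sorted(scores, key=scores.get)
for page in worst_first:
    print(f"{page:<10} {scores[page]:.2f}")
```

Sorting from the lowest score upward gives developers the same ordering the table is meant to provide: the pages most in need of attention appear first.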

End Users

Lastly are the end users themselves. This often‐ignored class of users is your ultimate consumer; however, they're often forgotten when problems occur. Using a real‐world metaphor, keeping your Internet customers informed about situations is just as important as the airlines notifying you when your flight will be late. An uninformed customer is an unsatisfied customer, so keeping them aware of situations with your Internet‐facing systems is critical to keeping them coming back.

The problem in many organizations is in relating the information to end users in ways that are digestible to them. Also problematic is relating the right amount of information: not too little so as to annoy users or create situations of distrust, and not too much so as to give away proprietary information to competitors.

For these reasons, end user dashboards similar to Figure 7.15 can be some of the most difficult ones to correctly configure. You'll see that dashboards of this type often include the lowest resolution of information, while at the same time presenting enough data to users so that they know when problems occur. Typically, three types of data points are given to end users when creating dashboards like this one: information about current outages, current status of the infrastructure, and data about any upcoming or planned outages in the future. With these three pieces of information in hand, end users find themselves empowered enough to know when problem situations are actively occurring, and when they should expect to return to access the business service with its full capabilities.

Figure 7.15: An example end-user dashboard.

APM Visualizations Bring Quantitative Analysis to Operations

This chapter has indeed been all about the pictures. This is a necessary discussion for this point in the guide because it is those pictures—the way they are designed, the data they carry, the people for which they are tailored—that are the primary value generated by an APM solution. With the right monitoring integrations in place, data can be gathered to fill out these pictures with a quantitative view of your business operations. Deciding what to do with the result is up to you.

It is that determination of "what to do next" that is the topic of this guide's next chapter. Taking a new approach to the APM topic, Chapter 8 will depart from the traditional conversation to instead tell a story. That story continues the saga of Dan and John and the rest of TicketsRus.com, but over the course of an entire chapter. Chapter 8 will give you the opportunity to see an APM solution in action, showing how TicketsRus.com's APM implementation can be used to solve a major problem from start to finish. Because it is told in the same narrative format as each chapter's opening story, you'll learn a bit more about the company. You might also learn a bit more about your own business, and how it goes about solving similar problems today with, or without, an APM solution's objective analysis capabilities.