Utilizing KPIs and Alarm Methods

Introduction

One of the key capabilities of Lakeside Software's SysTrack is the ability to accurately score an environment's end user experience through the application of digital experience management metrics and event correlation and analysis. The end user experience score provides an overall indicator of the digital experience that users are getting from their IT resources. It is driven by a series of underlying Key Performance Indicators (KPIs) that point to what may be driving the problems impacting the day-to-day experience of users. This summarizing method is a great way to trend changes over a long period and see the trajectory of quality in the environment. It is also a useful, contextualized starting point for digging into problems as they occur, an area where event correlation and analysis is critical.

Event correlation and analysis is accomplished with the help of alerts corresponding to triggered events, which provide immediate information targeting the sources of impact as they happen. The combination of long-term user experience scores and alerts provides strong, objective measurements of the end user experience and actionable data for improving that experience operationally, allowing an organization to create and maintain a successful end user environment. So, how does each piece work?

Long Term User Experience Trends

Digital experience management and tracking is a central component of workplace analytics. IT needs a quick, easily understood metric to track the trajectory of user experience and service quality across all of its devices, and that is an integral part of the SysTrack methodology. Simply put, SysTrack calculates a score across devices as a simple grade to rapidly evaluate how the environment is performing. The score ranges from 0 to 100, with 100 representing the best possible end user experience, and is an aggregation of 13 different KPIs that summarize aspects of impact. These KPIs help lead an IT administrator toward potential sources of user-impacting problems and roughly fall into the categories of resource issues, networking problems, system configuration issues, and infrastructure problems. More specifically, the KPIs are: CPU, memory, disk, network, latency, startup time, virtual memory, virtual machine, software installation, software update, events, faults, and hardware.
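
To make the aggregation idea concrete, the sketch below shows one way a weighted blend of per-KPI impact values could roll up into a single 0-to-100 score. The weights, names, and formula here are illustrative assumptions; SysTrack's actual scoring model is not reproduced.

    # Illustrative sketch only: SysTrack's real scoring model is not shown here.
    # Each KPI reports an "impact" from 0 (no impact) to 100 (severe), and the
    # overall score is 100 minus a weighted blend of those impacts.

    KPI_WEIGHTS = {
        "cpu": 1.0, "memory": 1.0, "disk": 1.0, "network": 1.0,
        "latency": 1.0, "startup_time": 1.0, "virtual_memory": 1.0,
        "virtual_machine": 1.0, "software_installation": 1.0,
        "software_update": 1.0, "events": 1.0, "faults": 1.0, "hardware": 1.0,
    }

    def user_experience_score(kpi_impacts: dict) -> float:
        """Blend per-KPI impacts (0-100) into a single 0-100 experience score."""
        total_weight = sum(KPI_WEIGHTS.values())
        weighted_impact = sum(
            KPI_WEIGHTS[kpi] * kpi_impacts.get(kpi, 0.0) for kpi in KPI_WEIGHTS
        ) / total_weight
        return max(0.0, min(100.0, 100.0 - weighted_impact))

    # Example: a system with moderate network and latency impact
    print(user_experience_score({"network": 40.0, "latency": 25.0}))  # 95.0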

Each KPI consists of even more granular items that point toward the root cause of any impact. Figure 1 displays an overall view of the end user experience in the environment and makes it possible to monitor how those impacts evolve over time. The graph on the top right displays the performance trend for a specific system and highlights where the user experience dropped below what is considered normal.

Figure 1: End User Experience Score based on daily impacts

The key to the user experience score, and the utility of SysTrack as a platform, is the idea of using context to help establish relevant areas on which to focus. Rapid degradation in the user experience can be compared to past performance and previous drivers to help uncover underlying reasons why a user may be having problems. This approach, however, is more about the longer-term analysis of impacts and analytical insight into the function of an environment. Quick response and operational management can make use of this for insights into immediate problems, but additional value comes from the other aspect of SysTrack, event correlation and analysis.

Detecting Sources of Impact

Event correlation and analysis is supported through the use of alerts. Alerts are tailored for active management and are used to target specific operational problems. In principle, this means that alerts meet a need complementary to the longer-term trends revealed by the user experience scoring. Essentially, alerts represent a more active, pressing concern that is triggered based on many of the same criteria that lead to poorer performance. The general idea is to provide proactive notification of evolving problems for rapid response. Figure 2 demonstrates an environment with active alerts. The top chart displays some of the alert classes, which include change management, event log monitor, application fault, custom alerts, disk, system network, boot time, and memory leaks. The dashboard also displays the type of alerts per class, the number of alerts per system, and the top five alerts per system and type.

Figure 2: Alerts and their various categories

This is just one potential interface that an operational user may use, but it displays some of the categorization (e.g., Disk or System based alert items) that correlates alert items to the user experience scoring mechanism. This targeting of categories means that, for proactive management, it is simple to relate specific underlying causes (e.g., higher disk active time driving slower response) to the user experience KPI that is impacted, making troubleshooting easier. It is also much more immediately actionable, meaning that someone is notified at the time the problem occurs. It also helps answer questions like "Is this a problem that this user always has?" or "Is this the only user that has this problem?" To illustrate the point, let's look at a concrete example that uses both of these questions to analyze a problem.
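
A simple way to picture that correlation is a lookup from alert class to the KPI it most directly relates to. The mapping below is a hypothetical sketch based on the categories named above, not an exported SysTrack schema.

    # Hypothetical mapping from alert classes to the user experience KPIs they
    # most directly relate to; names mirror the categories discussed above.
    ALERT_CLASS_TO_KPI = {
        "Disk": "disk",
        "System Network": "network",
        "Boot Time": "startup_time",
        "Memory Leaks": "memory",
        "Application Fault": "faults",
        "Event Log Monitor": "events",
    }

    def related_kpi(alert_class: str) -> str:
        """Return the KPI an alert class is most likely to drag down."""
        return ALERT_CLASS_TO_KPI.get(alert_class, "unknown")

    print(related_kpi("System Network"))  # network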

Finding Root Cause with Event Correlation and User Experience Scoring

One of the most common ways alerts and user experience scores are used together to maintain a successful environment is in a feedback loop. The loop begins with an IT administrator checking the user experience score trend of an environment that might be showing signs of problems. The administrator notices the user experience score has dropped below what is considered normal and looks further into the specific KPIs that point to why the end user may be having problems on their system. After the administrator determines which KPI is causing the largest impact on the system, it is possible to drill down into the most heavily impacted day, which provides the best example of how that KPI is causing problems, along with views of any correlating alerts that have been configured and triggered. These alerts help the administrator gain a better understanding of the source of problems impacting the end user.

Let's say an IT administrator, Bob, decides to see how a system is performing. He notices on the quality trend chart that this system has been around average quality most of the time, but it appears to be following a downward slope, as displayed in Figure 3. This concerns Bob because while this system appears to perform well most of the time, it is currently showing signs of poor user experience.

Figure 3: Overall end user experience trend with impacts

By looking further into the various KPI categories, Bob notices that the network KPI on the Total Impact chart has the highest overall impact on this system. Bob is now aware that most system impacts involve the network. He continues by looking into the day where the network impact was largest, because it is the best representation of the problem source. Upon selecting that day, he can view the various alerts set up for that category, as displayed in Figure 4.
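
The drill-down Bob performs can be thought of as two selections: the KPI with the largest total impact, then the day where that KPI's impact peaked. The sketch below walks through that logic with made-up daily impact values, purely for illustration.

    # Sketch of the drill-down logic Bob follows, using invented daily impact
    # data rather than anything pulled from SysTrack itself.
    daily_impacts = {
        # day: {kpi: impact score 0-100}
        "2024-03-01": {"network": 35.0, "disk": 10.0, "cpu": 5.0},
        "2024-03-02": {"network": 60.0, "disk": 12.0, "cpu": 4.0},
        "2024-03-03": {"network": 48.0, "disk": 8.0, "cpu": 6.0},
    }

    # 1. Which KPI has the highest total impact across the period?
    totals: dict[str, float] = {}
    for day_impacts in daily_impacts.values():
        for kpi, impact in day_impacts.items():
            totals[kpi] = totals.get(kpi, 0.0) + impact
    worst_kpi = max(totals, key=totals.get)

    # 2. Which day best illustrates that KPI's impact?
    worst_day = max(daily_impacts, key=lambda day: daily_impacts[day].get(worst_kpi, 0.0))

    print(worst_kpi, worst_day)  # network 2024-03-02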

Figure 4: Network with alerts on selected day

Bob can see that an alert is currently being triggered and discovers that it is the retransmit rate for one network interface that seems to be causing problems. Now Bob can conclude that his system is starting to perform poorly due to network problems revolving around a high retransmit rate. As he continues to investigate, he selects network data to look at in detail and examines other network-related categories, such as Connections, as displayed in Figure 5.
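
For illustration, a retransmit-rate check like the one behind this alert could look roughly like the sketch below. The 5% threshold, interface name, and counter fields are assumptions, not SysTrack's built-in alert definition.

    # Illustrative check only: threshold and field names are assumptions.
    def retransmit_rate(segments_retransmitted: int, segments_sent: int) -> float:
        """Fraction of sent TCP segments that had to be retransmitted."""
        if segments_sent == 0:
            return 0.0
        return segments_retransmitted / segments_sent

    def retransmit_alert(interface: str, retransmitted: int, sent: int,
                         threshold: float = 0.05) -> str | None:
        """Return an alert message when the rate crosses the threshold."""
        rate = retransmit_rate(retransmitted, sent)
        if rate > threshold:
            return f"{interface}: retransmit rate {rate:.1%} exceeds {threshold:.0%}"
        return None

    print(retransmit_alert("Ethernet0", retransmitted=1200, sent=15000))
    # Ethernet0: retransmit rate 8.0% exceeds 5%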

Figure 5: Network and Connections System Data

After looking at connections, Bob decides to look at the system's application network traffic because the Connections category was not informative in determining why the retransmit rate is so high. Bob can also graph the retransmit rate, as displayed in Figure 6, which allows him to see any correlations and patterns that will lead to the source of the negative network impact.
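
The value of graphing the rate is that it can be compared against other series, such as per-application traffic, to spot patterns. The sketch below uses invented samples to show the kind of correlation check a graph like Figure 6 supports.

    # Sketch of a correlation check over invented per-interval samples,
    # not real SysTrack data. statistics.correlation requires Python 3.10+.
    from statistics import correlation

    retransmit_rate_samples = [0.01, 0.02, 0.06, 0.08, 0.07, 0.02]
    app_upload_mb = [5, 8, 40, 55, 48, 9]          # hypothetical app traffic
    background_sync_mb = [12, 11, 13, 12, 11, 12]  # steady background load

    print(correlation(retransmit_rate_samples, app_upload_mb))       # close to 1: strong pattern
    print(correlation(retransmit_rate_samples, background_sync_mb))  # near 0: unrelated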

Figure 6: Graph of retransmit rate

With alerts as the last clear step in targeting the source of impact, Bob has the tools and knowledge to find what is causing the user experience score trend to decline.

Importantly, alerts don't always have to be used to monitor current, ongoing problems. An IT administrator can create alerts to closely track any potential sources of impact for a system by assigning them custom thresholds, or can curate unique alerts to track items like specific event log entries. These alerts give the administrator immediate feedback on KPIs that may need extra work and the ability to proactively maintain both the environment's upkeep and a positive end user experience. For example, an IT administrator might want to monitor a worker's CPU usage and make sure it never goes higher than 80%. This is not necessarily a problem that is currently impacting the environment, but it may become one if the CPU goes above that threshold. The IT administrator is monitoring a potential negative impact, and with the aid of alerts will be able to prevent the user experience score from declining.
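
As a minimal sketch of such a custom threshold alert, assuming an 80% CPU ceiling and hypothetical sample data (none of this reflects SysTrack's alert configuration interface):

    # Minimal sketch of a custom threshold alert with an assumed 80% CPU ceiling.
    CPU_THRESHOLD_PERCENT = 80.0

    def check_cpu_alert(cpu_percent_samples: list[float],
                        threshold: float = CPU_THRESHOLD_PERCENT) -> list[str]:
        """Return a message for every sample that crosses the threshold."""
        return [
            f"sample {i}: CPU at {value:.0f}% exceeds {threshold:.0f}%"
            for i, value in enumerate(cpu_percent_samples)
            if value > threshold
        ]

    print(check_cpu_alert([42.0, 65.5, 91.2, 77.0, 88.4]))
    # ['sample 2: CPU at 91% exceeds 80%', 'sample 4: CPU at 88% exceeds 80%']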

Alerts provide immediate information on the sources of impact, while the end user experience score provides an overall indicator of the digital experience users have. While one can be used without the other, together they make the sources of system impact much clearer and easier to find. The integration of digital experience management with event correlation and analysis allows systems to become self-healing, ideally correcting problems before end users ever notice the impact.