As IT admins and application owners tear their hair out over another costly outage that has the boss breathing down their necks, a familiar cry rings out throughout the enterprise IT organisation: "monitoring sucks!!".
But why does monitoring suck and what can you do about it?
Monitoring has typically been the remit of small, specialised IT Operations teams - but the tools they've been given are rarely fit for purpose in today's demanding environments. Heavyweight, centralised monitoring tools tend to lead with an infrastructure-centric monitoring approach, which is great if you want a huge amount of granular detail about the health and availability of individual infrastructure components, but not so great if you want to understand that in the context of application and service availability.
These tools are typically complicated, noisy, long-in-the-tooth and have deep tentacles across most of the datacentre infrastructure. Although now extremely dated, particularly when it comes to UX, their sheer scale and complexity – with massive deployment times involving complex security and networking considerations - means they're hard to replace. Especially when an organisation has invested further time and effort in tuning monitoring for their specific requirements.
However, that same complexity and poor user experience means that although, in theory, they can monitor right across your infrastructure stack, in reality, the data they collect is rarely consumed outside of the core monitoring team. Because of this, these tools also suffer from a lack of specialist expertise; whilst the IT Operations team know their own monitoring tools in depth, they don't necessarily know much about the individual applications and workloads they've been tasked with monitoring.
In order to get that expertise, they need active investment from other specialist teams.
But with those teams turned off by the generic nature and poor user experience of these tools, everyone is caught in a vicious circle. The lack of a specialised monitoring experience means users turn elsewhere to get the visibility they need, meaning they don't invest in centralised monitoring, meaning it's not sufficiently tuned, meaning the information they receive isn't adequate for their needs. And so the cycle continues...
Disengaged from centralised monitoring, individual teams instead turn to specialist tools to monitor their own niche tech stacks. Because they're highly specialised, these tools are characterised by detailed metrics, deep subject matter expertise and a user experience tailored to their core audience - making them an extremely attractive proposition for individual users.
However, whilst they excel within their own narrow focus, these tools suffer from having a siloed and isolated view of the world. In turn, teams become lost in their own little worlds, disengaged from the core issue of application and service availability and, as a result, are inevitably highly reactive.
Faced with an application outage, everyone's been in the "war room" scenario where the boss turns to the assembled experts, trying to nail down the root cause of the issue, only to be faced with half-a-dozen people logging into half-a-dozen different monitoring tools and all saying it's not them that's to blame. And this is the problem that stems from the common failings of centralised monitoring. Everyone has their own blinkered view, and against this backdrop, it's no wonder that mean time to resolution (MTTR) is often so painfully slow.
However, when push comes to shove, the business doesn't really care about a long-running SQL query or noisy neighbours within your virtualised infrastructure. Knowing these things in isolation isn't enough - it's only part of a bigger, connected picture. When this information is tied to application availability, well, then you've struck gold.
The fragmented monitoring challenge is well known - something that was recognised by the IT monitoring industry - and has since led to the next generation of monitoring tools.
Application Performance Management (APM) tools start from the perspective of the end user and work their way down. They trace user interactions and help organisations pinpoint the root cause of application performance issues, whether that be underlying infrastructure failures or a poorly performing database query within the application's code.
In comparison to the fragmented monitoring scenario, APM solutions typically offer outstanding functionality, a first-class user experience and give a holistic, application-centric view of the world. In many cases they go as far as to include purely business-focused metrics like revenue generated, or the number of shopping basket transactions, alongside traditional IT performance metrics. And at the other end of the scale, APM solutions also drop right down to the code-level to provide deep insights into the cause of performance bottlenecks so that those can be fed directly back to application development teams.
Sitting at the cutting-edge of monitoring, there is an array of awesome APM solutions on the market. However, they're also extremely expensive and very complicated, meaning they're generally only deployed for a tiny handful of vital, high-revenue generating apps which can justify the required investment. After all, if you're delivering $500 million worth of business through your online store, wouldn't it be worth paying $1 million for an APM solution that helped ensure it was always available to your customers and highly responsive?
In addition, the nature of their evolution and focus on consumer facing web apps and SaaS applications means they're far better suited to monitoring web apps than any other. Similarly, the code-level insights they provide are only actually valuable if you're actively developing the application on an on-going basis and can make changes based on those insights. Otherwise they have little value.
Unfortunately, all these factors mean that APM isn't appropriate for the vast majority of enterprise applications, either because the cost is too high for the relative value of each of the many hundreds of applications an enterprise relies upon, or because those applications are simply too varied in terms of their technology stacks, their legacy within the IT organisation and the size of the team that is assigned to look after them. Rather than being actively developed in agile, DevOps-style circumstances – which is the mainstay of APM – most enterprise IT applications were either developed many moons ago or purchased from third party vendors. And far from having dedicated teams tending to each one, as well as the business itself directly focused on the application's health, often little is known about the actual make-up of these applications beyond some Visio diagram from 2008 that's still stuck on an office wall.
So, whilst there are a lucky few blessed with an outstanding APM solution for their application, for most users and applications, monitoring still sucks.
It can be easy to point out failings, but the truth is that monitoring is complicated and it's hard to get it right. Most applications are deployed across a tech stack that traverses half-a-dozen or more siloed teams, none of whose job is to focus on the application's availability or how to monitor it. Monitoring itself can encompass a huge array of factors, from the bread-and-butter of infrastructure monitoring like CPU, memory and free disk space, to application specific performance metrics, through to looking for specific errors in log files and much else in-between. There's a deep rabbit hole to dive into if you want to deliver exhaustive monitoring and it can be a long journey to get to the bottom of it.
All this is made harder by the fact that, outside of DevOps scenarios, there's normally a significant disconnect between whoever's responsible for the app and the team responsible for monitoring it, with similar disconnects to individual infrastructure teams, to the service desk, and to the actual consumers of the application. What's more, despite all those audiences being, to one degree or another, invested in the availability of the application, they're probably all looking in completely different tools and at completely different metrics to gauge that availability.
It doesn't help that monitoring is usually the last thing that anyone thinks about. How many times have you heard, "Right, so how are we going to monitor this thing…?" just days or even hours before an app is about to go into production? Against that backdrop, is it any wonder monitoring sucks?
Firstly, you need to face the fact that your big lumbering tools aren't going anywhere anytime soon and that just moaning about them doesn't help anyone. For all their flaws, there's a good reason you invested in and implemented them in the first place, and their scale and complexity mean that rip-and-replace is rarely practical. Instead, you need to focus on ways to modernise those tools, gain additional value from them and drive user engagement. Otherwise your users will continue to turn back towards their siloed views.
Secondly, acknowledge that there's a place for those individual tools, because it's often going to be necessary to go deep into a specialist tool to fix a specific problem. However, their use can't be allowed to come at the expense of having a holistic, centralised, application-centric view of your estate. On their own, these tools aren't going to give you a clear picture of the things the business cares about, and you will end up with everyone huddled in the war room, all finger-pointing but getting nowhere fast.
Because fragmented monitoring also means tool bloat, there will be plenty of overhead associated with their upkeep. For that reason you need to ask hard questions about whether the specialist tools really do something that your centralised tool can't do. Could your administrators just be more comfortable using their own tools?
For centralised monitoring to be successful, you need your subject matter experts to invest their time and knowledge into the process, which won't happen if they're entirely focused on their own tool, so do everything you can to make sure centralised monitoring offers an attractive user experience and one that is tailored as closely as possible to individual needs. For your users to buy into the process, they're going to need tooling they actually want to use.
It's also important not to be blinded by the shiny promises of APM tools. Yes, there are awesome APM solutions out there, but you need to ask hard questions about when and where they're appropriate. Unfortunately, just because the DevOps team running your new online banking app is having great success with an APM tool, that doesn't necessarily mean it's going to solve the wider problems of application monitoring within the enterprise IT environment. For net-new or cloud-native web apps, APM is likely to be a no-brainer, but for other problems, you'll need to look elsewhere. In addition, you need to keep a careful eye on the wider landscape of your operational tooling; in a hybrid cloud world, where the majority of your estate is still traditional on-prem and your cloud resources are split across multiple vendors - because who wants to end up with the vendor lock-in that comes from putting all those eggs in one basket - you don't want to end up introducing even more tools to monitor each of your cloud providers.
We also need to acknowledge that monitoring is difficult. It isn't just about tools - process plays a big part too. Monitoring almost always needs to be thought about earlier than it is and become an inherent part of an application rollout.
And whilst that will get you a long way, unfortunately, it's only going to help in the case of improving monitoring for new apps. For existing applications, start to focus on high value but easy wins, like getting availability monitoring - either simple or complex - in place for all your applications. Once you're able to accurately answer the questions "is this application available for its users right now" and "what was this application's availability over the last 30 days", you're on the right path. Being able to present that same information not just to IT but to key stakeholders, whether that's the application's users themselves or upper-level management, is vital in gaining their trust and being able to better demonstrate the value of IT.
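As a rough illustration of how little is needed to answer those two questions, here is a minimal Python sketch. It assumes you already have a list of timestamped up/down results from some periodic availability check; the function names, the five-minute freshness threshold and the data shape are all hypothetical choices for this example, not a feature of any particular monitoring product.

```python
from datetime import datetime, timedelta


def availability_percent(checks, now, window_days=30):
    """Percentage of checks in the last `window_days` that reported 'up'.

    `checks` is a list of (timestamp, is_up) pairs from a hypothetical
    periodic probe. Returns None if no checks fall inside the window.
    """
    cutoff = now - timedelta(days=window_days)
    recent = [is_up for ts, is_up in checks if cutoff <= ts <= now]
    if not recent:
        return None
    # True counts as 1 and False as 0, so sum() is the number of 'up' checks.
    return 100.0 * sum(recent) / len(recent)


def is_available_now(checks, now, max_age=timedelta(minutes=5)):
    """True only if the most recent check is fresh and reported 'up'.

    The five-minute `max_age` default is an assumption; a stale result
    is treated as 'not known to be available'.
    """
    if not checks:
        return False
    ts, is_up = max(checks, key=lambda check: check[0])
    return is_up and (now - ts) <= max_age


# Example: one check per hour for 30 days, with an 18-hour outage.
now = datetime(2024, 1, 31, 12, 0)
checks = [
    (now - timedelta(hours=h), not (100 <= h < 118))
    for h in range(720)
]
print(availability_percent(checks, now))
print(is_available_now(checks, now))
```

Counting successful checks is the simplest possible definition of availability. A real deployment would need to decide how to treat maintenance windows and missed checks, but even a figure this crude can be shown to stakeholders outside IT.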
It's much better to have one key metric that's accurate and widely visible, than dozens of low-level metrics that no-one cares about or can even get to. And in acknowledging the complexity, look to solutions that offer not just new software, but come with deep, specialised expertise and vibrant communities. The issues involved in delivering successful monitoring for potentially hundreds of wildly differing applications are sufficiently complex that neither you, nor a vendor, are going to solve alone. You need to be able to draw upon the broader expertise and experience of your peers to help cut through the complexity.
Cutting through the complexity also means looking for simple solutions that deliver rapid, visible results, not ones that promise the earth but need years and years to implement. If you or your users are already thinking "monitoring sucks", then you don't want to be stuck in that situation for another three years. The benefits of successful monitoring are simply too high. You need to implement a solution that will change that mentality fast - something like Enterprise Application Monitoring (EAM): an affordable, scalable version of APM, without the code-level insights.
Because once you do change the perception of monitoring, you can pull yourself out of the vicious cycle and into a virtuous one: a place where everyone appreciates the value of monitoring and starts to willingly invest their own time and energy into it, leading to incremental improvements, greater satisfaction and improved engagement by all.
And so the virtuous cycle continues...