Picture this scenario: You get to your desk at 9:00 a.m. sharp, having had a great morning workout, followed by a shower, a fantastic cup of coffee, and a frustration‐free drive to the office. You're fresh and focused and ready to make a serious dent in that growing to‐do list, which includes curious items like "Users complain that 'the Internet' gets really slow every so often" and "The CFO thinks we're overpaying for WAN bandwidth. How much are we using?"
Logging on to your PC, you notice that no emails have come in overnight. "That's odd," you're thinking. Seeing you arrive, your buddy now walks over and says, "Looks like something's wrong with email." You log on to the email server and find out that it's . . . well, you don't actually log on to the email server. The remote desktop won't make a connection. You try pinging the box, and there's no response. You wonder to yourself if the problem is in the network or somewhere else in the system. With a sinking feeling, you make the long journey to the computer room. All hope of working on your to‐do list is now gone as you stab a finger at the server's power switch. A few moments later, you're logged on at the console. A pop‐up alert on the screen tells you that one of the drives is completely full.
Much (much!) later in the day, a picture forms of what happened. Sometime during the night (2:30 a.m. to be exact) the data drive filled up, causing mail services to stop. Shortly after that, errors on the system drive reached a critical point, and the entire system crashed. Meanwhile, in the heat of fighting this fire, you didn't dig deeper to note that the data drive has been hovering at 95 percent capacity for over a week. And the drive that contains the operating system has been throwing read/write errors every 15 minutes for the last 17 days.
About this time, your manager, who's been keeping a respectful distance while you worked, lets you know that the CEO is back from his contract discussions overseas. During the flight home, the CEO needed to send some follow‐up documentation to the customer. When the corporate email wasn't responding, he resorted to creating a professional‐sounding Gmail account and sent the files from there. The three of you are scheduled to sit down and debrief the situation in 30 minutes. You start to pull some notes together for what you predict will be an uncomfortable conversation. Well, it was going to be a great day.
The situation you just read about may be a typical one for you in your Information Technology (IT) monitoring scope. If you can relate, then this book is for you! Network Monitoring For Dummies, SolarWinds Special Edition, provides an introduction to IT monitoring for someone who is familiar with IT in general but not with monitoring as a discipline. As such, (almost) no prior knowledge or experience is required before delving into the chapters of this book. If you already have experience with monitoring, this may not be the book for you. But then again, couldn't we all use a refresher? It couldn't hurt.
We have attempted to make this book tool‐agnostic. The purpose of this book is to give you a basic understanding of why you need monitoring, what the monitoring tools are, and some best practices of network monitoring.
"Monitoring as a discipline" means devoting your focus as an IT professional to ensuring your network, servers, applications, and so on are all stable, healthy, and running at peak efficiency. It means not just being able to tell that a system has crashed, but more importantly being able to tell when a system will crash, and intervening so the crash is avoided.
This chapter gives you insight into monitoring as a discipline, the benefits of monitoring, and the difference between monitoring and managing.
About a decade ago, there were no InfoSec professionals, no "white hat hackers," no pen testers. Network security, such as it was, was typically handled by a network or server admin who was drawn to security issues, who had an interest, and who felt passionately about keeping his or her environment safe. Ten years later, no company would think of excluding information security from the list of must‐have in‐house expertise.
We believe that the same is happening for monitoring professionals. Currently, many IT shops run without any significant monitoring solution. Others go about it in a piecemeal fashion, allowing teams or even individuals to deploy solutions with no thought to interoperability, scalability, or standards.
But in the not‐so‐distant future, we imagine a world where the idea of having a monitoring team is as natural as the teams of network, server, virtualization, storage — and yes, security — administrators we have today.
To get to that future, people who are drawn to monitoring, who have an interest and a passion for it, need the information to get up to speed on common terms, concepts, and techniques, and then they need the tools to turn that knowledge into results. This book is dedicated to imparting knowledge and experience gleaned from years of focus on building up our monitoring expertise, and from thousands of engagements with customers who had the same goal as you do.
If you've worked in IT for more than 15 minutes, you know that systems crash unexpectedly, users make bizarre claims about how the Internet is slow, and managers request statistics. Those requests leave you scratching your head, wondering how to collect the numbers in a way that's meaningful and doesn't consign you to hitting Refresh and spending half the day writing figures on a piece of scratch paper just to get a baseline for a report.
The answer to all these challenges (and many, many more) lies in effectively monitoring your environment, collecting statistics, and/or checking for error conditions so you can act or report effectively when needed. This goes well beyond a passive "make sure everything is green" approach to one that includes resource optimization, performance optimization, and proactive prevention and remediation.
Industry studies peg the cost of downtime in the hundreds of thousands of dollars per hour, so the benefits of monitoring are indisputable:
Attaining the benefits of monitoring (see the preceding section) is easier said than done. Saying "let's monitor our IT environment" presumes that you know what you should be looking for, how to find it, and how to get it without impacting the system you're monitoring. You're also expected to know where to store the values, what thresholds indicate a problem situation, and how to let people know about a problem in a timely fashion.
Yes, having the right tool for the job is more than half the battle. But, it's not the whole battle, and it's not even where the skirmish started.
To build an effective monitoring solution, the true starting point is learning the underlying concepts. You have to know what monitoring is before you can set up what monitoring does.
Network monitoring is the phrase used to describe the practice of continuously monitoring the network and providing notifications to an administrator (probably you if you're reading this book) when an element of the network fails. Monitoring is usually performed by software or hardware tools and doesn't have an effect on the operation or condition of the network. Monitoring can be performed passively or actively:
to observe, record, or detect (an operation or condition) with instruments that have no effect upon the operation or condition
This is in contrast to management in which the administrator governs or controls the environment:
to handle, direct, govern, or control in action or use
Every monitoring system, regardless of the vendor or packaging, utilizes basic monitoring principles and technologies. This chapter lays out those core techniques and then gives you a deeper look into monitoring your network.
A few fundamental aspects of a monitoring system exist across the board, no matter what software you use, or the protocol, or the technique. These basic technologies used for monitoring include the following:
Regardless of what monitoring vendors will have you believe, a finite and limited number of technologies can be used to monitor. Where the sophistication comes in is with the frequency, aggregation, the relevance of displays, the ease of implementation, and other aspects of packaging.
Ping sends out a packet to the target device, which (if it's up and running) sends an "I'm here" type response. The result of a ping tells you whether the device is responding at all (up) and how fast it responded.
Simple Network Management Protocol (SNMP) has a few pieces that combine to provide a powerful monitoring solution. SNMP is comprised of a list of elements that return data on a particular device. It could be CPU or the average bits per second transmitted in the last five minutes. SNMP provides data based on either a Trap trigger (when one of the internal data points crosses a threshold) or an SNMP poll request.
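To make the "average bits per second" idea concrete, here is a minimal sketch of how two successive SNMP polls of an interface octet counter can be turned into a rate. It assumes a 32-bit counter (such as the standard ifInOctets) that wraps back to zero, and at most one wrap between polls. The function name and the poll values are made up for illustration; fetching the counter from a device is left to whatever SNMP tool you use.

```python
COUNTER32_MAX = 2 ** 32  # a 32-bit octet counter wraps back to zero at this value


def bits_per_second(prev_octets, curr_octets, interval_seconds):
    """Average bits per second between two polls of an octet counter."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = curr_octets - prev_octets
    if delta < 0:
        # The counter wrapped between polls (assumes it wrapped only once).
        delta += COUNTER32_MAX
    return delta * 8 / interval_seconds


# Two hypothetical polls taken five minutes (300 seconds) apart.
rate = bits_per_second(prev_octets=1_000, curr_octets=38_500, interval_seconds=300)
print(f"{rate:.0f} bps")
```

On busy, fast links a 32-bit counter can wrap more than once in five minutes, which is one reason shorter polling intervals or 64-bit counters are often preferred.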
The Internet Control Message Protocol (ICMP) is used by network devices like routers and switches to send error messages indicating that a host isn't reachable along with some diagnostics.
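Ping itself rides on ICMP: the "are you there?" packet is an ICMP echo request (type 8), and the "I'm here" answer is an echo reply. The sketch below only builds the request bytes and its checksum so you can see what is inside one. It doesn't send anything, because raw ICMP sockets usually need elevated privileges. The function names are our own for illustration.

```python
import struct


def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum used by ICMP."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return ~total & 0xFFFF


def build_echo_request(identifier: int, sequence: int, payload: bytes = b"ping") -> bytes:
    """Build an ICMP echo request: type 8, code 0, checksum, id, sequence, payload."""
    header = struct.pack("!BBHHH", 8, 0, 0, identifier, sequence)
    checksum = internet_checksum(header + payload)
    return struct.pack("!BBHHH", 8, 0, checksum, identifier, sequence) + payload


packet = build_echo_request(identifier=1, sequence=1)
print(packet.hex())
```

A receiver can validate the packet by running the same checksum over the whole message; a correct packet yields zero.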
Syslog messages are similar to SNMP traps. A syslog service or agent takes events that occur on the device and sends them to a remote listening system (Syslog destination server).
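Each syslog message starts with a priority number in angle brackets that encodes two things: the facility (which subsystem sent it) and the severity. The priority is the facility times 8 plus the severity. A listening system typically decodes it to decide what deserves attention. Here is a minimal sketch; the function name is ours and the sample message is illustrative.

```python
import re

# Severity names in order, from 0 (most severe) to 7 (least severe).
SEVERITIES = [
    "emergency", "alert", "critical", "error",
    "warning", "notice", "informational", "debug",
]


def decode_priority(message: str):
    """Return (facility_number, severity_name) from a syslog message's <PRI> prefix."""
    match = re.match(r"<(\d{1,3})>", message)
    if not match:
        raise ValueError("no syslog priority prefix found")
    facility, severity = divmod(int(match.group(1)), 8)
    return facility, SEVERITIES[severity]


sample = "<34>Oct 11 22:14:15 mymachine su: 'su root' failed for lonvick"
print(decode_priority(sample))
```

With this in hand, a collector could, for example, forward only messages at "error" or worse to the on-call team.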
An application or process writes messages to a plain text file on the device. The monitoring piece of that comes in the form of something that reads the file and looks for trigger phrases or words.
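The "something that reads the file" can be very small. The following sketch scans lines of text for trigger phrases, ignoring case, and reports the line number of each hit. The trigger words and log lines are invented for illustration; a real monitor would also remember how far it had read so it doesn't report the same line twice.

```python
def find_trigger_lines(lines, triggers):
    """Return (line_number, trigger, text) for each line containing a trigger phrase."""
    lowered = [trigger.lower() for trigger in triggers]
    hits = []
    for number, line in enumerate(lines, start=1):
        text = line.lower()
        for trigger in lowered:
            if trigger in text:
                hits.append((number, trigger, line.rstrip("\n")))
                break  # one hit per line is enough
    return hits


# Hypothetical log content; in practice you would pass an open file object.
log = [
    "09:00:01 service started\n",
    "09:05:13 ERROR disk quota exceeded\n",
    "09:05:14 retrying write\n",
    "09:05:20 Fatal: mail store offline\n",
]
for hit in find_trigger_lines(log, ["error", "fatal"]):
    print(hit)
```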
Event log monitoring is specific to Windows. By default, most messages about system, security, and (standard Windows) applications events are written here. Event log monitors watch the Windows event log for some combination of EventID, category, and so on, and perform an action when a match is found.
Performance monitor (or PerfMon) counters are another Windows‐specific monitoring option that can reveal a great deal of information, both about errors on a system and ongoing performance statistics.
Windows Management Instrumentation (WMI) is a management infrastructure built into the Windows operating system that scripts and monitoring tools can query to collect and report information about the target system.
Running a script to collect information can be as simple or complicated as the author chooses to make it. In addition, the script might be run locally by an agent on the same device and report the result to an external system. Or, it might run remotely with elevated privileges.
Internet Protocol Service Level Agreements (IP SLAs) are a pretty comprehensive set of capabilities built into Cisco equipment (and others nowadays, as they jump on the bandwagon). These capabilities are all focused on ensuring the WAN, and more specifically VoIP, environment is healthy by using the devices that are part of the network infrastructure instead of requiring you to set up separate devices to run tests.
Standard monitoring can tell you that the WAN interface on your router is passing 1.4 Mbps of traffic. But who is using that traffic? What kind of data is being passed? Is it all HTTP, FTP, or something else? Flow (most commonly referred to as NetFlow) monitoring answers those questions. It organizes the information in terms of conversations and shows who is using the network, what they are sending, and how much.
Monitoring your network allows you to be alerted to possible potholes before your users hit them at top speed. In this chapter, we provide insight into monitoring your network.
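To show what "conversations" means in practice, here is a small sketch that totals bytes per conversation and lists the heaviest ones. The flow records are simplified, hypothetical tuples of source, destination, protocol, and byte count; real flow exports carry many more fields.

```python
from collections import Counter


def top_conversations(flows, limit=3):
    """Total bytes per (source, destination, protocol) and return the largest."""
    totals = Counter()
    for src, dst, protocol, byte_count in flows:
        totals[(src, dst, protocol)] += byte_count
    return totals.most_common(limit)


# Hypothetical flow records: (source, destination, protocol, bytes).
flows = [
    ("10.0.0.5", "203.0.113.10", "HTTP", 120_000),
    ("10.0.0.7", "198.51.100.4", "FTP", 900_000),
    ("10.0.0.5", "203.0.113.10", "HTTP", 80_000),
    ("10.0.0.9", "203.0.113.10", "HTTP", 15_000),
]
for conversation, total in top_conversations(flows):
    print(conversation, total)
```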
In most modern network monitoring systems, devices are monitored for the following:
Monitoring here relies primarily on SNMP and ICMP with more advanced monitoring taking advantage of packet inspection. Some of the key metrics that you should look at include response time and packet loss, CPU load and memory utilization, and hardware health details.
Understanding how network bandwidth is being used is critical in ensuring the availability and performance of business services. Bandwidth and traffic usage are most often monitored using the Flow (most commonly referred to as NetFlow) technology that is built into most routers by looking at "conversations" between devices.
When monitoring traffic and bandwidth, pay attention to
You may not own the WAN between your sites and remote locations and can't directly monitor the fault, availability, and performance of the devices within the WAN. If that's the case, you can use a technology such as IP SLA to generate synthetic traffic or operations to measure the performance between two locations or devices, determining the performance of the WAN.
IP SLA is especially beneficial when monitoring applications that are particularly sensitive to delay, jitter, or packet loss such as VoIP or video streaming.
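The numbers that come out of synthetic tests boil down to delay, jitter, and loss. The sketch below shows one simple way to summarize a series of probe results. It assumes jitter means the average absolute difference between consecutive delays, and that a lost probe is recorded as None. Network devices compute these figures themselves and may define jitter differently, so treat this as an illustration of the idea only.

```python
def summarize_probes(delays_ms):
    """Summarize synthetic probe results; None marks a probe that got no reply."""
    if not delays_ms:
        raise ValueError("at least one probe result is required")
    received = [d for d in delays_ms if d is not None]
    loss_pct = 100 * (len(delays_ms) - len(received)) / len(delays_ms)
    avg_delay = sum(received) / len(received) if received else None
    # Jitter here: mean absolute change between consecutive received delays.
    changes = [abs(b - a) for a, b in zip(received, received[1:])]
    jitter = sum(changes) / len(changes) if changes else 0.0
    return {"loss_pct": loss_pct, "avg_delay_ms": avg_delay, "jitter_ms": jitter}


# Hypothetical delays (in milliseconds) from five probes; the third was lost.
print(summarize_probes([20, 22, None, 30, 28]))
```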
A network can have thousands of IP addresses in use at any given time. A duplicate IP assignment, exhausted subnet or DHCP scope, or misconfigured DHCP or DNS service will cause a network fault.
Look for a solution that monitors these IP resources and that can proactively alert you of problems to help you plan for orderly expansion.
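Two of those checks, duplicate assignments and a subnet that is running out of addresses, are easy to illustrate. This sketch uses Python's standard ipaddress module. It assumes an IPv4 subnet where the network and broadcast addresses aren't assignable, and the address list is invented for the example.

```python
import ipaddress
from collections import Counter


def subnet_report(cidr, assigned):
    """Return (duplicate_addresses, percent_of_usable_addresses_in_use)."""
    network = ipaddress.ip_network(cidr)
    usable = network.num_addresses - 2  # assumes network and broadcast are reserved
    counts = Counter(assigned)
    duplicates = sorted(ip for ip, seen in counts.items() if seen > 1)
    in_use = {ip for ip in counts if ipaddress.ip_address(ip) in network}
    return duplicates, 100 * len(in_use) / usable


# Hypothetical assignments in a small /29 subnet (six usable addresses).
assigned = ["192.168.1.1", "192.168.1.2", "192.168.1.2", "192.168.1.3"]
duplicates, utilization = subnet_report("192.168.1.0/29", assigned)
print(duplicates, f"{utilization:.0f}% used")
```

Tracking the utilization figure over time is what lets you plan an expansion before the scope is exhausted.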
After all is said and done, you still need to buy or build a tool or set of tools that help you monitor all the elements of the IT stack. This can be done with discrete specialized tools that monitor a specific element (for example, network monitoring, storage monitoring, virtualization monitoring, and so on) or with a fully integrated suite of products that provides a common platform across the entire stack. Each approach has its advantages and disadvantages.
Regardless of which approach you choose, all software vendors are selling solutions that work from the same basic playbook. What should you look for as a differentiating factor? What is it, exactly, that makes brand X so much better than brand Y? The answer has as much to do with you and your organization as it does with how monitoring gets done.
Will your monitoring team be one person who is also your server team and network team and helpdesk team and database team? If so, you probably need a tool that sacrifices comprehensive options for simplicity and manageability. Does your organization need absolute flexibility so that the monitoring solution is the one‐stop‐shop for all your needs? You will pay more, and require more staff, but at the end of the day (or month, or more likely year) you will have a software suite that fits you like a glove.
With all of that said, the nontechnical items you should consider include the following:
You can spend all the money in the world on fancy monitoring solutions, but if you don't follow some key best practices of network monitoring (we give you the top ten), your effort is bound to fall short of expectations — both yours and the people depending on you.
To be able to identify potential problems even before users start complaining, you need to be aware of what's normal. Baselining behavior over a couple of weeks or even months will help you understand what normal behavior is.
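As a rough illustration of what a baseline buys you, the sketch below records the average and spread of past samples and then flags a new reading that falls too far outside them. The three-standard-deviation tolerance and the sample values are assumptions made for the example; the right tolerance depends on the metric.

```python
from statistics import mean, stdev


def build_baseline(samples):
    """Return (average, standard deviation) of historical samples."""
    return mean(samples), stdev(samples)


def is_abnormal(value, baseline, tolerance=3.0):
    """True when value is more than `tolerance` standard deviations from average."""
    average, spread = baseline
    return abs(value - average) > tolerance * spread


# Hypothetical CPU load percentages collected during normal operation.
baseline = build_baseline([40, 42, 38, 41, 39])
print(is_abnormal(41, baseline), is_abnormal(60, baseline))
```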
Keep an inventory of the network devices, ports, and interfaces being used for network connections, network hardware (links, network controllers, power supply, and so on), servers, virtual machines, and SAN devices.
Alerts help you monitor proactively. Most alerts are automatic email notifications when particular metrics thresholds are crossed. For each alert, you can set critical and warning threshold values. These threshold values are meant to be boundary values that when crossed indicate that the system is in an undesirable state.
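In code, a warning and critical threshold pair amounts to a simple classification. The sketch below assumes a metric where higher is worse (such as disk utilization); the function name and threshold values are illustrative.

```python
def classify(value, warning, critical):
    """Return 'ok', 'warning', or 'critical' for a metric where higher is worse."""
    if critical < warning:
        raise ValueError("critical threshold must not be below warning threshold")
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"


# Hypothetical disk utilization readings checked against 80/90 percent thresholds.
for reading in (72, 85, 95):
    print(reading, classify(reading, warning=80, critical=90))
```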
When an alert repeatedly triggers (a device that keeps rebooting itself, or a disk drive that hovers on the edge of full while processes keep creating and deleting temporary files, so that one moment it's over threshold and the next it's below), that condition is known as flapping or sawtoothing.
Here are a few techniques that can be used to avoid this, depending on the toolset:
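One approach that many toolsets offer in some form is to act only when a condition persists. The sketch below is an assumption about how such a rule might work, not a description of any particular product: the alert is raised only after several consecutive polls over the threshold and cleared only after several consecutive polls back under it.

```python
def alert_states(samples, threshold, required=3):
    """Return the alert state after each sample, changing state only when the
    opposite condition has held for `required` consecutive samples."""
    active = False
    streak = 0
    states = []
    for value in samples:
        breaching = value > threshold
        if breaching != active:
            streak += 1
            if streak >= required:
                active = breaching
                streak = 0
        else:
            streak = 0  # the condition bounced back, so start counting again
        states.append(active)
    return states


# Hypothetical disk readings that hover around a 90 percent threshold, then stay high.
print(alert_states([91, 89, 91, 89, 91, 92, 93, 94], threshold=90))
```

With these readings, the early bouncing never raises an alert; the alert fires only once the value has stayed above 90 for three polls in a row.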
If you find yourself setting up a rule in email to filter or even delete alerts, you're admitting you've failed to set up the correct alert. You're bound to ignore critical alerts.
In some cases, such as when monitoring disk space, you may not be interested in a specific numeric threshold (alert when the disk is more than 90 percent utilized). Instead, in some cases, what you want to monitor is the delta, or the rate of change. Here, you might be interested in knowing that disk utilization has gone up by more than xx percent over yy minutes, which may indicate a spike in consumption.
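A rate-of-change check can be sketched in a few lines. Here the newest reading is compared with the oldest reading that still falls inside the time window. The function name, the 10-percentage-point limit, and the 20-minute window are placeholders for the "xx" and "yy" values you would choose yourself.

```python
def grew_too_fast(samples, max_increase_pct, window_minutes):
    """samples: (minute, percent_used) pairs in time order.
    True when usage rose by more than max_increase_pct within the window."""
    if not samples:
        return False
    latest_minute, latest_value = samples[-1]
    for minute, value in samples:
        if latest_minute - minute <= window_minutes:
            # First match is the oldest sample still inside the window.
            return latest_value - value > max_increase_pct
    return False


# Hypothetical disk utilization readings taken every ten minutes.
readings = [(0, 60), (10, 61), (20, 62), (30, 75)]
print(grew_too_fast(readings, max_increase_pct=10, window_minutes=20))
```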
Provide the Details
It is not enough to set and generate an alert if there are not enough details to begin troubleshooting. The more in‐depth the alert, the faster the troubleshooting. In addition to the obvious ones (name and IP of the affected device, time of failure, statistic of the failed component at the time of failure, and so on), some other items that might come in handy include the following:
One reason potential issues become actual problems is that threshold‐based alerts are ignored or never reach the right person. When setting up monitoring and reporting, the organization should have a policy on who has to be alerted when a malfunction occurs or a potential problem is detected. Based on that policy, the alert can be routed to the person who administers the affected part of the environment.
Parent‐child relationships (which typically have to be set up manually for each set of devices) are a way of telling the monitoring system what's connected and how. This way, when a parent device is down (the router is the parent of the switch, the switch is the parent of the server), any alerts related to the child devices are suppressed. When the router is down, the switch isn't (necessarily) down. It may be simply unreachable.
In more sophisticated tools, the software might actively check the upstream parent before marking a device as down, and continue all the way up the chain until it finds the highest level device that is down. This is known as upstream verification (or conversely, downstream suppression).
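The walk up the chain can be sketched as follows. The parent map and the is_up check are hypothetical stand-ins for what a monitoring tool would hold in its own database and polling engine, and the sketch assumes the relationships form a simple tree with no loops.

```python
def root_cause(device, parents, is_up):
    """Starting from a device that looks down, climb the parent chain and
    return the highest device in it that is also down."""
    culprit = device
    current = parents.get(device)
    while current is not None and not is_up(current):
        culprit = current
        current = parents.get(current)
    return culprit


# Hypothetical topology: the router is the parent of the switch,
# and the switch is the parent of the server.
parents = {"server": "switch", "switch": "router", "router": None}
down = {"server", "switch", "router"}
print(root_cause("server", parents, lambda name: name not in down))
```

Alerting on only the device this function returns is what keeps one router outage from generating an alert for every device behind it.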
Event correlation is a big topic. It's much bigger than one section of this document can accommodate, but it's important enough to merit a brief discussion.
Event correlation tools can perform the flap detection and suppression, as well as parent‐child correlation. In addition, event correlation tools might perform the following:
A single occurrence of an event may be negligible, but repeated occurrences indicate a real problem.
Your users don't care that a switch went down or a routing change was made; they only care that an application is performing poorly or failing. You need to view network traffic and performance and their impacts on the application. This can be accomplished using techniques such as packet inspection.