Event Monitoring: Why we get it wrong. How to get it right.
Why we get it wrong.
First of all, we should recognise that event monitoring is hard. We know we need monitoring, but we need to make sure we are monitoring the right things at the right level. Often those monitoring decisions are driven by SLAs or OLAs in a contract. Sometimes they are driven by an internal requirement that keeps the business running (like your company email system: it has no SLAs or financial penalties attached, but it's pivotal to business operations).
We should also be vigilant about the cost of monitoring: not just the dollar cost, but the cost of labour. The dollar cost varies hugely and is driven largely by the licences and tools purchased to perform the monitoring, plus the expertise needed to deploy and configure them. However (and this is where we usually fail) there is also the cost of labour to deal with the alerts that are generated. Too many alerts and you'll be swimming in incidents, often affecting other SLAs; I've seen SLAs breached simply by having too many incidents in a queue, regardless of how quickly they were dealt with.

Now overlay things like the 24×7 coverage that might be required to react to those events. Worse still, if you are required to notify major incident management teams or leadership out of hours, you'll soon find your overtime bill skyrocketing. Even when we aren't dealing with high-severity incidents, do we really want techies to walk into a pile of low-severity, meaningless events every Monday morning? You will drain the morale of your technical teams by bashing them with the same events time and again. Eventually the alerts will be ignored (the boy who cried wolf).
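One practical defence against that flood of repeated events is deduplication: raise an alert the first time a problem is seen, and suppress identical repeats within a window. Here is a minimal sketch of the idea in Python; the class name, event key, and window length are my own illustrative choices, not any particular monitoring product's behaviour.

```python
from collections import defaultdict
from datetime import datetime, timedelta

class AlertSuppressor:
    """Collapse repeated identical events into a single alert.

    Hypothetical sketch: the (source, check) key and the suppression
    window are illustrative, not taken from any specific tool.
    """

    def __init__(self, window_minutes=60):
        self.window = timedelta(minutes=window_minutes)
        self.last_alerted = {}          # (source, check) -> time last alert raised
        self.counts = defaultdict(int)  # how many raw events we swallowed

    def should_alert(self, source, check, now=None):
        now = now or datetime.utcnow()
        key = (source, check)
        self.counts[key] += 1
        last = self.last_alerted.get(key)
        if last is not None and now - last < self.window:
            return False  # duplicate within the window: suppress it
        self.last_alerted[key] = now
        return True
```

The counts are kept so that the one alert that does fire can say "seen 47 times since 03:00", which is far more useful to the on-call engineer than 47 separate tickets.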
Now I hear you say, “well, I wouldn’t be silly enough to configure meaningless alerts in the first place”. Sorry, you’re wrong. We all do it. Here’s why…
Often the technical team responsible for xyz infrastructure or application is the one asked to design the monitoring requirements. They know the application best, right? They built the infrastructure, so they know what ‘bad’ looks like? They are exactly the right people to configure the alerting? Surely? Well, yes and no.
Yes, because they do understand how their kit works. Nobody is better positioned.
No, because they are often detached from the SLA, and they probably aren’t the people responding to the events at 3am.
The technical team nearly always creates too many events. They care about every minuscule detail: every application process, every server, every HTTP error. They live in a world where everything is urgent and their baby should never be ignored. They unknowingly create downstream incident management headaches by configuring everything to alert, and the events are often set at a higher priority than they need to be.
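The fix for over-prioritisation is to derive incident priority from the business impact of the service, not from the technical severity of the component that fired. A minimal sketch of that mapping, with entirely illustrative service names and tiers of my own invention:

```python
# Hypothetical priority mapping: the business, not the technical team,
# decides which tier each service sits in. Names and tiers are
# illustrative assumptions, not a standard.

SERVICE_TIER = {
    "customer-portal": "gold",    # contractual SLA, financial penalties
    "internal-email":  "silver",  # business-critical, no contract
    "build-server":    "bronze",  # best effort
}

PRIORITY = {
    ("gold",   "critical"): "P1",
    ("gold",   "warning"):  "P3",
    ("silver", "critical"): "P2",
    ("silver", "warning"):  "P4",
    ("bronze", "critical"): "P4",
    ("bronze", "warning"):  None,  # log it; don't raise an incident
}

def incident_priority(service, severity):
    """Map a raw event to an incident priority via the service tier."""
    tier = SERVICE_TIER.get(service, "bronze")  # unknown services: best effort
    return PRIORITY.get((tier, severity))
```

Note what this buys you: a "critical" disk event on the build server becomes a P4, and a warning there raises no incident at all, which is exactly the decision the technical team alone would never make.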
How do we get it right?
Start with the contract. If there is no contract (as mentioned earlier, the internal company email system isn’t SLA’d), then talk to the stakeholders. Do we really care that email isn’t working at the weekend? Does the clock on your SLA for the customer application stop at 5pm? The most important thing in monitoring is the service design. Don’t let the techies design the monitoring requirements on their own (they’ll unwittingly cause problems). The business people need to tell the technical teams what they want to see, and the technical teams need clear guidance about what is required from the alerting configuration.
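Those questions about weekends and 5pm translate directly into alert routing: only page people out of hours when the service's SLA actually runs out of hours. A rough sketch, assuming invented service names and coverage windows:

```python
from datetime import datetime

# Hypothetical coverage windows agreed with the business.
# (start_hour, end_hour) on weekdays; (0, 24) means 24x7.
SLA_HOURS = {
    "customer-portal": (0, 24),   # contractual 24x7 coverage
    "internal-email":  (8, 18),   # office hours only, Mon-Fri
}

def should_page(service, event_time):
    """Return True if this event should wake someone up right now."""
    start, end = SLA_HOURS.get(service, (8, 18))  # default: office hours
    if (start, end) == (0, 24):
        return True
    if event_time.weekday() >= 5:  # Saturday or Sunday
        return False
    return start <= event_time.hour < end
```

An email outage at 3am on a Saturday still generates an event for Monday's queue; it just doesn't generate an overtime bill.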
Before lifting a finger on monitoring tool deployments or solutions, we need to clearly understand the business requirements. Document what’s needed and use that document as the starting point for the monitoring solution.
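That requirements document works best when it is structured enough to check. One way, sketched here with entirely illustrative fields and values, is to capture each check with an owner, a coverage window, and an agreed priority, so that anything configured without business sign-off stands out:

```python
# Hypothetical monitoring requirements captured as data before any
# tool is deployed. Every field and value below is an assumption
# for illustration, not a prescribed schema.

REQUIREMENTS = [
    {
        "service": "customer-portal",
        "owner": "e-commerce team",
        "coverage": "24x7",
        "checks": [
            {"name": "http_availability",
             "threshold": "3 failed probes in 5 min", "priority": "P1"},
            {"name": "order_api_latency",
             "threshold": "p95 > 2s for 10 min", "priority": "P2"},
        ],
    },
]

def undefined_priorities(reqs):
    """Flag checks that were added without an agreed business priority."""
    return [c["name"]
            for r in reqs
            for c in r["checks"]
            if not c.get("priority")]
```

Running `undefined_priorities` over the document before deployment catches exactly the failure mode described above: a technically interesting check that nobody in the business ever asked for.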