
It's not about the uptime - time to throw away the RAGs


All network management products miss the point. Their primary focus is monitoring uptime, in what is often referred to as a RAG tool: Red, Amber, Green, where Red signifies down, Amber signifies intermittent connectivity problems and Green signifies good connectivity. This serves no business purpose and cannot justify any return on investment. All the development effort goes into the periods when the situation is acceptable, and none into when you are in dire straits. How is this monitoring? All it does is give you a comfortable feeling. With this approach there is no difference between the value proposition of 'ping' and that of a network management framework product worth a million bucks.
What is required is to monitor downtime. Outages require a special view that is serialized and records the outage time. However, it is not the length of the outage that matters most but the crucial time periods within it that provide the metrics. These metric times align with the expanded incident lifecycle.

So where is the real worth in reporting 99.9% availability? This translates to 0.1% outages, but the figure has limited meaning because:

  • Do we know which services were affected?
  • What was the business impact?
  • What were the prevailing conditions during the outage?
  • What was the resolution (and was a workaround implemented)?
  • Did we annotate the visual, proximate and root causes?
  • Do we know the resources assigned to work on the outage?
  • How was the outage classified and prioritized?
  • What was the risk assessment and mitigation?
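For scale, a quick back-of-envelope shows what 99.9% actually permits. The function below is a hedged sketch of the arithmetic, nothing more:

```python
def allowed_downtime_minutes(availability_pct: float, period_days: int = 30) -> float:
    """Minutes of permitted downtime at a given availability over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)

# 99.9% over a 30-day month still permits roughly 43 minutes of outage,
# with no indication of which services, what impact, or why.
```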

A metric of 0.1% answers none of these questions, and no network management product in existence reports on them. As an example, does a historical snapshot exist of the other outages in progress at the time of a major incident?

Crucially, the monitoring of downtime needs to record the delta between when the actual outage happened and when it was acknowledged and escalated to the responsible operator via a notification process. If there was an outage with no notification, then this needs to appear in an exception report. Every network management product vendor bases their sales pitch on providing a faster detection time for an outage. This is absolute bull, because in most cases, when the downtime of an outage is analyzed, detection time is the smallest contributor to the overall length of the outage. Why would you buy a tool whose primary purpose is to address the smallest requirement?
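The acknowledgement delta and the exception report are both trivial to compute once the timestamps are recorded. A minimal sketch, assuming outages are simple dicts with an optional `acknowledged` timestamp (the field names are my own):

```python
from datetime import datetime, timedelta
from typing import Optional

def notification_delta(outage_start: datetime,
                       acknowledged: Optional[datetime]) -> Optional[timedelta]:
    """Delta between the actual outage and its acknowledgement; None if never acknowledged."""
    if acknowledged is None:
        return None
    return acknowledged - outage_start

def exception_report(outages: list) -> list:
    """Outages that were never acknowledged belong in an exception report."""
    return [o for o in outages if o.get("acknowledged") is None]
```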

The other misconception is that a monitoring tool will somehow miraculously prevent an outage. The reality is that 5h1t happens, and if your tool is only focused on prevention it will fail when you try to use it to cure. The value of a tool is in how it deals with an outage, because that is the major factor in service improvement. Optimal management of outages results in two trends:

  • the time period between outages increases
  • the length of the outage reduces

There is no value in your monitoring tool if it cannot influence these trends, and worse still if it cannot even report or measure them.
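Both trends can be checked mechanically from the outage history. The sketch below uses a deliberately crude test (comparing first-half and second-half averages of a chronological outage list); a real report would use a proper trend fit, so treat this as an assumption-laden illustration:

```python
from datetime import datetime

def _mean(xs):
    return sum(xs) / len(xs)

def outage_trends(outages):
    """outages: chronological list of (start, end) datetime pairs.
    Returns whether the two improvement trends hold: gaps between
    outages are growing, and outages are getting shorter."""
    gaps = [(outages[i + 1][0] - outages[i][1]).total_seconds()
            for i in range(len(outages) - 1)]
    lengths = [(end - start).total_seconds() for start, end in outages]
    g, l = len(gaps) // 2, len(lengths) // 2
    return {
        "time_between_outages_increasing": _mean(gaps[g:]) > _mean(gaps[:g]),
        "outage_length_decreasing": _mean(lengths[l:]) < _mean(lengths[:l]),
    }
```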

The next step is diagnosis. Monitoring diagnosis is not that difficult. What often happens is that an outage occurs; a techie diagnoses it and, after an investigation, comes up with a solution. Often this knowledge remains in the techie’s head and is not annotated in a separate knowledge base. If this was consistently done, the diagnosis time would reduce, and a tool that provided this ability would immediately have a justified return on investment.
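The knowledge base argued for here does not need to be elaborate. A minimal sketch, with symptom matching reduced to an exact lookup purely for illustration (a real system would do fuzzy or structured matching):

```python
# Symptom signature -> previously recorded resolution.
diagnosis_kb: dict = {}

def record_diagnosis(symptom: str, resolution: str) -> None:
    """Annotate the techie's finding so it survives beyond one head."""
    diagnosis_kb[symptom.strip().lower()] = resolution

def lookup_diagnosis(symptom: str):
    """Return a previously recorded resolution, or None on a first-time symptom."""
    return diagnosis_kb.get(symptom.strip().lower())
```

The return on investment comes from the second outage with the same symptom: the lookup replaces a repeat investigation.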

The repair, recovery and restore time periods are not difficult to measure if a network management product is service aware, as these metric times will be obvious.

The output required is an analysis of the consequences of downtime:

  • What is the total outage time, and what was the period of degradation (brown-outs versus black-outs)?
  • What is the deviation from the norm for this outage compared against the historical trend?
  • What are the average incident times for detection, diagnosis, repair, recovery and restore?
  • What are the top services affected by outages?
  • What are the top causes of outages?
  • What are the MTTR, MTBF and MTBSI?
  • What is the Manhattan analysis?

There are many possible causes of extended downtime. How does a RAG (Red-Amber-Green) tool assist in resolving:

  • Long detection times or even misses.
  • Inappropriate diagnostics.
  • Logistic issues delaying repair.
  • Slow recovery, like having to rebuild from scratch because there is no known last good configuration.
  • Slow return to service even though the device is recovered.
  • No workarounds being available or documented.

Network management needs to step up a gear from RAG to different shades of grey. At least fifty, according to my mommy porn friends!

This article was published over on LinkedIn: It's not about the uptime - time to throw away the RAGs

