
The importance of downtime in problem management

Juniper Networks wrote an interesting white paper titled "What's Behind Network Downtime? Proactive Steps to Reduce Human Error and Improve Availability of Networks". Downtime is the most important metric in problem management, as it feeds into the most crucial metric of all, TIME.
The paper states that, according to an Infonetics Research study, large businesses lose an average of 3.6 percent of annual revenue to network downtime. Another reference, this one based on Gartner research, is this article in Network World.
The Infonetics site has this release in 2006 about smaller businesses: "In a new study on network downtime, Infonetics Research found that medium businesses (101 to 1,000 employees) are losing an average of 1% of their annual revenue, or $867,000, to downtime. The study, The Costs of Downtime: North American Medium Businesses 2006, says that companies experience an average of nearly 140 hours of downtime every year, with 56% of that caused by pure outages."
When an outage occurs, it is possible to calculate the cost of the associated downtime. Here is a web-based downtime calculator.
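The arithmetic behind such a calculator is straightforward. Here is a minimal sketch; the figures used are illustrative assumptions, not data from the studies cited above, and the calculation assumes revenue accrues evenly across every hour of the year:

```python
def downtime_cost(annual_revenue, downtime_hours, hours_per_year=24 * 365):
    """Estimate revenue lost to downtime, assuming revenue accrues
    evenly across every hour of the year."""
    return annual_revenue * (downtime_hours / hours_per_year)

# Illustrative figures only: $86.7M annual revenue, 140 hours of downtime
print(round(downtime_cost(86_700_000, 140)))  # -> 1385616
```

Real calculators refine this with business-hours weighting, staff costs and recovery costs, but even this crude estimate makes the point that downtime hours translate directly into money.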
H.L. Mencken made the following statement: "There is always an easy solution to every human problem—neat, plausible, and wrong." The fundamental truth behind this statement is that once the problem has occurred, "die koeël is al deur die kerk" (literally, "the bullet is already through the church"): the action has happened and nothing can be done to change it. It is prudent to learn lessons from problems and to put mechanisms in place to prevent their recurrence. Fixing human problems with the same human solutions is insanity (Rita Mae Brown: "Insanity is doing the same thing, over and over again, but expecting different results"). Example: issuing disciplinary letters to data centre technicians for procedural faults caused by fatigue. Yes, the insanity is there!
The lesson is: crash responsibly. Coding Horror states: "I not only need to protect my users from my errors, I need to protect myself from my errors, too. That's why the first thing I do on any new project is set up an error handling framework. Errors are inevitable, but ignorance shouldn't be. If you know about the problems, you can fix them and respond to them." The error handling framework should be embedded not only in the software or code but throughout the whole system and solution. Many technology companies align themselves with the byline that they are not box droppers but a services company. The core characteristic of a services company is an error handling framework. Sadly, this is where all the 800-pound gorillas fall short!
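In code, the smallest possible version of "crash responsibly" is making sure no failure dies silently. A minimal sketch, assuming a plain Python process and a hypothetical `errors.log` file, is to install a global exception hook that records the crash before the process exits:

```python
import logging
import sys

# Route all serious errors to a persistent log file (name is an assumption)
logging.basicConfig(
    filename="errors.log",
    level=logging.ERROR,
    format="%(asctime)s %(levelname)s %(message)s",
)

def log_unhandled(exc_type, exc_value, exc_traceback):
    """Record any unhandled exception, then crash as normal: the error
    still happens, but it is never invisible."""
    logging.critical(
        "Unhandled exception",
        exc_info=(exc_type, exc_value, exc_traceback),
    )
    sys.__excepthook__(exc_type, exc_value, exc_traceback)

sys.excepthook = log_unhandled
```

The same principle scales up from a single process to a whole service: every component should report its failures somewhere a human (or monitoring system) will see them.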
One method for building this error handling framework into services is ITIL's expanded incident lifecycle. Straight from the book (ITIL v3, Continual Service Improvement): "(Availability Management) Detailed stages in the Lifecycle of an Incident. The stages are Detection, Diagnosis, Repair, Recovery, Restoration. The Expanded Incident Lifecycle is used to help understand all contributions to the Impact of Incidents and to Plan how these could be controlled or reduced."
The diligent recording of times during a major incident enables a company to identify causes that can be proactively addressed. Addressing them translates into reduced downtime, which equates to moolah. There are many possible causes of extended downtime, including:
  • Long detection times or even misses.
  • Inappropriate diagnostics.
  • Logistic issues delaying repair.
  • Slow recovery, such as having to rebuild from scratch because there is no known last good configuration.
  • Slow return to service even though the device is recovered.
  • No workarounds being available or documented.
Diligence around timings in the expanded incident lifecycle is of crucial importance in analysing downtime.
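As a sketch of what that diligent recording of times buys you, the timestamps of a single incident can be split into the expanded lifecycle stages and the worst contributor identified. The timestamps below are invented for illustration:

```python
from datetime import datetime

# Recorded timestamps for one incident (invented for illustration)
events = {
    "occurrence":  datetime(2023, 8, 12, 9, 0),
    "detection":   datetime(2023, 8, 12, 9, 40),   # slow detection
    "diagnosis":   datetime(2023, 8, 12, 10, 5),
    "repair":      datetime(2023, 8, 12, 11, 30),  # logistics delayed the fix
    "recovery":    datetime(2023, 8, 12, 11, 55),
    "restoration": datetime(2023, 8, 12, 12, 10),
}

# Each stage's duration is the gap between consecutive timestamps
names = list(events)
stages = {
    f"{a} -> {b}": (events[b] - events[a]).total_seconds() / 60
    for a, b in zip(names, names[1:])
}

# Print stages worst-first: repair dominates this incident's downtime
for stage, minutes in sorted(stages.items(), key=lambda kv: -kv[1]):
    print(f"{stage:26} {minutes:5.0f} min")
```

With these numbers the repair stage accounts for 85 of the 190 minutes of downtime, so the proactive fix (spares on site, better logistics) is obvious. Without the timestamps, all you know is that the incident "took about three hours".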

Read further about the importance of investigating downtime in this article.

