The importance of downtime in problem management
A 2006 release on the Infonetics site looks at downtime in medium-sized businesses: "In a new study on network downtime, Infonetics Research found that medium businesses (101 to 1,000 employees) are losing an average of 1% of their annual revenue, or $867,000, to downtime. The study, The Costs of Downtime: North American Medium Businesses 2006, says that companies experience an average of nearly 140 hours of downtime every year, with 56% of that caused by pure outages."
When an outage occurs, it is possible to calculate the cost of the associated downtime, and web-based downtime calculators exist for exactly this purpose.
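As a rough illustration of what such a calculator does, here is a minimal sketch in Python. The function name, its parameters, and the split into lost revenue plus lost staff productivity are assumptions made for this example, not the method of any particular calculator. The only figures taken from the text are the $867,000 and 140 hours quoted from the Infonetics release.

```python
def downtime_cost(outage_hours, revenue_per_hour,
                  affected_staff=0, loaded_hourly_wage=0.0,
                  productivity_loss=1.0):
    """Estimate the cost of an outage (hypothetical model).

    lost revenue      = hours * revenue normally earned per hour
    lost productivity = hours * staff affected * hourly wage * fraction of work lost
    """
    lost_revenue = outage_hours * revenue_per_hour
    lost_productivity = (outage_hours * affected_staff
                         * loaded_hourly_wage * productivity_loss)
    return lost_revenue + lost_productivity


# Average hourly cost implied by the Infonetics figures quoted above:
# $867,000 lost over roughly 140 hours of downtime per year.
implied_hourly_cost = 867_000 / 140

# Illustrative inputs only: a 2-hour outage, $1,000 revenue per hour,
# 10 staff at $50 per hour who lose half of their productivity.
example_cost = downtime_cost(2, 1_000, affected_staff=10,
                             loaded_hourly_wage=50.0, productivity_loss=0.5)
```

Even a crude model like this makes the point: once an hourly figure is attached to an outage, every hour shaved off the incident has a visible value.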
The lesson is: crash responsibly. Coding Horror states: "I not only need to protect my users from my errors, I need to protect myself from my errors, too. That's why the first thing I do on any new project is set up an error handling framework. Errors are inevitable, but ignorance shouldn't be. If you know about the problems, you can fix them and respond to them." The error handling framework should be embedded not only in the software or code but throughout the whole system and solution. Many technology companies align themselves with the byline that they are not box droppers but services companies. The core characteristic of a services company is an error handling framework. Sadly, this is where all the 800-pound gorillas fall short!
One method for this error handling framework in services is ITIL's expanded incident lifecycle. Straight from the book, ITIL v3, Continual Service Improvement: "(Availability Management) Detailed stages in the Lifecycle of an Incident. The stages are Detection, Diagnosis, Repair, Recovery, Restoration. The Expanded Incident Lifecycle is used to help understand all contributions to the Impact of Incidents and to Plan how these could be controlled or reduced."
Diligently recording times during a major incident enables a company to identify causes that can be addressed proactively. Addressing them translates into reduced downtime, which equates to moolah. There are many possible causes of extended downtime, including:
- Long detection times or even misses.
- Inappropriate diagnostics.
- Logistic issues delaying repair.
- Slow recovery, like having to rebuild from scratch as there is no known last good configuration.
- Slow return to service even though the device is recovered.
- No workarounds being available or documented.
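To show how recorded times expose these causes, here is a minimal sketch in Python. The stage names come from the ITIL quote above. The function, the data layout, and the timestamps are all hypothetical and chosen only for illustration.

```python
from datetime import datetime

# Stages of the expanded incident lifecycle, in order, as named in the ITIL quote.
STAGES = ["detection", "diagnosis", "repair", "recovery", "restoration"]


def stage_durations(occurred, stage_end_times):
    """Return the minutes spent in each stage.

    occurred        -- datetime at which the incident began
    stage_end_times -- dict mapping each stage name to the datetime it finished
    """
    durations = {}
    previous = occurred
    for stage in STAGES:
        finished = stage_end_times[stage]
        if finished < previous:
            raise ValueError(f"{stage} finished before the previous stage ended")
        durations[stage] = (finished - previous).total_seconds() / 60
        previous = finished
    return durations


# Invented timestamps for a single major incident.
occurred = datetime(2024, 1, 15, 9, 0)
recorded = {
    "detection": datetime(2024, 1, 15, 9, 45),    # nobody noticed for 45 minutes
    "diagnosis": datetime(2024, 1, 15, 10, 15),
    "repair": datetime(2024, 1, 15, 11, 0),
    "recovery": datetime(2024, 1, 15, 13, 0),     # rebuild from scratch
    "restoration": datetime(2024, 1, 15, 13, 20),
}

durations = stage_durations(occurred, recorded)
slowest_stage = max(durations, key=durations.get)
total_minutes = sum(durations.values())
```

In this made-up incident the recovery stage dominates, which would point towards the missing last good configuration rather than towards detection or diagnosis.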
Read further about the importance of investigating downtime in this article here.