
Grading the resources involved in major incidents

It is recommended that the resources involved in handling a major incident are graded as part of a continuous improvement program. What follows is one means of doing that grading.
The maximum possible score is 32. The grading is calculated by totalling the scores from the eight areas and expressing that total as a percentage of the maximum (a short worked sketch follows the list below). The eight areas are:
  • Identification and business impact – have the resources correctly identified the major incident and described what happened in the appropriate level of detail? Has the affected service been correctly identified from the service catalogue? Was the business impact obtained or measured?
  • Conditions – what were the business, IT or environmental conditions present during the incident, and did the resources describe these in suitable detail?
  • Expanded incident lifecycle – are all the times in the expanded incident lifecycle recorded, and are they realistic? Were they recorded against the incident reference at the service desk?
  • Resolution/workaround – how suitable was the resolution, and was a workaround implemented to reduce the time the service was unavailable?
  • Classification – have the resources correctly classified the impact to the company, and was the incident handled with the correct level of prioritization?
  • Outage – have the resources recorded and classified the outage times correctly?
  • Risk – has a suitable risk assessment of the service, asset and process been conducted?
  • Escalations/communications – did the resources escalate the incident appropriately, and was communication during the process suitable?
Each area scores a maximum of 4 points with a minimum of 0.
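
To make the arithmetic concrete, here is a minimal sketch in Python of the grading calculation described above. Only the eight areas, the 0 to 4 scale per area and the maximum of 32 come from the scheme itself; the function name, validation and example scores are illustrative assumptions, not part of any formal MIP tooling.

```python
# Minimal sketch of the major incident grading calculation.
# Assumption: scores are supplied as a mapping of area name -> points (0-4).

AREAS = [
    "Identification and business impact",
    "Conditions",
    "Expanded incident lifecycle",
    "Resolution/workaround",
    "Classification",
    "Outage",
    "Risk",
    "Escalations/communications",
]

MAX_PER_AREA = 4                       # each area is scored 0-4
MAX_TOTAL = MAX_PER_AREA * len(AREAS)  # 8 areas x 4 points = 32

def grade(scores: dict[str, int]) -> float:
    """Return the grading as a percentage of the maximum possible score (32)."""
    for area, score in scores.items():
        if area not in AREAS:
            raise ValueError(f"Unknown area: {area}")
        if not 0 <= score <= MAX_PER_AREA:
            raise ValueError(f"Score for '{area}' must be between 0 and {MAX_PER_AREA}")
    return 100.0 * sum(scores.values()) / MAX_TOTAL

if __name__ == "__main__":
    # Example: scoring 3 in every area gives 24/32 = 75%.
    example = {area: 3 for area in AREAS}
    print(f"Grading: {grade(example):.1f}%")
```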

Read further about the MIP process here.
