
The methodology is as follows:
- Construct a Tiger Team. A Tiger Team is an expert problem-solving team. Checkpoint: Do we have all the right skills to deal with the problem? The Tiger Team has no hierarchy (all members are equal)!
- Do a basic risk assessment. Identify at a high level which of the following entities are involved in the problem: people, process, partners and/or products. It is important to document the visible and/or immediate causes. Also document the exact conditions (environment) in which the problem occurred. Once the entities are listed, conduct a suitable risk assessment on them using a methodology like CRAMM. Checkpoint: Is it worth moving on? Are the risks low and mitigated? Is there justification for investing further time and resources, or should the problem be documented and parked?
- Do you have a checklist that covers the area in which this problem has occurred? Conduct an investigation using the checklist. Note any positive hits on the checklist. If there are no hits, then note the five most likely candidates on the checklist, based on lessons learned. An alternative is to query a knowledge base. A large proportion of problems are not unique: they have occurred before and have been documented. If a knowledge base does not exist, it is even suitable to type a description of the problem into Google and review the search results! Both Cisco and Microsoft have technical knowledge bases. Checkpoint: Has the issue been concluded with a positive hit on the checklist, or is there a requirement for further root cause analysis?
- We now have a small sample of potential causes but need to expand the list. However, some groundwork needs to occur first. We need to create a full and detailed inventory of all people, partners, processes and products. Take the high-level list that was created during the risk assessment and expand it in as much detail as possible. As described in the methods of Roald Amundsen, it is important that the correct equipment and tools exist and that they have been tested. As part of this process, a full set of diagrams of the setup involved in the problem needs to be made available or else created. It serves no purpose to roll out a tool for the first time while dealing with a problem. Familiarity with the tool should already exist before it is used (which is the method Amundsen used). Although problems are not always network related, the network is a good place to continue the investigation. The tool sets available are network management tools like NeDi or ntop. These tools use underlying technologies like ITU-T Y.1731 performance monitoring, IP SLA or NetFlow to highlight areas of potential causes. This will either confirm a network-related issue or discount connectivity problems as a cause. Note any issues detected that could be a secondary influence on the problem. Similar tool sets exist outside the network; as an example, Microsoft provides an excellent set of tools for Exchange. Checkpoint: Has a network-related issue been discovered, and is it resolvable?
- We know what is involved from the above steps, so now we need to dig deeper! The items related to the problem should be recorded and available in a CMDB. By interrogating this CMDB we should be able to extract a list of related incidents (especially major incidents), work requests, changes and associated problems. Note down a short list of about five of each of these types. Investigate whether any item in the list is directly related to the problem. Focus especially on recent changes, as these are often candidate causes. Try to determine whether there are multiple failures related to a specific component. Is there a reliability issue with that component? Checkpoint: Has a change, related request or incident been the cause of the problem? Is the problem related to component failure?
- By now the problem is becoming more difficult and we need to find the patterns and break the code, as Turing (a master at breaking codes) would have done. Investigate the versions of software and hardware being used (this information should be recorded in the CMDB). Review the release notes for the latest hardware and software versions. Investigate what software upgrades, patches and bug fixes have been applied. Often a fix for one problem causes another. Note down any deviations and any references to issues that match the problem. Checkpoint: Is the problem related to a change in revisions?
- Create accurate timelines! It is useful to work backwards from the problem, and the way to do this is to create a timeline. Timelines are often used in FTA (fault tree analysis). Checkpoint: Is the problem time-dependent?
- Turn the focus on production. Review the SOPs (Standard Operating Procedures) for the services that are involved in this problem. Perform a gap analysis between what is happening in the LIVE production environment and what has been documented in the SOP. Note these differences. Checkpoint: Are these deviations from the SOP a cause of the problem? Is the SOP itself incorrect?
- When you come to this step, the problem is a real dog. Investigate the people aspect of the problem from both a vendor/supplier and a customer/user view. Talk to the vendors and users about the problem. List and document any of the following issues: perception, interpretation, decision-making (knowledge-based, rule-based), and execution (skills). Checkpoint: Are we dealing with human error?
- Time to prioritize (Pareto). We have either sufficiently determined the cause or have a list of potential causes. The causes include items from the checklists, knowledge base, network issues, change issues, component failures, release issues, SOP gaps, and human errors. Brainstorm this list and rank it from most likely cause to least likely cause. Following the Pareto principle, concentrate the analysis on the top 20% of this list. Don't discount the rest of the list, but use it as a stimulant to the analysis should the process come to a dead end. Checkpoint: Has the needle in the haystack been discovered? If not, don't throw in the towel: the process can be repeated at a later stage with more data and information, as time is not an issue.
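
As an illustration of the timeline step, here is a minimal Python sketch of working backwards from the problem. The `Event` record, the `timeline_before` helper and the sample data are all hypothetical; in practice the events would come from whatever export your CMDB offers.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Event:
    # Hypothetical record; real fields depend on your CMDB.
    when: datetime
    kind: str      # e.g. "change", "incident", "work request"
    summary: str


def timeline_before(events, problem_time, window):
    """Return events inside the window before the problem, most recent first."""
    start = problem_time - window
    in_window = [e for e in events if start <= e.when <= problem_time]
    return sorted(in_window, key=lambda e: e.when, reverse=True)


# Made-up sample data for illustration only.
events = [
    Event(datetime(2024, 3, 1, 9, 0), "change", "Firmware upgrade on core switch"),
    Event(datetime(2024, 3, 4, 14, 30), "incident", "Slow logins reported"),
    Event(datetime(2024, 3, 5, 8, 15), "change", "Patch applied to mail server"),
    Event(datetime(2024, 2, 1, 10, 0), "work request", "New VLAN provisioned"),
]
problem_time = datetime(2024, 3, 5, 11, 0)

# Walk backwards over the week leading up to the problem.
recent = timeline_before(events, problem_time, timedelta(days=7))
for e in recent:
    print(f"{problem_time - e.when} before problem: [{e.kind}] {e.summary}")
```

Listing the most recent events first puts new changes at the top, which matches the advice to treat recent changes as candidate causes.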
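
The prioritization step can be sketched in a few lines of Python. This is only an illustration: the `pareto_split` helper, the candidate causes and their likelihood scores are invented, and the scores stand in for whatever ranking the Tiger Team agrees on while brainstorming.

```python
import math


def pareto_split(candidates, fraction=0.2):
    """Rank (cause, score) pairs and split off the top fraction for analysis."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    # Always keep at least one cause in focus, even for a very short list.
    cutoff = max(1, math.ceil(len(ranked) * fraction))
    return ranked[:cutoff], ranked[cutoff:]


# Hypothetical candidate causes with brainstormed likelihood scores (higher = more likely).
candidates = [
    ("Recent firmware change", 8),
    ("SOP gap in backup procedure", 3),
    ("Known bug in release notes", 6),
    ("Human error during execution", 2),
    ("Network latency", 5),
    ("Component reliability", 4),
    ("Checklist hit: expired certificate", 7),
    ("Knowledge base match", 1),
    ("Vendor misconfiguration", 3),
    ("Environmental condition", 2),
]

focus, reserve = pareto_split(candidates)
print("Analyse first:", [name for name, _ in focus])
print("Keep in reserve:", [name for name, _ in reserve])
```

The reserve list is kept rather than thrown away, so it can feed the analysis again if the top causes lead to a dead end.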