
BMX - a problem management methodology

BMX is a draft problem management methodology that I am devising. It has steps that are similar to, and overlap with, basic project management. Read about a basic project methodology I proposed here.
The methodology is as follows:
  1. Construct a Tiger Team. A Tiger Team is an expert problem-solving team. Checkpoint: Do we have all the right skills to deal with the problem? The Tiger Team does not have a hierarchy (all are equal)!
  2. Do a basic risk assessment. Identify at a high level which of the following entities are involved in the problem: people, processes, partners and/or products. It is important to document the visible and/or immediate causes. Also document the exact conditions (environment) in which the problem occurred. After having listed the entities above, conduct a suitable risk assessment on them using a methodology like CRAMM. Checkpoint: Is it worth moving on? Are the risks low and mitigated? Is there justification for investing further time and resources, or should the problem be documented and parked?
  3. Do you have a checklist that covers the area in which this problem has occurred? Conduct an investigation using the checklist and note any positive hits. If there are no hits, note the five most likely candidates on the checklist based on lessons learned. Another alternative is to query a knowledge base. A large proportion of problems are not unique; they have occurred before and have been documented. If a knowledge base does not exist, it is even suitable to type a description of the problem into Google and review the search results! Both Cisco and Microsoft have technical knowledge bases. Checkpoint: Has the issue been concluded with a positive hit on the checklist, or is there a requirement for further root cause analysis?
  4. We now have a small sample of potential causes, but need to expand the list. First, some groundwork needs to occur. Create a full and detailed inventory of all people, partners, processes and products: take the high-level list created in step 2 and expand it in as much detail as possible. As part of this process, a full set of diagrams of the setup involved in the problem needs to be made available, or else created. As the methods of Roald Amundsen demonstrated, it is important that the correct equipment and tools exist and have been tested. It serves no purpose to roll out a tool for the first time while dealing with a problem; familiarity with the tool should already exist before it is used (which is how Amundsen operated). Although problems are not always network related, the network is a good place to further investigations. The tool sets available are network management tools like NeDi or ntop, which use underlying technologies like ITU-T Y.1731 Performance Monitoring, IP SLA or NetFlow to highlight areas of potential causes. This will either confirm a network related issue or discount connectivity problems as a cause. Note any detected issues that could be a secondary influence on the problem. As an example, Microsoft provides an excellent set of tools for Exchange. Checkpoint: Has a network related issue been discovered, and is it resolvable?
  5. We know what is involved from the above steps, so now we need to dig deeper! The items related to the problem should be recorded and available in a CMDB. By interrogating this CMDB we should be able to extract a list of related incidents (especially major incidents), work requests, changes and associated problems. Note down a short list of about five of each of the above types. Investigate whether there is an item in the list that is directly related to the problem. Especially focus on new changes, as these are often candidate causes. Try to determine whether there are multiple failures related to a specific component. Is there a reliability issue related to those components? Checkpoint: Has a change or related request or incident been the cause of the problem? Is the problem related to component failure?
  6. By now the problem is becoming more difficult, and we need to find the patterns and break the code, as Turing (a master at breaking codes) would have done. Investigate the versions of software and hardware being used (this information should be recorded in the CMDB). Review the release notes for the latest hardware and software versions. Investigate which software upgrades, patches and bug fixes have been applied. Often a fix for one problem causes another. Note down any deviations and references to issues that match the problem. Checkpoint: Is the problem related to a change in revisions?
  7. Create accurate timelines! It is useful to work backwards from the problem, and the way to do this is to create a timeline. Timelines are often used in FTA (fault tree analysis). Checkpoint: Is the problem time dependent?
  8. Turn the focus on production. Review the SOPs (Standard Operating Procedures) for the services involved in this problem. Perform a gap analysis between what is happening in the LIVE production environment and what has been documented in the SOP. Note these differences. Checkpoint: Are the deviations from the SOP a cause of the problem? Is the SOP itself incorrect?
  9. When you come to this step, the problem is a real dog. Investigate the people aspect of the problem from both a vendor/supplier and a customer/user view. Talk to the vendors and users about the problem. List and document any of the following issues: perception, interpretation, decision-making (knowledge-based, rule-based), and execution (skills). Checkpoint: Are we dealing with human error?
  10. Time to prioritize (Pareto). We have either sufficiently determined the cause or have a list of potential causes. The causes include items from the checklists, knowledge base, network issues, change issues, component failures, release issues, SOP gaps, and human errors. Brainstorm this list and produce a ranking from most likely cause to least likely cause. Following the Pareto principle, concentrate the analysis of causes on the top 20% of this list. Don't discount the rest of the list, but use it as a stimulant to the analysis process should the process come to a dead end. Checkpoint: Has the needle in the haystack been discovered? If not, don't throw in the towel, as the process can be repeated at a later stage with more data and information; time is not an issue.
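The risk assessment checkpoint in step 2 can be sketched in a few lines. This is a minimal likelihood-times-impact score over the four entity types; the 1-to-5 scales and the threshold are illustrative assumptions and not part of CRAMM itself:

```python
# Illustrative risk scoring for step 2. The scales (1-5) and the
# threshold value are assumptions chosen for the sketch.

def risk_score(likelihood: int, impact: int) -> int:
    """Score on a 1-5 likelihood x 1-5 impact scale; higher is riskier."""
    return likelihood * impact

def worth_pursuing(entities: dict, threshold: int = 9) -> bool:
    """Checkpoint: is it worth moving on? True if any entity exceeds the threshold."""
    return any(risk_score(l, i) > threshold for l, i in entities.values())

# (likelihood, impact) per entity type from the high-level list
entities = {
    "people":   (2, 3),
    "process":  (4, 4),
    "partners": (1, 2),
    "products": (3, 5),
}
```

If no entity clears the threshold, the problem is documented and parked, exactly as the checkpoint suggests.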
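The knowledge base query in step 3 amounts to ranking past entries by how well they match the problem description. A crude sketch, assuming the lessons learned are held as plain-text entries (real knowledge bases such as Cisco's or Microsoft's would be searched through their own interfaces):

```python
# Rank knowledge base entries by shared keywords with the problem
# description and return the five most likely hits, as step 3 suggests.
# The entries below are invented examples.

def top_hits(problem: str, knowledge_base: list, n: int = 5) -> list:
    words = set(problem.lower().split())
    scored = [(len(words & set(entry.lower().split())), entry)
              for entry in knowledge_base]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for score, entry in scored[:n] if score > 0]

kb = [
    "mail queue full on gateway",
    "disk failure on server",
    "dns resolution slow",
]
```

A hit here closes the issue at the checkpoint; no hits means moving on to deeper root cause analysis.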
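Step 5's CMDB interrogation is essentially a filter-and-sort: pull the records touching the affected configuration item and keep the five most recent. The record layout and field names below are assumptions for the sketch; a real CMDB would be queried through its API or SQL:

```python
# Stand-in for a CMDB change table. Field names and records are
# invented for illustration.
from datetime import date

changes = [
    {"id": "CHG-101", "ci": "mail-gw-01", "date": date(2024, 8, 10)},
    {"id": "CHG-102", "ci": "core-sw-02", "date": date(2024, 8, 11)},
    {"id": "CHG-103", "ci": "mail-gw-01", "date": date(2024, 8, 12)},
]

def recent_related(records: list, ci: str, n: int = 5) -> list:
    """Short list of the n most recent records touching one configuration item."""
    related = [r for r in records if r["ci"] == ci]
    related.sort(key=lambda r: r["date"], reverse=True)
    return related[:n]
```

The same filter applies to incidents, work requests and associated problems, giving the four short lists the step asks for.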
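The revision check in step 6 is a comparison between what the CMDB says is installed and the latest released versions. A minimal sketch, assuming simple dotted version strings (the product names and versions are invented):

```python
# Flag items whose installed version lags the latest release, per step 6.
# Assumes plain dotted numeric versions; real version schemes vary.

def version_tuple(v: str) -> tuple:
    return tuple(int(part) for part in v.split("."))

def deviations(installed: dict, latest: dict) -> dict:
    """Map each lagging item to its (installed, latest) version pair."""
    return {name: (ver, latest[name])
            for name, ver in installed.items()
            if name in latest and version_tuple(ver) < version_tuple(latest[name])}

installed = {"router-os": "15.2.4", "mail-server": "15.1.2"}
latest    = {"router-os": "15.2.7", "mail-server": "15.1.2"}
```

Each deviation flagged is a prompt to read the intervening release notes for fixes (or regressions) that match the problem.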
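Step 7's advice to work backwards from the problem can be mechanised by sorting recorded events in reverse chronological order, as one would when laying out an FTA timeline. The events below are invented for illustration:

```python
# Build a backwards timeline for step 7: newest event first, so the
# walk starts at the failure and moves towards candidate causes.
from datetime import datetime

events = [
    ("2024-08-12 09:03", "users report slow mail delivery"),
    ("2024-08-12 02:15", "nightly patch applied to mail gateway"),
    ("2024-08-11 17:40", "queue length alarm cleared"),
]

def timeline_backwards(evts: list) -> list:
    parsed = sorted(
        ((datetime.strptime(ts, "%Y-%m-%d %H:%M"), desc) for ts, desc in evts),
        reverse=True,
    )
    return [desc for _, desc in parsed]
```

In this invented log, the step immediately preceding the failure is the nightly patch, which is exactly the kind of time dependency the checkpoint asks about.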
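Finally, the Pareto cut in step 10 can be sketched directly: rank the candidate causes and keep the top 20% for focused analysis. The likelihood weights below are invented purely to show the mechanics:

```python
# Step 10: keep the top 20% of candidate causes (Pareto principle).
# The causes and their estimated likelihoods are illustrative only.

def pareto_top(causes: dict, fraction: float = 0.2) -> list:
    """Return causes ranked by likelihood, cut to the top fraction (at least one)."""
    ranked = sorted(causes, key=causes.get, reverse=True)
    keep = max(1, round(len(ranked) * fraction))
    return ranked[:keep]

causes = {
    "recent change": 0.35, "component failure": 0.25, "SOP gap": 0.15,
    "release bug": 0.10, "human error": 0.08, "network": 0.04,
    "partner outage": 0.02, "environment": 0.005, "documentation": 0.004,
    "unknown": 0.001,
}
```

The remaining 80% is not discarded; it stays on the list as the stimulant the step describes, ready for a repeat pass with more data.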

