Root cause analysis used by NASA/JPL

Always fixing a problem without ever determining why it happened has negligible long-term benefit, because recurrence is bound to result. Root Cause Analysis (RCA) is a structured evaluation method that identifies the root causes of an undesired outcome and the actions adequate to prevent recurrence. RCA should continue until organizational factors have been identified or until the data are exhausted. It allows learning from past problems, failures, and accidents, and it helps professionals determine what happened, how it happened, and why it happened.
When performing an investigation, it is necessary to look beyond the immediately visible cause, which is often the proximate cause. (Proximate cause: the event(s) that occurred, including any condition(s) that existed immediately before the undesired outcome, that directly resulted in its occurrence and that, if eliminated or modified, would have prevented the undesired outcome.)
There are also underlying organizational causes that are more difficult to see. They may contribute significantly to the undesired outcome and, if not corrected, will continue to create similar types of problems. These are root causes. (Root cause: one of multiple factors (events, conditions, or organizational factors) that contributed to or created the proximate cause and subsequent undesired outcome and that, if eliminated or modified, would have prevented the undesired outcome. Typically, multiple root causes contribute to an undesired outcome.)
Requirements for mishap reporting and investigation state that every mishap investigation must identify the proximate cause(s), root cause(s), and contributing factor(s). (Organizational factor: any operational or management structural entity that exerts control over the system at any stage in its life cycle, including but not limited to the system’s concept development, design, fabrication, test, maintenance, operation, and disposal.)
Steps:
  1. Identify and clearly define the undesired outcome (outage).
  2. Gather data. (Identify facts surrounding the undesired outcome.)
    • When did the undesired outcome occur?
    • Where did it occur?
    • What conditions were present prior to its occurrence?
    • What controls or barriers could have prevented its occurrence but did not?
    • What are all the potential causes?
    • What actions can prevent recurrence?
    • What amelioration occurred? Did it prevent further damage?
  3. Create a timeline.
  4. Place events and conditions on an event and causal factor tree.
  5. Use a fault tree or other method/tool to identify all potential causes.
  6. Decompose system failures down to basic events or conditions (further describe what happened).
  7. Identify specific failure modes (immediate causes).
  8. Continue asking “WHY” to identify root causes.
  9. Check your logic and your facts. Eliminate items that are not causes or contributing factors.
  10. Generate solutions that address both proximate causes and root causes.
Timeline (sequence diagram):
Illustrate the sequence of events in chronological order horizontally across the page. Depict relationships between conditions, events, and exceeded or failed barriers/controls.
If amelioration occurred (e.g., rebooting a server or moving an application to another server), include it in the evaluation to ensure that it did not contribute to the undesired outcome. Example: In the case of a server reboot, the investigation should ensure that the reboot was the result of the mishap and not a result of latent hardware defects.
Example: Simple timeline
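A timeline like this can also be kept as data before it is drawn. The following is only a minimal sketch in Python: the `TimelineEntry` type, the four `kind` labels, and the outage entries are all hypothetical and invented for illustration. It sorts the entries chronologically and renders them left to right, with the amelioration step (the reboot) kept on the timeline so that it is evaluated too.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TimelineEntry:
    when: datetime
    kind: str  # assumed labels: "event", "condition", "barrier", "amelioration"
    text: str


def build_timeline(entries):
    """Return the entries in chronological order."""
    return sorted(entries, key=lambda e: e.when)


def render(entries):
    """Render the ordered entries left to right, like a horizontal timeline."""
    ordered = build_timeline(entries)
    return " -> ".join(f"[{e.when:%H:%M} {e.kind}: {e.text}]" for e in ordered)


# Hypothetical outage data, deliberately entered out of order.
entries = [
    TimelineEntry(datetime(2024, 1, 8, 9, 20), "amelioration", "server rebooted"),
    TimelineEntry(datetime(2024, 1, 8, 9, 0), "condition", "disk 95% full"),
    TimelineEntry(datetime(2024, 1, 8, 9, 10), "barrier", "disk alert did not fire"),
    TimelineEntry(datetime(2024, 1, 8, 9, 15), "event", "application stopped responding"),
]

timeline_text = render(entries)
print(timeline_text)
```

Keeping the failed barrier and the amelioration as separate kinds makes it harder to overlook either one when the tree is built later.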
Create an event and causal factor tree:
An event and causal factor tree is a visual representation of the causes that led to the failure or mishap. Place the undesired outcome at the top of the tree. Add all events, conditions, and exceeded/failed barriers that occurred immediately before the undesired outcome and might have caused it.
Brainstorm to ensure that all possible causes are included, NOT just those that you are sure are involved. Be sure to consider people, hardware, software, policy, procedures, and the environment.
If you have solid data indicating that one of the possible causes is not applicable, it can be eliminated from the tree. (Caution: Do not be too eager to eliminate early on. If there is a possibility that it is a causal factor, leave it and eliminate it later when more information is available.)
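One way to keep the brainstorm honest is to record each candidate cause with its category and check which categories have not been considered yet. The sketch below is an illustration only: the `Candidate` and `Tree` types and the email outage example are hypothetical, and the category names are simply the six listed above.

```python
from dataclasses import dataclass, field

# The six areas named in the text above.
CATEGORIES = ("people", "hardware", "software", "policy", "procedures", "environment")


@dataclass
class Candidate:
    text: str
    category: str
    eliminated: bool = False  # set only when solid data rules the cause out


@dataclass
class Tree:
    outcome: str  # the undesired outcome at the top of the tree
    candidates: list = field(default_factory=list)

    def missing_categories(self):
        """Categories for which no candidate cause has been brainstormed yet."""
        seen = {c.category for c in self.candidates}
        return [cat for cat in CATEGORIES if cat not in seen]

    def active(self):
        """Candidate causes that have not been eliminated."""
        return [c.text for c in self.candidates if not c.eliminated]


# Hypothetical example.
tree = Tree(
    "email service outage",
    [
        Candidate("mail server disk failed", "hardware"),
        Candidate("bad configuration pushed", "software"),
        Candidate("operator skipped checklist", "people"),
    ],
)
print(tree.missing_categories())
```

Eliminated candidates stay in the list with a flag rather than being deleted, which matches the caution about not eliminating too early.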

You may use a fault tree to determine all potential causes and to decompose the failure down to the “basic event” (e.g., system component level). A fault tree can also be used to identify all possible types of human failures.
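As a rough illustration of decomposing down to basic events, here is a minimal fault tree sketch in Python. It assumes the conventional AND/OR gate notation, which the text above does not spell out, and the `Node` type and the outage example are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    gate: str = "BASIC"  # assumed notation: "AND", "OR", or "BASIC" for a leaf
    children: list = field(default_factory=list)


def basic_events(node):
    """List the leaves of the tree, i.e. the basic events."""
    if not node.children:
        return [node.name]
    found = []
    for child in node.children:
        found.extend(basic_events(child))
    return found


def occurs(node, observed):
    """Would this node's failure occur, given the set of observed basic events?"""
    if not node.children:
        return node.name in observed
    results = [occurs(child, observed) for child in node.children]
    return all(results) if node.gate == "AND" else any(results)


# Hypothetical example: the database is lost only if the disk fails AND failover fails.
top = Node("service outage", "OR", [
    Node("database unavailable", "AND", [
        Node("primary disk failed"),
        Node("failover did not trigger"),
    ]),
    Node("operator deployed wrong config"),
])

print(basic_events(top))
print(occurs(top, {"primary disk failed"}))
```

The AND gate is what shows that a failed barrier (here, the failover) is a cause in its own right and not just background.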

After you have identified all the possible causes, ask yourself “WHY” each may have occurred. Be sure to keep your questions focused on the original issue. For example: “Why was the condition present?”, “Why did the event occur?”, “Why was the parameter exceeded?”, or “Why did the condition fail?”

Continue to ask “why” until you have reached:
  • Root cause(s) - including all organizational factors that exert control over the design, fabrication, development, maintenance, operation, and disposal of the system.
  • A problem that is not correctable by IT or the IT contractor.
  • Insufficient data to continue.
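The three stopping conditions above can be sketched as a small loop. This is an illustration under stated assumptions, not a prescribed procedure: the `ask_why` function, the `Stop` labels, and the example answers are hypothetical, and each "why" is assumed to have a single recorded answer.

```python
from enum import Enum


class Stop(Enum):
    ROOT_CAUSE = "root cause (organizational factor) reached"
    NOT_CORRECTABLE = "not correctable by IT or the IT contractor"
    NO_DATA = "insufficient data to continue"


def ask_why(start, answers, organizational, not_correctable):
    """Follow the recorded 'why' answers from start until a stop condition is met."""
    chain = [start]
    current = start
    while True:
        if current in organizational:
            return chain, Stop.ROOT_CAUSE
        if current in not_correctable:
            return chain, Stop.NOT_CORRECTABLE
        nxt = answers.get(current)
        # No recorded answer, or an answer that loops back, adds no new data.
        if nxt is None or nxt in chain:
            return chain, Stop.NO_DATA
        chain.append(nxt)
        current = nxt


# Hypothetical example: each key is a cause, each value is the answer to "why?".
answers = {
    "application stopped responding": "disk filled up",
    "disk filled up": "log rotation was disabled",
    "log rotation was disabled": "no change review for configuration edits",
}
organizational = {"no change review for configuration edits"}

chain, reason = ask_why("application stopped responding", answers, organizational, set())
print(len(chain), reason.value)
```

Recording the stop reason alongside the chain makes it visible when an investigation ended because data ran out, not because a root cause was found.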
Verify that each item is a contributor or cause. If the action, deficiency, or decision in question were corrected, eliminated, or avoided, would the undesired outcome be prevented or avoided? If not, eliminate it from the tree.
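The verification question can be applied mechanically once an investigator has answered it for every item. The sketch below assumes those answers are stored as a boolean on each item, and it makes one simplification: when an item is eliminated, everything beneath it goes too. The `Item` type and the example are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    text: str
    # The investigator's answer: would correcting this have prevented the outcome?
    prevents_outcome_if_corrected: bool = True
    children: list = field(default_factory=list)


def prune(item):
    """Return a copy of the tree without items that fail the verification question."""
    kept = [prune(child) for child in item.children if child.prevents_outcome_if_corrected]
    return Item(item.text, item.prevents_outcome_if_corrected, kept)


# Hypothetical example.
top = Item("service outage", True, [
    Item("disk filled up", True, [Item("log rotation was disabled", True)]),
    Item("office network was slow", False),
])

pruned = prune(top)
print([child.text for child in pruned.children])
```

Because `prune` returns a copy, the original tree is still available if a discarded item later needs to be shown as a contributing factor.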
The remaining items on the tree are the causes (or probable causes) necessary to produce the undesired outcome. Proximate causes are those immediately before the undesired outcome. Intermediate causes are those between the proximate and root causes. Root causes are organizational factors or systemic problems located at the bottom of the tree.
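The naming of the levels follows from position in the tree, which the short sketch below illustrates. It assumes the tree has already been verified, that every bottom item really is an organizational or systemic problem, and that an item directly under the outcome is labelled proximate even if nothing sits below it. The nested tuple layout and the example are hypothetical.

```python
def classify(tree, depth=0, labels=None):
    """Label each item by its position; tree is a (text, [children]) tuple."""
    if labels is None:
        labels = {}
    text, children = tree
    if depth == 0:
        labels[text] = "undesired outcome"
    elif depth == 1:
        labels[text] = "proximate cause"
    elif not children:
        labels[text] = "root cause"  # assumes the bottom item is organizational
    else:
        labels[text] = "intermediate cause"
    for child in children:
        classify(child, depth + 1, labels)
    return labels


# Hypothetical example with one item per level.
tree = ("service outage", [
    ("application stopped responding", [
        ("disk filled up", [
            ("no change review for configuration edits", []),
        ]),
    ]),
])

labels = classify(tree)
for text, label in labels.items():
    print(f"{label}: {text}")
```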
Some people choose to leave contributing factors on the tree to show all factors that influenced the event. (Contributing factor: an event or condition that may have contributed to the occurrence of an undesired outcome but, if eliminated or modified, would not by itself have prevented the occurrence.) If contributing factors are left on the tree, illustrate them differently (e.g., dotted line boxes and arrows) so that it is clear that they are not causes.
