Root cause analysis used by NASA/JPL
Always fixing a problem without ever determining why it happened has negligible long-term benefit, because recurrence is bound to result. Root Cause Analysis (RCA) is a structured evaluation method that identifies the root causes of an undesired outcome and the actions adequate to prevent recurrence. The analysis should continue until organizational factors have been identified or until the data are exhausted. It allows learning from past problems, failures, and accidents. RCA helps professionals determine what happened, how it happened, and why it happened.
When performing an investigation, it is necessary to look at more than just the immediately visible cause, which is often the proximate cause. (Proximate cause: the event(s) that occurred, including any condition(s) that existed immediately before the undesired outcome, that directly resulted in its occurrence and, if eliminated or modified, would have prevented the undesired outcome.)
There are also underlying organizational causes that are more difficult to see. They may contribute significantly to the undesired outcome and, if not corrected, will continue to create similar types of problems. These are root causes. (Root cause: one of multiple factors (events, conditions, or organizational factors) that contributed to or created the proximate cause and subsequent undesired outcome and, if eliminated or modified, would have prevented the undesired outcome. Typically, multiple root causes contribute to an undesired outcome.)
Under the requirements for mishap reporting and investigating, all mishap investigations must identify the proximate cause(s), root cause(s), and contributing factor(s). (Organizational factors: any operational or management structural entity that exerts control over the system at any stage in its life cycle, including but not limited to the system’s concept development, design, fabrication, test, maintenance, operation, and disposal.)
Steps:
- Identify and clearly define the undesired outcome (outage).
- Gather data. (Identify facts surrounding the undesired outcome.)
  - When did the undesired outcome occur?
  - Where did it occur?
  - What conditions were present prior to its occurrence?
  - What controls or barriers could have prevented its occurrence but did not?
  - What are all the potential causes?
  - What actions can prevent recurrence?
  - What amelioration occurred? Did it prevent further damage?
- Create a timeline.
- Place events and conditions on an event and causal factor tree.
- Use a fault tree or other method/tool to identify all potential causes.
- Decompose system failures down to basic events or conditions (further describe what happened).
- Identify specific failure modes (immediate causes).
- Continue asking “WHY” to identify root causes.
- Check your logic and your facts. Eliminate items that are not causes or contributing factors.
- Generate solutions that address both proximate causes and root causes.
Illustrate the sequence of events in chronological order horizontally across the page. Depict relationships between conditions, events, and exceeded or failed barriers/controls.
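The timeline step can be sketched in code. This is only an illustration under stated assumptions, not part of the method itself: the `TimelineEntry` type, the two helper functions, and the sample outage data are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TimelineEntry:
    """One item on the timeline (hypothetical structure for illustration)."""
    when: datetime
    kind: str          # "event", "condition", or "barrier"
    description: str


def build_timeline(entries):
    """Return the entries in chronological order."""
    return sorted(entries, key=lambda entry: entry.when)


def render_timeline(entries):
    """Render the ordered entries left to right on one line."""
    return " -> ".join(
        f"{entry.when:%H:%M} {entry.kind}: {entry.description}"
        for entry in build_timeline(entries)
    )


# Made-up sample data, entered out of order on purpose.
sample = [
    TimelineEntry(datetime(2024, 1, 1, 10, 5), "event", "application stops responding"),
    TimelineEntry(datetime(2024, 1, 1, 9, 30), "condition", "disk 95% full"),
    TimelineEntry(datetime(2024, 1, 1, 9, 50), "barrier", "disk alert not acknowledged"),
]

print(render_timeline(sample))
```

Tagging each entry as an event, condition, or barrier keeps the relationships visible once the entries are laid out in order.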
If amelioration occurred (e.g., rebooting a server or moving an application to another server), include it in the evaluation to ensure that it did not contribute to the undesired outcome. Example: In the case of a server reboot, the investigation should ensure that the reboot was the result of the mishap and not a result of latent hardware defects.
Example: Simple timeline
Create an event and causal factor tree:
The tree is a visual representation of the causes that led to the failure or mishap. Place the undesired outcome at the top of the tree. Add all events, conditions, and exceeded/failed barriers that occurred immediately before the undesired outcome and might have caused it. Brainstorm to ensure that all possible causes are included, NOT just those that you are sure are involved. Be sure to consider people, hardware, software, policy, procedures, and the environment. If you have solid data indicating that one of the possible causes is not applicable, it can be eliminated from the tree. (Caution: Do not be too eager to eliminate early on. If there is a possibility that an item is a causal factor, leave it and eliminate it later when more information is available.)
You may use a fault tree to determine all potential causes and to decompose the failure down to the “basic event” (e.g., system component level). A fault tree can also be used to identify all possible types of human failures.
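A tree of this kind can be represented with a small recursive data structure. The sketch below is an assumption-laden illustration: `CausalNode`, `basic_events`, `render`, and the sample outage are hypothetical names and data, not taken from any NASA/JPL tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CausalNode:
    """A box on the tree (hypothetical structure for illustration)."""
    description: str
    category: str = "unspecified"  # e.g. people, hardware, software, policy
    causes: List["CausalNode"] = field(default_factory=list)

    def add_cause(self, description, category="unspecified"):
        """Attach a possible cause beneath this node and return it."""
        child = CausalNode(description, category)
        self.causes.append(child)
        return child


def basic_events(node):
    """Collect the leaves: items not yet decomposed any further."""
    if not node.causes:
        return [node.description]
    leaves = []
    for cause in node.causes:
        leaves.extend(basic_events(cause))
    return leaves


def render(node, depth=0):
    """Return the tree as indented lines, undesired outcome first."""
    lines = ["  " * depth + f"{node.description} [{node.category}]"]
    for cause in node.causes:
        lines.extend(render(cause, depth + 1))
    return lines


# Made-up example: the undesired outcome sits at the top of the tree.
outcome = CausalNode("web application outage", "outcome")
disk = outcome.add_cause("database disk full", "hardware")
disk.add_cause("log rotation disabled", "software")
disk.add_cause("disk alert not acknowledged", "people")
outcome.add_cause("failover did not trigger", "software")

print("\n".join(render(outcome)))
```

Calling `basic_events(outcome)` lists the items at the bottom of each branch, which are the candidates for the next round of “why” questions.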
After you have identified all the possible causes, ask yourself “WHY” each may have occurred. Be sure to keep your questions focused on the original issue. For example: “Why was the condition present?”, “Why did the event occur?”, “Why was the parameter exceeded?”, or “Why did the condition fail?”
Continue to ask “why” until you have reached:
- Root cause(s) - including all organizational factors that exert control over the design, fabrication, development, maintenance, operation, and disposal of the system.
- A problem that is not correctable by IT or an IT contractor.
- Insufficient data to continue.
Verify that each item is a contributor or cause. If the action, deficiency, or decision in question were corrected, eliminated, or avoided, would the undesired outcome be prevented or avoided? If not, eliminate it from the tree.
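The verification check can be expressed as a simple pruning pass. In this sketch the tree is a nested dictionary and the `"prevents"` flag records the analyst's answer to the question above. The layout, the `prune` function, and the sample data are hypothetical, and dropping everything beneath a rejected item is an assumption of this sketch rather than a rule stated by the method.

```python
def prune(item):
    """Return a copy of the tree keeping only items that pass the check.

    Each item is a dict with:
      "what"     - the action, deficiency, or decision in question
      "prevents" - analyst's answer: would correcting it have prevented
                   the undesired outcome? (True or False)
      "causes"   - list of items one level further down
    An item that fails the check is dropped with everything beneath it.
    """
    return {
        "what": item["what"],
        "prevents": item["prevents"],
        "causes": [prune(c) for c in item.get("causes", []) if c["prevents"]],
    }


# Made-up example tree; the answers are illustrative only.
tree = {
    "what": "web application outage",
    "prevents": True,  # the outcome itself, kept as the top of the tree
    "causes": [
        {"what": "database disk full", "prevents": True, "causes": [
            {"what": "log rotation disabled", "prevents": True, "causes": []},
            {"what": "server rack recently relabeled", "prevents": False, "causes": []},
        ]},
        {"what": "scheduled backup running", "prevents": False, "causes": []},
    ],
}

checked = prune(tree)
print([c["what"] for c in checked["causes"]])
```

Because `prune` builds a new dictionary, the original tree stays available in case a rejected item needs to be reconsidered when more data arrive.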
The remaining items on the tree are the causes (or probable causes) necessary to produce the undesired outcome. Proximate causes are those immediately before the undesired outcome. Intermediate causes are those between the proximate and root causes. Root causes are organizational factors or systemic problems located at the bottom of the tree.
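These three labels follow from position on the tree, which the sketch below makes explicit. It assumes a hypothetical tree stored as a dictionary mapping each item to the items directly beneath it, with each item appearing only once. Position is only a proxy: the analyst still has to confirm that the items at the bottom really are organizational or systemic. As a further assumption, an item directly under the outcome is labeled proximate even if nothing lies beneath it.

```python
def classify(tree, outcome):
    """Label each cause by its position beneath the undesired outcome.

    `tree` maps each item to the list of items directly beneath it.
    """
    labels = {}

    def walk(item, depth):
        for cause in tree.get(item, []):
            if depth == 0:
                labels[cause] = "proximate"      # immediately before the outcome
            elif not tree.get(cause):
                labels[cause] = "root"           # bottom of the tree
            else:
                labels[cause] = "intermediate"   # between proximate and root
            walk(cause, depth + 1)

    walk(outcome, 0)
    return labels


# Made-up example tree after verification.
tree = {
    "web application outage": ["database disk full"],
    "database disk full": ["log rotation disabled"],
    "log rotation disabled": ["no change review for configuration edits"],
}

for cause, label in classify(tree, "web application outage").items():
    print(f"{label}: {cause}")
```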
Some people choose to leave contributing factors on the tree to show all factors that influenced the event. (Contributing factor: an event or condition that may have contributed to the occurrence of an undesired outcome but, if eliminated or modified, would not by itself have prevented the occurrence.) If this is done, illustrate them differently (e.g., dotted-line boxes and arrows) so that it is clear that they are not causes.
Example: