Root Cause Analysis (RCA), Explained

Root Cause Analysis (RCA) is a structured way to find out what causes an issue in your IT operations (ITOps). It is a versatile analysis method for corrective action that fits naturally within the ITIL framework, and it’s a comprehensive approach that all managers can appreciate.

In the IT industry, this method is invaluable, since the ability to address problems swiftly and effectively is what distinguishes proactive IT Service Management (ITSM). But it’s not just valuable for problem-solving; RCA also fosters a culture of continuous improvement, learning, and innovation without playing the blame game.

So, if you want to turn problems into predictable, manageable events and navigate the complexities of your IT operations with confidence, keep reading.

We’ll explore everything there is to know about RCA and how to tailor a method that aligns with your organization’s goals.

Let's get started.

ITIL and Root Cause Analysis

Root Cause Analysis is a method used to understand the causes of a problem or incident. It is essentially the approach the ITIL framework relies on to standardize Problem Management.

As you may know, Problem Management is one of the processes under the Service Operation phase of ITIL, which aims to manage the lifecycle of all problems that could or do affect IT services.

Both ITIL and RCA embody the principle of continuous improvement. In theory, integrating them is not just a procedural requirement but a strategic approach. In practice, the synergy appears when RCA is used to investigate the cause of problems so that we can implement long-term solutions instead of temporary fixes.

All of this leads to fewer disruptions and more reliable services. It also builds a robust knowledge base that helps your team make informed decisions when preventing future problems.

Why do you need RCA?

It goes without saying that organizations need a problem-solving method that is effective. And more importantly, one that doesn’t merely address the immediate symptoms of a failure.

Keep in mind that problems are often complex, with not just a single cause but multiple contributing factors. Hence, there is a lot of value in digging deeper into problems to improve your operations. 

So, why do you need to implement RCA? First, because your business resilience depends on preemptively mitigating potential disruptions and downtime. Second, because understanding root causes is as much about replicating success as it is about preventing negative outcomes.

Benefits of performing RCA

  • RCA helps in streamlining processes and removing bottlenecks that lower your IT team’s productivity.
  • On that note, an environment with fewer disruptions also contributes to better work quality and employee morale.
  • With this method, customers are less likely to face the same issues repeatedly. Reliable products and services improve consumer satisfaction and loyalty.
  • Over time, root cause analysis saves an organization time, money, and resources because repeated repairs or adjustments are not necessary.
  • Most importantly, you’ll get peace of mind, since you are building a stable and predictable operational environment. 

5 Root Cause Analysis methods

There are several ways in which your organization can conduct Root Cause Analysis. The choice of method depends on the complexity of the issue, the level of detail required in the analysis, the available data and resources, and the desired outcome of the RCA process.

Let’s check them out.

1. 5 Whys

The 5 Whys method involves asking the question "Why?" repeatedly to peel away the layers of symptoms and reach the core issue. It's a straightforward technique that doesn't require statistical analysis, making it accessible for anyone to use. However, its simplicity can also be a limitation, as it might not be suitable for complex problems with multiple root causes.

For instance, let’s say users are experiencing slow response times when accessing the company's internal Customer Relationship Management (CRM) system.

  1. Why? The CRM system's server is experiencing high latency.
  2. Why? The server's CPU usage is consistently at 100% during peak business hours.
  3. Why? A recent update to the CRM software introduced a memory leak that increases CPU usage over time.
  4. Why? The update was not fully tested in a simulated production environment before deployment.
  5. Why? The IT department has been under-resourced and couldn't allocate time for comprehensive testing due to back-to-back project deadlines.
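If your team wants to keep such chains in a consistent, searchable form, the example above could be recorded with a small structure like the following. This is only a minimal sketch: the FiveWhys class and its methods are hypothetical names invented for illustration, and treating the last answer as the root cause is a convention of the technique rather than a guarantee.

```python
from dataclasses import dataclass, field


@dataclass
class FiveWhys:
    """Hypothetical helper: a problem statement plus the answers to each 'Why?'."""
    problem: str
    answers: list[str] = field(default_factory=list)

    def ask_why(self, answer: str) -> None:
        # Each recorded answer becomes the subject of the next "Why?".
        self.answers.append(answer)

    def root_cause(self) -> str:
        # By convention, the last answer in the chain is treated as the root cause.
        if not self.answers:
            raise ValueError("no answers recorded yet")
        return self.answers[-1]

    def report(self) -> str:
        lines = [f"Problem: {self.problem}"]
        lines += [f"  Why #{i}: {a}" for i, a in enumerate(self.answers, start=1)]
        return "\n".join(lines)


analysis = FiveWhys("Users see slow response times in the internal CRM system")
analysis.ask_why("The CRM server is experiencing high latency")
analysis.ask_why("Server CPU usage is at 100% during peak business hours")
analysis.ask_why("A recent CRM update introduced a memory leak")
analysis.ask_why("The update was not fully tested before deployment")
analysis.ask_why("The IT department was under-resourced for comprehensive testing")

print(analysis.report())
print("Root cause:", analysis.root_cause())
```

Storing the chain this way makes it easy to add the finished analysis to a knowledge base entry for the problem.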

2. Failure Mode and Effects Analysis (FMEA)

FMEA is a step-by-step approach for identifying all possible failures in a design, a manufacturing or assembly process, or a product or service. It's particularly useful in early stages of development, as it helps to prevent problems before they occur. FMEA evaluates the severity, likelihood, and detectability of failures to prioritize which ones need to be addressed first.
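A common way to turn those three ratings into a priority order is the Risk Priority Number (RPN), calculated as severity × occurrence × detection, with each factor usually rated from 1 to 10. The sketch below assumes that convention (occurrence standing in for likelihood, detection for detectability); the failure modes and their scores are made up purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    severity: int    # 1 (negligible impact) to 10 (catastrophic impact)
    occurrence: int  # 1 (very unlikely) to 10 (almost certain)
    detection: int   # 1 (almost always caught) to 10 (almost never caught)

    @property
    def rpn(self) -> int:
        # Risk Priority Number: a higher value means the failure mode
        # should be addressed sooner.
        return self.severity * self.occurrence * self.detection


# Hypothetical failure modes and ratings, for illustration only.
failure_modes = [
    FailureMode("Database disk fills up", severity=8, occurrence=4, detection=3),
    FailureMode("Expired TLS certificate", severity=7, occurrence=3, detection=2),
    FailureMode("Untested software update", severity=6, occurrence=5, detection=7),
]

# Rank failure modes from highest to lowest RPN.
ranked = sorted(failure_modes, key=lambda fm: fm.rpn, reverse=True)
for fm in ranked:
    print(f"{fm.rpn:>4}  {fm.name}")
```

Note that the RPN is only a ranking aid: a failure mode with a very high severity may deserve attention even if its overall score is modest.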

3. Fishbone Diagram (Ishikawa Diagram)

The Fishbone Diagram, also known as the Ishikawa Diagram, is a visual tool used to systematically identify and present all possible causes of a problem. It suits complex issues that need a structured analysis, and it helps teams brainstorm and categorize causes into groups such as methods, machines, materials, people, environment, and measurement.

4. Pareto Analysis

To identify the most significant issues and decide where to focus your team's efforts, go with Pareto Analysis, which is based on the Pareto Principle (80/20 rule). It’s used to prioritize problems or causes so you can concentrate on those that will have the greatest impact if solved. Your team will create a Pareto chart, where causes are listed on the x-axis and the frequency or impact of each cause is shown on the y-axis.
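The numbers behind a Pareto chart are simple to compute: count how often each cause occurs, sort from most to least frequent, and track the cumulative percentage. Here is a minimal sketch of that calculation; the incident log, the function names, and the 80% threshold are assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical incident log: one entry per incident, labeled with its cause.
incident_causes = (
    ["Failed deployment"] * 42
    + ["Capacity exhaustion"] * 28
    + ["Expired certificate"] * 12
    + ["Network misconfiguration"] * 10
    + ["Hardware fault"] * 8
)


def pareto_table(causes):
    """Return (cause, count, cumulative_percent) rows, most frequent first."""
    counts = Counter(causes)
    total = sum(counts.values())
    rows, running = [], 0
    for cause, count in counts.most_common():
        running += count
        rows.append((cause, count, round(100 * running / total, 1)))
    return rows


def vital_few(rows, threshold=80.0):
    """Return the causes up to and including the first one that reaches the threshold."""
    selected = []
    for cause, _count, cumulative in rows:
        selected.append(cause)
        if cumulative >= threshold:
            break
    return selected


rows = pareto_table(incident_causes)
for cause, count, cumulative in rows:
    print(f"{cause:<26}{count:>4}{cumulative:>8}%")
print("Focus on:", vital_few(rows))
```

In this made-up log, three of the five causes account for just over 80% of incidents, so those are the ones to tackle first.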

5. Fault Tree Analysis (FTA)

Fault Tree Analysis is a top-down, deductive analytical method used to explore the causes of specific events (usually adverse events). It uses a graphical representation of various parallel and sequential causes that can lead to the event. FTA is particularly useful in industries like aerospace and nuclear power, where preventing failures is critical.
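To make the idea concrete, a fault tree can be modeled as basic events combined through AND and OR gates. The sketch below estimates the probability of the top event under the simplifying assumption that all basic events are independent; the class names, the tree itself, and the probabilities are hypothetical.

```python
from dataclasses import dataclass
from math import prod


@dataclass
class BasicEvent:
    name: str
    probability: float

    def p(self) -> float:
        return self.probability


@dataclass
class Gate:
    name: str
    kind: str     # "AND": all inputs must occur; "OR": any one input suffices
    inputs: list  # BasicEvent or Gate instances

    def p(self) -> float:
        # Assumes the inputs are statistically independent of each other.
        probs = [child.p() for child in self.inputs]
        if self.kind == "AND":
            return prod(probs)
        if self.kind == "OR":
            return 1 - prod(1 - q for q in probs)
        raise ValueError(f"unknown gate kind: {self.kind}")


# Hypothetical tree: the service goes down if power is lost
# (mains AND backup both fail) OR the application crashes.
power_loss = Gate("Power loss", "AND", [
    BasicEvent("Mains power failure", 0.01),
    BasicEvent("UPS failure", 0.05),
])
top_event = Gate("Service outage", "OR", [
    power_loss,
    BasicEvent("Application crash", 0.02),
])

print(f"P({top_event.name}) = {top_event.p():.5f}")
```

Even in this toy tree, the calculation shows where the risk concentrates: the application crash dominates the outage probability, while the redundant power path contributes very little.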

How to choose the right RCA method?

The context of the issue at hand will help you choose the right RCA method, as each one has its own strengths. These are some considerations to weigh when choosing a method that aligns with the issue's nature, your objectives, and your organization's capabilities:

  • Simple issues only need basic methods like the 5 Whys, while complex problems require detailed analysis techniques like FMEA or FTA.
  • Data-intensive methods like FMEA are suitable when detailed information is available. With limited data, the Fishbone Diagram is a better fit, as it relies on expert judgment.
  • Consider the expertise within your team and the resources to dedicate. Some methods require specialized knowledge or tools.
  • Remember that compliance with industry-specific RCA standards may dictate the method choice, as is the case in aviation.
  • Depending on your organization’s culture, methods like the 5 Whys or Fishbone Diagram workshops promote better team and stakeholder involvement.

     

How to do Root Cause Analysis?

RCA is based on the premise that it is more effective to systematically prevent and solve underlying issues than to just treat symptoms. The process is relatively straightforward and typically involves these key steps:

  1. Defining the problem: Clearly stating the problem or issue that has occurred.
  2. Collecting data: Gathering all relevant information about the problem, including when and where it occurred, and under what conditions.
  3. Analyzing data: Using various RCA methods (such as the 5 Whys, Fishbone Diagram, FMEA, etc.) to explore potential causes.
  4. Identifying the root cause(s): Determining the underlying factors that led to the problem.
  5. Developing solutions: Proposing and implementing solutions that address these root causes.
  6. Monitoring: Assessing the effectiveness of the solutions over time to ensure the problem does not recur.

Now, to integrate RCA within the broader organizational processes, you might also consider these actions or principles: 

  • Don’t isolate problem-solving; instead, make RCA a standard practice across all levels and departments.
  • Encourage the formation of cross-functional teams, since diverse perspectives can shed light on overlooked aspects of a problem.
  • Focus on improvements and solutions, not on the blame game.
  • Prioritize root causes based on their impact and the feasibility of implementing solutions, which means dealing with the most critical aspects first.
  • Go deeper. The first conclusions may not always uncover the deepest root cause.
  • Find proper evidence. Assumptions, guesses, and opinions are not sufficient.
  • Invest in specialized software tools that can streamline the RCA process. AI-driven RCA tools hold great promise.
  • Establish mechanisms to monitor and review the outcomes of RCA efforts regularly.
  • Shift the focus from reactive problem-solving to prevention. How can your team anticipate potential issues and be prepared?

5 root cause analysis examples

The root cause analysis approach is versatile and systematic, which means it’s applicable to various fields such as ITSM, manufacturing, aviation, healthcare, and more.

These are some examples of RCA's adaptability and its impact on our everyday lives.

1. Car manufacturing defects

Suppose a car manufacturer notices an unusual rate of returns due to the failure of a specific engine component. A simple RCA could reveal a flaw in the manufacturing process: incorrect temperature settings during heat treatment weaken the metal.

The root cause is identified as a misconfigured machine. The fix is to reconfigure the machine settings and retrain staff on the correct procedures.

2. Ineffective sterilization process at a hospital

If a hospital experiences a higher-than-average rate of post-surgical infections in one of its wards, a root cause analysis may uncover that the sterilization process for surgical instruments was compromised.

The cause is traced back to a recently changed cleaning solution that was not effective against all types of bacteria. The most reasonable response is for the hospital to return to the previously effective cleaning solution and add checks for sterilization effectiveness.

3. IT system outages

This is a generic case, but think of an IT company that faces frequent, unexplained outages of its customer service platform. Through RCA, it's discovered that the outages coincide with high traffic volumes that exceed the system's capacity.

The root cause turns out to be inadequate scaling policies for cloud resources. The fix is to correct those policies and invest in more robust infrastructure that can handle peak loads.
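A finding like "outages coincide with high traffic" should rest on evidence rather than impressions. As a rough sketch, the check below measures what share of recorded outages happened while traffic exceeded capacity; the hourly samples, the capacity figure, and the function name are all hypothetical.

```python
# Hypothetical hourly samples: (hour, requests_per_second, outage_occurred)
samples = [
    (9, 400, False),
    (10, 950, False),
    (11, 1250, True),
    (12, 1400, True),
    (13, 800, False),
    (14, 1300, True),
    (15, 600, False),
    (16, 700, True),
]

CAPACITY_RPS = 1200  # assumed maximum load the platform can sustain


def outage_share_over_capacity(samples, capacity):
    """Return the fraction of outages that occurred while load exceeded capacity."""
    outages = [s for s in samples if s[2]]
    if not outages:
        return 0.0
    over_capacity = [s for s in outages if s[1] > capacity]
    return len(over_capacity) / len(outages)


share = outage_share_over_capacity(samples, CAPACITY_RPS)
print(f"{share:.0%} of outages occurred while traffic exceeded capacity")
```

A high share supports the capacity hypothesis, while any outage that occurred under normal load (such as the one at hour 16 in this sample) points to a second contributing cause worth investigating.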

4. Equipment failure in aviation

Let’s say an airline finds that a particular model of aircraft frequently requires unscheduled maintenance for landing gear issues. RCA identifies that the landing gear problem is due to premature wear of a hydraulic seal. 

Further investigation reveals that the issue stems from a recent switch to a cheaper hydraulic fluid that lacks certain lubricating properties. The fix is to switch back to the original hydraulic fluid and replace the affected seals.

5. High employee turnover at a retail chain

Lastly, imagine that a retail chain is experiencing turnover rates among its store employees that are higher than the industry average. An RCA conducted through exit interviews and employee surveys reveals that the primary cause of dissatisfaction is inflexible scheduling that doesn't consider employee availability or preferences.

The company implements a new scheduling system that allows for greater employee input into their schedules.

Final thoughts

Our advice is to build a constructive RCA process focused on understanding how and why a problem occurred, rather than attributing blame to individuals or teams. If you follow this principle, you’ll encourage a culture of openness and learning, where the goal is improvement rather than punishment.

Also, when it comes to problem-solving through RCA, both causes and symptoms are worth recording in your knowledge base, but the aim is always to get to the root of the problem. Identify the root causes and allocate your team’s efforts accordingly. Be methodical and evidence-driven: in our experience, identifying patterns makes many potential issues far more predictable.

Lastly, for IT operations, where the complexity and volume of data are substantial, AI-driven RCA would be particularly beneficial for forecasting potential disruptions. As this technology develops, we’ll see tools that quickly diagnose issues in software and hardware, predict failures, and suggest corrective actions.

With this in mind, we may eventually see self-optimizing systems that free up IT teams to focus on strategic initiatives rather than being bogged down by routine problem-solving.

The key is to focus on investing, involving, and integrating!