Problem management is the IT service management (ITSM) process, or capability, that manages the removal of recurring issues by ensuring that a long-term resolution is found. Done well, it improves service levels and reduces costs because it facilitates the required root-cause analysis, identifies solutions (both interim and permanent), and works with other teams to ensure these solutions are delivered safely.
Having a formal problem management process, or capability, is a must if you want your IT organization to consistently deliver high levels of both availability and performance (and thus a minimum of IT issues). This blog offers some practical tips on getting started with problem management that will take you beyond our previous ITSM 101 blogs.
Conduct Major Incident Reviews
Following up on major incidents is a great way to get your problem management process off the ground.
Putting a structured approach in place for dealing with the aftermath of major incidents is a quick win for most IT organizations. It has visible benefits in identifying both the root cause(s) and preventative actions for the future, as well as in showing how well major incident management activities were handled. It will also most likely relate to an area, or issue, that has caused (and might continue to cause) real business issues, so the business will care that you have taken steps to prevent a repeat occurrence.
Things to capture when reviewing a major incident include:
- Description of the major incident
- Service(s) affected
- Business units and operations impacted
- Resolving team
- Number of related incidents
- Related change details if applicable
- Fix effort
- Final fix
- Root cause
- Actions to prevent recurrence
Packaging the above information into a short document gives senior management a valuable overview of what went wrong, how it was fixed, and the steps needed to prevent a repeat occurrence.
Plus, rather than having to desperately scramble for information when asked “What happened?”, your management team will have a cohesive overview, ideally written in business language, that instantly raises the game of your IT support offering in terms of responsiveness and professionalism.
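If you want to keep these reviews consistent from one major incident to the next, a simple record structure can help. The sketch below is only an illustration in Python: the class name, the field names, and the choice to measure fix effort in hours are all assumptions, not part of any ITSM tool, and the sample values are invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MajorIncidentReview:
    """One record per major incident. Field names are illustrative."""
    description: str
    services_affected: list[str]
    business_units_impacted: list[str]
    resolving_team: str
    related_incident_count: int
    fix_effort_hours: float  # assumption: effort is tracked in hours
    final_fix: str
    root_cause: str
    preventative_actions: list[str]
    related_change: Optional[str] = None  # only set if a change was involved

    def summary(self) -> str:
        """Render a short, plain-language overview for senior management."""
        lines = [
            f"What happened: {self.description}",
            f"Services affected: {', '.join(self.services_affected)}",
            f"Business units impacted: {', '.join(self.business_units_impacted)}",
            f"Resolved by: {self.resolving_team}",
            f"Related incidents: {self.related_incident_count}",
            f"Related change: {self.related_change or 'n/a'}",
            f"Fix effort: {self.fix_effort_hours} hours",
            f"Final fix: {self.final_fix}",
            f"Root cause: {self.root_cause}",
            f"Actions to prevent recurrence: {'; '.join(self.preventative_actions)}",
        ]
        return "\n".join(lines)


# Invented sample data, purely for illustration.
review = MajorIncidentReview(
    description="Payment portal unavailable for two hours",
    services_affected=["Payment portal"],
    business_units_impacted=["Finance", "Sales"],
    resolving_team="Database team",
    related_incident_count=42,
    fix_effort_hours=6.5,
    final_fix="Restarted the service and increased the connection pool",
    root_cause="Connection pool exhausted",
    preventative_actions=["Add a pool usage alert", "Review pool sizing"],
)
print(review.summary())
```

Because every review has the same fields, it becomes much easier to compare major incidents over time and to spot the same root cause turning up more than once.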
Find Workarounds for Common Issues
Continuing on the theme of quick wins and raising the perceptions of IT, use problem management to find temporary solutions to common issues. These are termed “workarounds.”
Why are these workarounds sometimes needed?
Not every problem, or underlying issue, is going to be easy to diagnose and fix, and not all problems can be resolved permanently. Sometimes an issue is too expensive to remediate, or the fix could cause issues with another aspect of the service. In other cases, there aren’t enough resources in place to investigate complex issues.
If this is the case, rather than getting disheartened, look at the problem statement and investigate whether the issue can be circumvented by a temporary solution. It’s something that won’t fix everything forever, but it will get the service(s) and users back up and running again.
Examples of workarounds could include maintenance reboots for older network devices, adding capacity to file servers, looking into virtualization, or switching routine batch jobs to run overnight to reduce the load on the network.
Aim to Get on TOP of Things (yes, the capitalization of TOP is deliberate)
We’ve all seen problems that happen again and again, from the network being slow on Monday mornings, to business-affecting application-performance issues, to unexpected downtime. Problem management should be the capability tasked with fixing these repeat offenders.
One approach that can yield great results here is called the Technical Observation Post or TOP.
A TOP is a team of individuals, each from a different area, who are pulled together to look at an incident or problem in detail. If you have a problem that you’re stuck on, consider assembling a TOP team so that you have people looking at it from multiple angles.
For example, you might bring in someone from networks to look at how traffic is routed, someone from applications to look at the code, and someone from the data center to look at hosting and storage. This approach enables the problem to be investigated from multiple perspectives, maximizing efficiency and the understanding of the root cause so that it can be fixed or circumvented, rather than leaving the IT department in firefighting mode all the time.
Practice Proactive Problem Management
Proactive problem management is the side of the capability that attempts to prevent incidents from happening by identifying weaknesses in the IT infrastructure and operations.
Proactive problem management analyzes incident records and uses data collected by other ITSM processes to identify trends or significant issues. This can be done by:
- Working with the service desk: ask them what they see coming through the incident management process day-in and day-out.
- Conducting trend analysis: review previous incidents and look for common or recurring themes by service and business unit.
- Raising changes to prevent incidents from recurring or, better still, occurring in the first place. Examples of proactive changes could include controlled system restarts, monthly security patching, capacity alerts, and network monitoring.
- Working with support teams and service delivery managers: ask them what hasn’t failed yet but is at risk. For example, aged equipment or defects in code.
- Talking to your customers: get an understanding of business-critical times so you can plan accordingly, whether by engaging with availability management to ensure the right resilience is in place, capacity management to ensure that the right levels of performance can be maintained, or change management to limit or restrict change volumes.
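The trend analysis mentioned above doesn’t need specialist tooling to get started. As a rough sketch, assuming you can export incidents as records with `service` and `business_unit` fields (both names are hypothetical, as is the threshold of three), a few lines of Python will surface the repeat offenders:

```python
from collections import Counter


def recurring_themes(incidents, min_count=3):
    """Count incidents per (service, business unit) pair and return the
    pairs seen at least min_count times, most frequent first."""
    counts = Counter((i["service"], i["business_unit"]) for i in incidents)
    return [(pair, n) for pair, n in counts.most_common() if n >= min_count]


# Invented sample export, purely for illustration.
incidents = [
    {"service": "Email", "business_unit": "Sales"},
    {"service": "Email", "business_unit": "Sales"},
    {"service": "VPN", "business_unit": "Finance"},
    {"service": "Email", "business_unit": "Sales"},
    {"service": "VPN", "business_unit": "Sales"},
]
print(recurring_themes(incidents))
```

Anything that appears in the output is a candidate for a problem record. The threshold is a judgment call, so tune it to your own incident volumes.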
Deliver Monthly Updates on Problem Management Success
Make sure your problem management efforts, and successes, are both interactive and visible.
Have a monthly meeting and invite a representative from each area so that you can get a status and progress update for each logged problem. All too often the perception can be that things have fallen into a black hole once the service level agreement (SLA) clock has been paused (because an incident has been related to a problem record). So, a regular update of each problem and what’s been done to address it recently will go a long way to alleviate these concerns.
Also use this meeting to promote the process and to keep it moving forward. Shout about any quick wins, and make sure you publish a list of resolved or mitigated problems to the business stakeholders who have benefited, or will benefit, from the activities. The more that people see being resolved, the more confidence they’ll have in the process.
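To help avoid the “black hole” perception ahead of each monthly meeting, you could flag open problems that haven’t been updated recently. The following is a minimal sketch, assuming each problem record carries an `id`, a `status`, and a `last_update` date; those field names, the 30-day cut-off, and the sample records are assumptions for illustration only.

```python
from datetime import date


def stale_problems(problems, today, max_days=30):
    """Return the IDs of open problems with no update in more than max_days days."""
    return [
        p["id"]
        for p in problems
        if p["status"] == "open" and (today - p["last_update"]).days > max_days
    ]


# Invented sample records, purely for illustration.
problems = [
    {"id": "PRB-1", "status": "open", "last_update": date(2024, 5, 1)},
    {"id": "PRB-2", "status": "open", "last_update": date(2024, 6, 20)},
    {"id": "PRB-3", "status": "resolved", "last_update": date(2024, 3, 15)},
]
print(stale_problems(problems, today=date(2024, 6, 30)))
```

The problems this returns are the ones to chase for an update before the meeting, so that every logged problem has something recent to report.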
So, that’s five ‘problem management 102 tips’ from us. What are your top tips for kickstarting problem management? Please let us know in the comments.