Incident management is the process that IT service management teams follow to respond to a service disruption and restore the service to normal as quickly as possible, minimizing the negative impact on the business.
An incident is a single unplanned event that generates a service disruption, whereas a problem is a cause or potential cause of one or more incidents, as defined by ITIL incident management guidelines.
Incident management differs from problem management in that the former addresses specific disruptive events in real time, whereas the latter focuses on identifying and preventing the root causes of those events.
Incident and problem management are therefore related, but they are not the same. Incident management processes are mainly used by service desk teams, from the moment the incident ticket is received. The service desk is the single point of contact for end users to report incidents.
Incident management lifecycle
The incident management process refers to the company's guidelines or framework for identifying and responding to a service outage or other disruptive event. An incident management process starts when a user reports an issue and ends when a service desk team member solves that issue.
There are two major industry-standard incident response frameworks: NIST and SANS. Having a clear incident response strategy also supports good problem management.
NIST stands for National Institute of Standards and Technology. Its incident response plan consists of four steps:
- Preparation
- Detection and Analysis
- Containment, Eradication, and Recovery
- Post-Incident Activity
SANS stands for SysAdmin, Audit, Network, and Security. It is a private organization that focuses on security incidents. Its incident response process consists of six steps:
- Preparation
- Identification
- Containment
- Eradication
- Recovery
- Lessons Learned
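The relationship between the two step lists can be sketched as a small lookup table. This is only an illustration: the names `NistPhase`, `SansPhase`, `SANS_TO_NIST` and `sans_steps_for` are hypothetical and do not come from either framework, and the mapping simply follows the step names listed above.

```python
from enum import Enum


class NistPhase(Enum):
    PREPARATION = "Preparation"
    DETECTION_AND_ANALYSIS = "Detection and Analysis"
    CONTAINMENT_ERADICATION_RECOVERY = "Containment, Eradication, and Recovery"
    POST_INCIDENT_ACTIVITY = "Post-Incident Activity"


class SansPhase(Enum):
    PREPARATION = "Preparation"
    IDENTIFICATION = "Identification"
    CONTAINMENT = "Containment"
    ERADICATION = "Eradication"
    RECOVERY = "Recovery"
    LESSONS_LEARNED = "Lessons Learned"


# Hypothetical mapping: each SANS step points to the NIST step that covers it.
SANS_TO_NIST = {
    SansPhase.PREPARATION: NistPhase.PREPARATION,
    SansPhase.IDENTIFICATION: NistPhase.DETECTION_AND_ANALYSIS,
    SansPhase.CONTAINMENT: NistPhase.CONTAINMENT_ERADICATION_RECOVERY,
    SansPhase.ERADICATION: NistPhase.CONTAINMENT_ERADICATION_RECOVERY,
    SansPhase.RECOVERY: NistPhase.CONTAINMENT_ERADICATION_RECOVERY,
    SansPhase.LESSONS_LEARNED: NistPhase.POST_INCIDENT_ACTIVITY,
}


def sans_steps_for(nist_phase):
    """Return the SANS steps covered by one NIST step, in SANS order."""
    return [step for step in SansPhase if SANS_TO_NIST[step] is nist_phase]


if __name__ == "__main__":
    for phase in NistPhase:
        covered = ", ".join(step.value for step in sans_steps_for(phase))
        print(f"{phase.value}: {covered}")
```

A table like this can help a team that documents its process in one framework's vocabulary translate it into the other's.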
Preparation
The two incident management processes have some similarities. In both cases the first step is preparation, which requires compiling an inventory of all assets and ranking them according to their level of importance. Preparation also involves creating an incident response plan for each type of event and for similar incidents.
Then, it is necessary to create a communication strategy and identify whom to contact, and how, depending on the situation. Incident responders should take this into account from the very onset of the incident. This is also important for problem management.
Identification/Detection and Analysis
The second step in both standards requires identifying the incident. Once this is done, the incident management team should evaluate what caused the breach, so that the unexpected issue becomes a known error that can be prevented in the future. All this data will eventually help problem management as well.
Containment, Eradication, and Recovery
NIST groups these three steps into one, whereas SANS describes them separately. Containment seeks to stop the incident or breach as soon as possible to limit the damage it may cause. Eradication applies once the breach has occurred or the threat actor is inside the system: the incident management team must remove the threat so that it does not spread to other areas. Recovery seeks to restore the system to the level of performance it had before the disruption occurred.
Post-Incident Activity or Lessons Learned
This final step, although named differently, is shared by the NIST and SANS approaches.
It refers to the moment when the organization analyzes the situation in order to learn from experience. The aim is to understand how to better respond to future security incidents, or any type of incident, and record the improvements that need to be made in a document that will serve as a guideline in the future, both for incident handling and problem management. Documenting known errors can help prevent future incidents.
The five most common incident management issues
- Plans are not customized to the organization
Sometimes organizations put into practice standard incident resolution plans that are not tailored to their context or needs. Many ready-made plans are just ineffective or not well adapted to the company.
Recommendation: it is necessary to determine processes and strategies adapted to the type of business, objectives, environment and culture. It is important to either create from scratch or adapt a standard taking into account all the variables mentioned. An incident manager should then be able to put all this into practice.
- Lack of prioritization
Lack of prioritization increases the risk of missing critical incidents. Resources are limited, so it is important to prioritize and to differentiate critical from non-critical incidents. This should also be taken into account when setting out a problem management strategy.
Recommendation: organizations should establish a clear prioritization scheme, so that the teams know what should be addressed first. It is also recommended that an incident manager help automate responses as much as possible.
- Poor communication strategy and ways to collaborate
It is crucial to know what should be communicated and to whom in order to respond to an incident. Some organizations rely on email or spreadsheets, and the same information is sent many times, causing an overflow of messages that is ineffective and does not foster collaboration.
Recommendation: a clear communication strategy should be laid out to attain effective incident management. The first step is that service desk teams publish relevant data in a shared portal. It might be useful to resort to a centralized panel where all the latest details about the incident are clearly expressed. In this way, all the key actors will be able to get all the necessary information at once, without delay. This will lead to a better collaboration strategy and team work so that the response time is reduced.
- The response tools are inadequate
Some organizations have inadequate or outdated tools to solve incidents. At times, even when the tools are up to date, the service desk teams and the rest of the personnel might not use them properly, either because staff lack training or because the tools are not suited to the business.
Recommendation: members of the company that deal with incident management should receive proper training to be able to use all the necessary tools to perform their duties. It is important to regularly evaluate the tools to see if they need to be updated or if they are suited to respond to the threats the organization is exposed to.
Additionally, it's important to have suitable response tools such as InvGate Service Management.
- The incident response team doesn't have authority
Incident response teams must escalate issues to different areas of management to get the support they need. Partners, executives and other upper layers of management need to be informed of the issues and of the solutions being developed, and that information should translate into their support. This might change management in a positive way.
Recommendation: it is important to lay out an automated communication channel with management so that they are well informed and ensure their support to the incident response team through the whole process. They might also need to contact other areas to facilitate their work during the response process.
Companies use top-of-the-line ITSM solutions such as InvGate Service Management to effectively manage, communicate, and prevent incidents. The software is able to identify incident types as problems and tackle potential issues early and before they become problems.
Frequently asked questions about incident management
What is an incident?
An incident is a single unplanned event that causes a service disruption.
What is the difference between an incident and a problem?
A problem is the root cause of one or more incidents. Problem management processes therefore try to find and address that cause in order to prevent incidents from happening again in the future. Incident and problem management are closely related but are not the same.
Why is it important to have a clear incident management system?
A clear incident management system helps restore the affected service to normal as soon as possible, reducing the negative impact on business operations.
How are impact and urgency measured?
Impact is based on how much the service provided is affected, whereas urgency measures how much time passes before an incident has a significant impact on business operations.
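One common way to turn impact and urgency into a working order is a small priority matrix. The sketch below is a minimal illustration under stated assumptions: the three-level scale, the formula `impact + urgency - 1`, the names `Level`, `Incident`, `priority` and `triage`, and the sample tickets are all hypothetical, and each organization should define its own scheme.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """Assumed three-level scale; lower numbers are more severe."""
    HIGH = 1
    MEDIUM = 2
    LOW = 3


@dataclass
class Incident:
    title: str
    impact: Level
    urgency: Level


def priority(impact, urgency):
    """Return a priority from 1 (most critical) to 5 (least critical).

    Assumed rule: add the two levels and subtract 1, so that
    high impact plus high urgency yields priority 1.
    """
    return int(impact) + int(urgency) - 1


def triage(incidents):
    """Return the incidents ordered from most to least critical."""
    return sorted(incidents, key=lambda i: priority(i.impact, i.urgency))


if __name__ == "__main__":
    # Hypothetical tickets, for illustration only.
    queue = [
        Incident("Printer offline on one floor", Level.LOW, Level.LOW),
        Incident("Payment service down", Level.HIGH, Level.HIGH),
        Incident("Slow VPN for one team", Level.MEDIUM, Level.LOW),
    ]
    for incident in triage(queue):
        print(priority(incident.impact, incident.urgency), incident.title)
```

With a rule like this, the service desk can sort its queue so that incidents with the highest impact and urgency are handled first.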