What is the difference between an incident and a major incident?

A standard incident affects a limited number of users or services and follows normal incident procedures. A major incident affects critical services or large user groups, requiring a dedicated workflow, escalation, and cross-team coordination.

Who owns major incident communications?

The major incident manager or assigned response lead owns communications. They coordinate updates, approve announcements, and ensure consistent messaging to users and stakeholders.

What should a post-incident review include?

A post-incident review should cover the root cause, detection and escalation timelines, actions taken, communication effectiveness, lessons learned, and improvements for handling future major incidents.

Major Incident Management: Process, Roles, and a Practical Runbook

Q: What is Major Incident Management?

Major Incident Management is the process of handling high-impact IT incidents that disrupt normal operations and require immediate, coordinated action to restore services.

Major Incident Management (MIM) focuses on incidents that have a business-wide impact and demand an immediate, coordinated response. These incidents put critical services at risk, create urgent pressure to restore operations, and can quickly affect revenue, compliance, or reputation if they drag on.

Not every high-priority ticket qualifies as a major incident. A high-priority incident might be urgent for one team or user. A major incident goes further: it disrupts core services, affects many users or customers, escalates quickly, and requires leadership visibility and cross-team coordination to regain control.

In this article, we’ll explain how Major Incident Management works in practice, when to trigger it, and how to use ITSM tools to help teams respond faster when the impact is too big to handle as business as usual.

What is Major Incident Management (MIM)?

Major Incident Management (MIM) is the process organizations use to respond to, coordinate, and resolve incidents that cause significant business disruption. These incidents typically affect critical services, impact a large number of users, involve high financial or operational risk, or require immediate executive attention.

Unlike standard Incident Management, which focuses on restoring normal service for everyday issues, Major Incident Management introduces a dedicated, structured response process designed to restore critical services as quickly as possible. It usually involves rapid escalation, cross-functional collaboration, continuous stakeholder communication, and centralized coordination through a Major Incident Manager.

In the context of the ITIL framework, a major incident is not a separate practice but a specialized procedure within Incident Management. ITIL recommends establishing predefined criteria to identify major incidents, along with specific workflows, communication plans, escalation paths, and governance mechanisms that allow organizations to respond faster when business-critical services are at risk.

Major incident vs. critical incident

Although the terms are often used interchangeably, they do not always mean the same thing.

Aspect	Major incident	Critical incident
Primary focus	Business impact and service disruption	Severity, urgency, or potential consequences
Typical trigger	Affects a critical service, many users, or core business operations	A serious event that requires immediate attention
Scope	Usually involves multiple teams and coordinated response efforts	May affect a single system, user, or service

When to trigger Major Incident Management: Classification criteria

A major incident is defined by impact and urgency, not just priority. The moment an incident threatens core business operations, it stops being handled as routine work and requires a different level of response.

Use the following checklist to decide when to move from regular Incident Management to Major Incident Management.

An incident is considered major when one or more of these conditions apply:

Wide impact. The issue affects a large number of users, customers, or locations at the same time, rather than a single team or individual.
Critical services involved. Core systems such as email, authentication, ERP, customer-facing platforms, or payment services are unavailable or severely degraded.
High urgency to restore service. Delays quickly escalate business risk. Workarounds are limited or nonexistent, and normal response times are not acceptable.
Business or financial risk. The incident blocks revenue-generating activities, interrupts operations, or exposes the organization to contractual or regulatory issues.
Reputational impact. Customers, partners, or the public are aware of the disruption, or the issue is likely to reach them if not resolved fast.
Cross-team dependency. Resolution requires coordination across multiple teams, vendors, or support tiers, often under time pressure.

One criterion can be enough when its impact is severe. When several are met, treat the incident as major even if the root cause is still unclear.
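
The checklist above can be turned into a simple decision helper. The following is a minimal Python sketch, not part of any ITSM product: the class name, field names, and the `threshold` parameter are all hypothetical. The default threshold of 2 is an assumption rather than a rule from the checklist; set it to 1 if a single criterion should be enough in your organization.

```python
from dataclasses import dataclass, fields


@dataclass
class MajorIncidentChecklist:
    """One boolean flag per criterion in the checklist (names are illustrative)."""
    wide_impact: bool = False
    critical_service: bool = False
    high_urgency: bool = False
    business_or_financial_risk: bool = False
    reputational_impact: bool = False
    cross_team_dependency: bool = False

    def criteria_met(self) -> list[str]:
        # Return the names of every criterion that is flagged True.
        return [f.name for f in fields(self) if getattr(self, f.name)]


def should_trigger_mim(checklist: MajorIncidentChecklist, threshold: int = 2) -> bool:
    # threshold is an assumed policy value, not something the checklist prescribes.
    return len(checklist.criteria_met()) >= threshold


# Example: an ERP outage that hits many users and has no workaround.
erp_outage = MajorIncidentChecklist(
    wide_impact=True, critical_service=True, high_urgency=True
)
print(erp_outage.criteria_met())
print(should_trigger_mim(erp_outage))
```

Keeping the criteria as named flags makes the classification decision auditable: the record shows which conditions were met at the time, even when the root cause was still unknown.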

Typical examples of major incidents

Large-scale user impact
Example: More than 300 employees lose access to the ERP system during business hours, preventing order processing, inventory updates, or financial operations.
Critical service outage
Example: A company-wide email, identity, or authentication service outage prevents employees from accessing business applications and performing daily work.
Customer-facing service disruption
Example: An e-commerce website becomes unavailable during a sales campaign, preventing customers from placing orders and generating immediate revenue loss.
Multi-site or regional impact
Example: A network failure disconnects several offices or distribution centers, affecting hundreds of users across multiple locations.
Security event affecting operations
Example: A ransomware attack forces critical servers offline, disrupting business services while containment and recovery activities take place.
Failure of a business-critical deployment or change
Example: A software release introduces defects that prevent customers from logging in or completing transactions, requiring an emergency rollback.
High financial, regulatory, or reputational risk
Example: A payment processing outage prevents transactions for several hours, exposing the organization to revenue loss, SLA penalties, or customer complaints.

The key signal is simple: when the impact spreads beyond a single team and time becomes a business risk, you are no longer dealing with a standard high-priority incident.

Roles and responsibilities during a major incident

A successful Major Incident Management process depends on clearly defined roles. Everyone involved needs to know what’s expected from them — especially when time is critical and the pressure is on.

Here are the main roles and responsibilities typically involved in IT Major Incident Management:

Major incident manager – Leads the response effort, coordinates teams, and acts as the central point of contact.
IT support teams – Work on diagnosing and resolving the issue, based on their area of expertise (infrastructure, networking, applications, etc.).
Service desk – Logs the incident, communicates with end users, and escalates as needed.
Communications lead – Ensures consistent, timely updates to all stakeholders, including business leaders, customers, and internal teams.
Change manager (when applicable) – Coordinates any emergency changes that need to be deployed to resolve the issue.
Business stakeholders – Provide business context, assess impact, and help prioritize efforts if there are competing risks.

Major Incident Management process: steps

A solid Major Incident Management process needs to be fast, structured, and clear. In high-pressure situations, improvising is not an option — everyone needs to know exactly what to do and when. Here are the five essential steps.

Step 1: Detect and classify the incident

Detection usually comes from monitoring tools, alerts, or user reports. Classification is the real decision point.

At this stage, teams evaluate:

Services and business processes affected.
Number of users or customers impacted.
Urgency and business exposure.
Breach risk against SLA or regulatory obligations.

Priority in ITIL comes from impact and urgency together. A high-priority incident that meets the agreed threshold triggers the major incident procedure. The objective at this stage is the classification decision, with root cause investigation coming later.
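
To make the impact-and-urgency rule concrete, here is a hedged Python sketch of a priority matrix. The 3x3 layout, the five priority values, and the `MAJOR_INCIDENT_THRESHOLD` constant are illustrative assumptions. ITIL does not mandate a specific matrix, so each organization should agree on its own values.

```python
from enum import IntEnum


class Level(IntEnum):
    HIGH = 1
    MEDIUM = 2
    LOW = 3


# Illustrative matrix: (impact, urgency) -> priority, where 1 is the highest.
PRIORITY_MATRIX = {
    (Level.HIGH, Level.HIGH): 1,
    (Level.HIGH, Level.MEDIUM): 2,
    (Level.MEDIUM, Level.HIGH): 2,
    (Level.HIGH, Level.LOW): 3,
    (Level.MEDIUM, Level.MEDIUM): 3,
    (Level.LOW, Level.HIGH): 3,
    (Level.MEDIUM, Level.LOW): 4,
    (Level.LOW, Level.MEDIUM): 4,
    (Level.LOW, Level.LOW): 5,
}

# Assumed policy: only priority 1 triggers the major incident procedure.
MAJOR_INCIDENT_THRESHOLD = 1


def priority(impact: Level, urgency: Level) -> int:
    return PRIORITY_MATRIX[(impact, urgency)]


def triggers_major_incident(impact: Level, urgency: Level) -> bool:
    return priority(impact, urgency) <= MAJOR_INCIDENT_THRESHOLD


print(priority(Level.HIGH, Level.MEDIUM))                 # 2
print(triggers_major_incident(Level.HIGH, Level.HIGH))    # True
```

A lookup table keeps the rule explicit and easy to review, which matters more at this stage than a clever formula.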

Early communication is part of this first step. When users lack information, they open duplicate tickets, escalate through informal channels, or try risky workarounds, all of which slow down recovery. Even partial updates help set expectations by confirming that an incident is in progress, clarifying which services are affected or under investigation, and stating that teams are actively working on containment or restoration.

Step 2: Coordinate

Once classified, the incident must be escalated to the right teams — including technical experts, business stakeholders, and the service desk. According to the ITIL framework, this step should follow a predefined escalation path.

The core moves here:

Assign a major incident manager.
Open a war room — a dedicated bridge, physical or virtual, where responders and stakeholders coordinate in real time.
Bring technical teams and business stakeholders onto the bridge.
Establish clear decision authority.

The war room concentrates diagnosis, decisions, and communication in one place. It supports swarming, where specialists work the incident together until ownership settles with the team best positioned to resolve it. The major incident manager runs the room: coordinating work, controlling the flow of updates, and keeping teams aligned on shared priorities. They own coordination, and technical troubleshooting stays with the specialist teams.

Step 3: Respond and contain the impact

The goal here is to stabilize the situation and limit further damage, not to fix the underlying issue.

Typical containment actions include:

Isolating affected systems or components.
Disabling failing integrations or features.
Rolling back recent changes.
Switching to backups or failover environments.

These actions may be temporary. Their purpose is to stabilize services and prevent the incident from escalating while the investigation continues.

Clear updates during this phase help reduce tension and keep teams aligned on the immediate goal: stopping further impact.

Step 4: Resolve and recover

With the incident contained, teams can work toward a permanent resolution.

This phase usually involves:

Identifying and fixing the root cause.
Restoring services to normal operation.
Validating performance, access, and dependencies.
Confirming recovery with affected stakeholders.

Documentation happens here too, capturing timelines, actions, and decisions while details are fresh. Where the root cause stays unresolved or points to a deeper fault, the incident feeds a problem record for follow-up under Problem Management.

Step 5: Review and improve

Once everything is up and running, teams conduct a post-incident review. The goal is to analyze what went wrong, what went right, and what can be done better next time.

Keep it blameless. A blameless review focuses on facts, root causes, and improvement opportunities, which is what surfaces honest detail about how the incident unfolded. Feed the findings back into major incident roles and responsibilities, playbooks, escalation paths, and communication protocols.
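
A review is easier to keep factual when the timeline is reduced to measurable phases. The sketch below is one possible way to do that in Python. The event names and timestamps are made-up sample values for illustration only, and `phase_durations` is a hypothetical helper, not a feature of any specific tool.

```python
from datetime import datetime, timedelta

# Sample timeline with invented timestamps, for illustration only.
timeline = {
    "started": datetime(2024, 1, 1, 9, 0),
    "detected": datetime(2024, 1, 1, 9, 12),
    "declared_major": datetime(2024, 1, 1, 9, 20),
    "contained": datetime(2024, 1, 1, 10, 5),
    "resolved": datetime(2024, 1, 1, 11, 30),
}


def phase_durations(events: dict[str, datetime]) -> dict[str, timedelta]:
    # Sort events chronologically, then measure the gap between each pair of neighbors.
    ordered = sorted(events.items(), key=lambda item: item[1])
    return {
        f"{name_a} -> {name_b}": time_b - time_a
        for (name_a, time_a), (name_b, time_b) in zip(ordered, ordered[1:])
    }


for phase, duration in phase_durations(timeline).items():
    print(f"{phase}: {duration}")

total = timeline["resolved"] - timeline["started"]
print(f"total: {total}")
```

Comparing these phase durations across incidents shows whether detection, escalation, or recovery is the slowest part of the response.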

Communication templates for Incident Management

Clear, consistent communication reduces uncertainty and keeps users aligned with the response effort. During a major incident, communication is also part of containment. Timely updates help prevent duplicate tickets, reduce calls and messages to the service desk, limit speculation, and set realistic expectations about recovery timelines.

When users know the issue has been identified and is actively being addressed, technical teams can focus on diagnosis and restoration rather than repeatedly answering the same questions. These templates are meant to be brief, factual, and easy to adapt throughout the incident lifecycle.

Initial update: Use this message as soon as the incident is classified as major.

We’re currently investigating an incident affecting [service/system].
Some users may experience [brief impact].
Our teams are actively working to contain the issue.
We’ll share another update by [time] or sooner if there’s a change.

Ongoing update: Use this while the incident is still active and under investigation.

The incident affecting [service/system] is still in progress.
Impact remains limited to [users/areas], and no additional services are affected at this time.
Teams continue working on containment and recovery.
The next update will be shared by [time].

Resolution notice: Send this once services are fully restored and validated.

The incident affecting [service/system] has been resolved.
Services were restored at [time], and normal operation has resumed.
We’re reviewing the incident to identify follow-up actions and prevent recurrence.
Thanks for your patience.
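
If you send these updates from a script or chatbot, the templates can be filled in programmatically. This is a minimal sketch using Python's standard `string.Template`. The placeholder names (`service`, `impact`, `next_update`) and the `render_initial_update` function are assumptions chosen for this example.

```python
from string import Template

# The initial update from the templates above, with $placeholders
# standing in for the bracketed fields.
INITIAL_UPDATE = Template(
    "We're currently investigating an incident affecting $service.\n"
    "Some users may experience $impact.\n"
    "Our teams are actively working to contain the issue.\n"
    "We'll share another update by $next_update or sooner if there's a change."
)


def render_initial_update(service: str, impact: str, next_update: str) -> str:
    # substitute() raises KeyError if a placeholder is left unfilled,
    # so an incomplete message cannot be sent by accident.
    return INITIAL_UPDATE.substitute(
        service=service, impact=impact, next_update=next_update
    )


print(render_initial_update("the ERP system", "login failures", "14:30 UTC"))
```

Using `substitute()` rather than `safe_substitute()` is a deliberate choice here: under pressure, a loud error is safer than an announcement that still contains a raw placeholder.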

How to manage major incidents with InvGate Service Management

InvGate Service Management supports Major Incident Management by giving teams structure without slowing them down. The idea is to guide response through workflow automation, maintain visibility while the incident is active, and capture everything needed for review afterward with analytics and reporting.

Here’s how it helps your team stay in control when it matters most:

1- Creating a major incident in InvGate Service Management

Major incidents rarely arrive labeled as such. A request comes in, an agent works it, and the true scope becomes clear once the impact spreads across users and services. InvGate Service Management handles this transition with Promote to Major Incident. An agent turns an existing request into a major incident directly, and the original request becomes the major incident record, with its history, timeline, and prior interactions preserved, so the team keeps the context it has already gathered. The same action is available through the API for teams that trigger promotion from monitoring or automation.

Major incidents act as a coordination layer for mass events. Once created, the platform checks for similar active major incidents and offers to link to an existing one, which keeps duplicate major incident records from accumulating. You can also manually link multiple incident requests to the major incident.

When the major incident is resolved, the solution can be propagated automatically to all related incidents, applying the same resolution comment and moving them to customer confirmation.
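
The linking and propagation behavior described above can be pictured with a small in-memory model. This Python sketch is purely conceptual. It does not use or represent the InvGate Service Management API, and the class names, method names, and status strings are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    id: int
    status: str = "open"
    comments: list[str] = field(default_factory=list)


@dataclass
class MajorIncident:
    id: int
    status: str = "open"
    linked: list[Incident] = field(default_factory=list)

    def link(self, incident: Incident) -> None:
        # Ignore incidents that are already linked, so the list has no duplicates.
        if all(existing.id != incident.id for existing in self.linked):
            self.linked.append(incident)

    def resolve(self, resolution_comment: str) -> None:
        # Resolve the major incident, then copy the same comment to every
        # linked incident and move it to a confirmation state (hypothetical name).
        self.status = "resolved"
        for incident in self.linked:
            incident.comments.append(resolution_comment)
            incident.status = "waiting_for_customer_confirmation"


major = MajorIncident(id=100)
first, second = Incident(id=1), Incident(id=2)
major.link(first)
major.link(second)
major.link(first)  # second attempt is ignored
major.resolve("Service restored after rolling back the release.")
print(len(major.linked), first.status)
```

The point of the model is the one-to-many relationship: responders resolve once, and every affected request receives the same outcome.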

2- AI features for Major Incident Management

InvGate Service Management applies artificial intelligence to help teams detect major incidents earlier and communicate more effectively during critical events.

AI-powered major incident detection

Major incidents often emerge from multiple related reports. AI continuously analyzes incoming incidents to identify patterns that suggest a broader issue.

When a potential major incident is detected:

Help desk managers receive a system notification and email
The suggested major incident includes AI-provided reasoning
Managers can create the major incident with prefilled data and linked requests

To enable this functionality, go to Settings → AI Hub → Proactive detection and activate Major Incident detection.
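
For intuition only, the idea of spotting a broader issue from related reports can be approximated with a simple rule: flag a service when many incidents arrive for it within a short window. The Python sketch below is a plain heuristic and is not how the product's AI detection works. The function name, the 15-minute window, and the 5-report minimum are all assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def suspected_major_incidents(reports, window=timedelta(minutes=15), min_reports=5):
    """Return the services with at least min_reports reports inside any window.

    reports is an iterable of (service, opened_at) tuples.
    """
    by_service = defaultdict(list)
    for service, opened_at in reports:
        by_service[service].append(opened_at)

    flagged = set()
    for service, times in by_service.items():
        times.sort()
        start = 0
        for end, current in enumerate(times):
            # Slide the window start forward until it spans at most `window`.
            while current - times[start] > window:
                start += 1
            if end - start + 1 >= min_reports:
                flagged.add(service)
                break
    return flagged


# Sample data with invented timestamps: five email reports in eight minutes,
# and two VPN reports two hours apart.
base = datetime(2024, 1, 1, 9, 0)
sample_reports = [("email", base + timedelta(minutes=2 * i)) for i in range(5)]
sample_reports += [("vpn", base), ("vpn", base + timedelta(hours=2))]
print(suspected_major_incidents(sample_reports))  # {'email'}
```

A fixed threshold like this produces false positives and misses slow-building incidents, which is one reason pattern-based detection is usually paired with human confirmation.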

Predictive risk and impact analysis

AI also supports classification by suggesting risk and impact levels based on historical data from similar cases. This helps teams assess business exposure faster and apply consistent criteria during escalation.

AI-generated announcement suggestions

Communication is another common failure point during major incidents. InvGate Service Management addresses this with automatic announcement suggestions.

When a major incident is created or updated, the system suggests and drafts an announcement. Agents and administrators can review, edit, and publish it immediately, to keep users informed and prevent a flood of duplicate tickets.

To enable this feature, go to Settings → AI Hub → Agent assistance and activate Suggestions for announcements associated with major incidents.

3- Post-incident review and continuous improvement

After resolving a major incident, the focus shifts to learning and preventing future disruptions. InvGate Service Management provides tools to make post-incident activities structured and actionable.

Analytics and reporting: Use built-in dashboards and reports to analyze timelines, escalation patterns, affected services, and team performance. These insights help identify bottlenecks and measure response effectiveness.
Problem Management: Link the major incident to problem records to investigate root causes, track recurring issues, and implement long-term fixes. This reduces the chance that the same disruption repeats.
Document post-incident learnings: Capture key decisions, communication effectiveness, and lessons learned in a structured format. Store this documentation for audits, future reference, and continuous process improvement.

Major incidents are easier to manage when your platform centralizes detection, escalation, and communication in one place. Start a free trial of InvGate Service Management today and see how your team can respond faster, reduce noise, and keep users informed during critical disruptions.