Incident Management Handbook JSM


Incident Management Handbook for Jira Service Management

Introduction

Chapter 1: Incident management overview
Overview
Who is this guide for?
What is an incident?
Our incident values
The importance of incident management
Types of incident management processes

Chapter 2: Incident response
Overview
Detect the incident
Set up team communication channels
Assess the impact
Escalate to the right responders
Delegate incident response roles
Send follow-up communications
Iterate incident analysis and recovery
Resolve the incident

Chapter 3: Post-incident reviews
The importance of a post-incident review
Best practices for a post-incident review
An effective post-incident review plan

Chapter 4: Incident management analytics
How to choose incident management key performance indicators and metrics
The value of incident key performance indicators, metrics, and analytics
Useful incident key performance indicators and metrics

Chapter 5: Good practices for modern incident management
The challenges facing modern IT incident management
Optimizing incident management practices across teams

Conclusion

Introduction
How incident management is described often varies by
organization; however, there is consensus that effective
incident management is vital to maintaining business
functions. And while incident management isn’t new –
systems have been breaking, and organizations have
rushed to fix them for a long time – in 2023, customers
expect proactive service, personalized interactions, and
connected experiences. Companies must release features
and provide new services quickly to remain competitive in
an accelerating marketplace, but balancing development
velocity with reliability and performance can be difficult.
This reality forces organizations to improve their incident
management practices to keep pace and provide
exceptional service experiences.

And the impact of poor incident management can’t be
overstated. Surveys have highlighted the heavy costs of
downtime to small and medium businesses as well as
enterprises. Teams must resolve outages, performance
degradations, application errors, and other types of
incidents quickly to keep customers engaged and avoid
customer churn.

Whether you are implementing an established
methodology such as ITIL 4, ITSM, DevOps, or SRE,
or creating your own, you need to outline the process for
incident management execution, and all team members
involved in that process must understand and support it.

Establishing an incident management practice that team
members and customers trust and rely on can be difficult. It
requires unifying diverse teams, different objectives, and
loads of data. An effective incident management practice
helps teams communicate efficiently, share data quickly,
then learn and improve.

This handbook describes Atlassian’s approach to incident
management principles and practices. Atlassian’s approach
to incident management is grounded in ITIL 4 principles; it
also includes site reliability engineering (SRE), DevOps, and
IT security response practices. Along the way, we’ll show
how to apply these lessons using Jira Service Management.
01
Incident management overview
Overview
Teams running technology services today are expected to maintain 24/7
availability.

When something goes wrong, whether it’s an outage or a broken feature,
team members need to respond immediately and restore service. This process
is called incident management, and it’s an ongoing, complex challenge for
companies, big and small.

We want to help teams everywhere improve their incident management.
Inspired by teams, we’ve created this handbook to summarize Atlassian’s
incident management process. We’ve learned these lessons in responding to
incidents for over two decades. While it’s based on our unique situation and
experiences, we hope it can be adapted to suit the needs of your team. We’ve
also incorporated some other popular options for specific practices into the
handbook.

Who is this guide for?


If you’re on a development or operations team that looks after services for
customers who require 24/7 availability, this handbook is for you. Incident
management is the process used by development and IT Operations teams to
respond to an unplanned event or service interruption and restore the service
to its operational state.

If you are a manager of a company’s legal, public relations, or human resources
team, this handbook is also for you. Why? Because incident management
also includes an organization’s broader strategic handling of an incident. It
requires the coordinated oversight of the leadership group, which usually
consists of representatives from teams such as the executive board, IT, legal,
communications, and HR.

What is an incident?
Incident management processes vary from company to company, but the key
to success for any team is clearly defining and communicating severity levels,
priorities, roles, and processes up front – before a major incident arises. At
Atlassian, we define an incident as an event that causes disruption to or a
reduction in the quality of a service that requires an emergency response. Some
teams who follow ITIL or ITSM practices may be more familiar with the ITIL 4
definition of an incident as an unplanned interruption to a service, or reduction
in the quality of a service.

An incident is resolved when the affected service resumes functioning
normally. The incident response includes only those tasks required to restore
full functionality and excludes follow-on tasks such as root cause identification
and mitigation, which are part of the post-incident review (PIR).

The post-incident review (often called an incident postmortem) is performed
after the incident to determine the root cause and assign actions to prevent
repeat incidents.

Our incident values
A process for managing incidents can’t cover all possible situations, so we
empower our teams with general guidance in the form of values. Similar to
Atlassian’s company values, our incident values are designed to:

· Guide autonomous decision-making by people and teams in incidents and
post-incident reviews.

· Build a consistent culture between teams of how we identify, manage,
and learn from incidents.

· Align teams as to what attitude they should bring to each part of incident
identification, resolution, and reflection.

Atlassian’s incident management values

· Detect: Atlassian knows before customers do
· Respond: Escalate, escalate, escalate
· Recover: Shit happens, clean it up quickly
· Learn: Always blameless
· Improve: Never have the same incident twice

The importance of incident management
Incident management is one of the most critical processes an organization
needs to get right. Service outages can be costly, and teams need an efficient
way to quickly respond to and resolve these issues. Teams need a reliable
method to prioritize incidents, get to resolution faster, and offer better service
for users.

When teams face an incident, they need a plan that helps them:

· Respond effectively so they can recover fast.

· Communicate clearly to customers, stakeholders, service owners, and
others in the organization.

· Collaborate effectively to solve the issue faster as a team and remove
barriers preventing them from resolving it.

· Continuously improve to learn from these outages and apply lessons to
refine their process for the future.

Types of incident management processes
No single incident management process is best for all companies, so you’ll
likely see various approaches across different companies. Many teams rely
on a more traditional IT-style incident management process, such as those
outlined in ITIL certifications. Other teams lean toward a more site reliability
engineering (SRE)- or DevOps-style incident management process.

IT-style incident management process


An incident management process helps IT teams investigate, record, and
resolve service interruptions or outages. The ITIL incident management
workflow aims to reduce downtime and minimize the impact on employee
productivity from incidents. Using templates designed to manage incidents, you
can create a repeatable incident management workflow, which ensures teams
log, diagnose, and resolve incidents – and have a record of their activities.

The ITIL framework is chiefly used by IT teams running services inside
businesses. Typically, teams take what they need from ITIL – which covers
almost every type of incident, issue, and process IT teams might face – and
leave the rest. ITIL is great when teams focus on cultivating a culture of active
troubleshooting. The prescribed processes help teams track incidents and
actions in a consistent manner, which improves reporting and analysis and can
lead to a healthier service and a more successful team.

IT Infrastructure Library (ITIL) is a set
of practices that focus on aligning IT
services with business needs. ITIL is
the most widely accepted approach to
IT service management and can help IT
organizations realize business change,
transformation, and growth.

IT service management (ITSM) is how
IT teams manage the end-to-end
delivery of IT services to customers.
This methodology includes all the
processes and activities to design,
create, deliver, and support
IT services.

Steps in the IT incident management process


1 IDENTIFY AN INCIDENT AND LOG IT

An incident can come from anywhere: an employee, a customer, a vendor, or a
monitoring system. No matter the source, the first two steps are simple:
someone identifies an incident, then someone logs it. These incident logs
(i.e., tickets) typically include:

· The name of the person reporting the incident

· The date and time the incident is reported

· A description of the incident (what is down or not working properly)

· A unique identification number assigned to the incident, for tracking
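
As an illustration, the four fields listed above could be captured in a small record type. This is only a minimal sketch in Python, not a Jira Service Management schema: the class name `IncidentLog`, its field names, and the `INC-` identifier format are assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from itertools import count

# Hypothetical counter used to hand out unique incident numbers.
_next_id = count(1)


@dataclass
class IncidentLog:
    """Illustrative incident ticket holding the four fields listed above."""

    reporter: str      # name of the person reporting the incident
    description: str   # what is down or not working properly
    # Date and time the incident is reported (defaults to "now", in UTC).
    reported_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    # Unique identification number assigned for tracking (format is assumed).
    incident_id: str = field(
        default_factory=lambda: f"INC-{next(_next_id)}"
    )


first = IncidentLog("Alex", "Checkout page returns errors")
second = IncidentLog("Sam", "Login is slow")
print(first.incident_id, first.reported_at.isoformat(), first.description)
print(second.incident_id, second.reported_at.isoformat(), second.description)
```

In practice the tracking tool assigns the identifier and timestamp itself; the sketch only shows that every log entry carries the same minimum set of fields.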

2 CATEGORIZE

Assign a logical, intuitive category (and subcategory, as needed) to every
incident. This classification helps you analyze your data for trends and
patterns, which are critical to effective problem management and preventing
future incidents.

3 PRIORITIZE

Every incident must be prioritized. Start by assessing its impact on the
business, the number of people who will be affected, any applicable SLAs,
as well as the potential financial, security, and compliance implications of
the incident. Compare this incident to all other open incidents to determine
its relative priority. As a best practice, define your severity and priority levels
before an incident happens, making it simpler for incident managers to gauge
priority quickly.

4 RESPOND

· Initial diagnosis: Ideally, your front-line support team can see an incident
from diagnosis through close, but if they can’t, the next step is to log all
the pertinent information and escalate to the next-tier team.

· Escalate: The next team takes the logged data and continues with the
diagnosis process, and if this next team can’t diagnose the incident, it
escalates to the next team.

· Communicate: The team regularly shares updates with affected internal
and external stakeholders.

· Investigation and diagnosis: This continues until the nature of the
incident is identified. Sometimes teams bring in outside resources or
other department members to consult and assist with the resolution.

· Resolution and recovery: In this step, the team arrives at a diagnosis
and performs the necessary steps to resolve the incident. Recovery
refers to the time it may take for operations to be fully restored,
since some fixes (like bug patches) may require testing and
deployment even after the proper resolution has been identified.

· Closure: If the incident was escalated, it is finally passed back to the
service desk to be closed. To maintain quality and ensure a smooth
process, only service desk employees are allowed to close incidents. The
incident owner should check with the person who reported the incident to
confirm that the resolution is satisfactory and the incident can be closed.

DevOps- and SRE-style incident management process

With a DevOps or SRE approach to incident management, the team that builds the
service also runs it – and fixes it if it breaks. This approach has exploded in popularity
alongside the growth of always-on cloud services, globally-accessed web applications,
microservices, and software as a service (SaaS).

Increasingly the software you rely on for life and work is not being hosted on a server
in the same physical location as you. It’s likely a web-accessed application deployed
in a data center for thousands or millions of users around the globe. For teams tasked
with running these services, agility and speed are paramount. Any downtime has the
potential to affect thousands of organizations, not just one.

An advantage of the “you build it, you run it” approach is that it offers the flexibility
agile teams need, but it can also obscure who is responsible for what and when.
DevOps teams can be comfortable – and successful – with less structured development
processes. But it’s best to standardize on a core set of procedures for incident
management so there is no question about how to respond in the heat of an incident,
and you can track issues and report how they’re resolved.

DevOps is a set of practices, tools, and a cultural philosophy that
automate and integrate the processes between software development
and IT teams. It emphasizes team empowerment, cross-team
communication and collaboration, and technology automation.

Site reliability engineering (SRE) is the practice of applying software
engineering principles to operations and infrastructure processes to
help organizations create highly reliable and scalable software systems.
To quote Ben Sloss at Google, “SRE is what happens when you ask a
software engineer to design an operations team.”

DevOps teams focus on core development; SREs implement the core
code. Their common goal is a better result for complex distributed
systems, and both focus on people working together as a team with
shared responsibilities.

Three beliefs of DevOps incident management teams
· Take turns being on-call: Rather than certain team members specializing
in being on-call, DevOps teams typically rotate through an on-call
schedule where all members share the burden of possibly being woken at
night to respond to an incident.

· The engineer who built it is the best person to fix it: The central idea of
the “you build it, you run it” ethos is that the people most familiar with
the service (the builders) are the best equipped to fix an outage.

· Build with speed, but practice accountability: When engineers know that
they and their teammates are on the hook during outages, there’s added
incentive to ensure they’re deploying quality code.

This approach assures fast response times and quick feedback to the teams
needing to build a reliable service.
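
The "take turns being on-call" idea can be sketched as a simple round-robin schedule. The example below is a minimal illustration in Python under stated assumptions: the function name `on_call_for`, the team names, the start date, and the seven-day shift length are all hypothetical, and real on-call tools also handle overrides, time zones, and escalation.

```python
from datetime import date


def on_call_for(day: date, team: list[str], start: date, shift_days: int = 7) -> str:
    """Return who is on call on `day` in a simple round-robin rotation."""
    if not team:
        raise ValueError("team must not be empty")
    if day < start:
        raise ValueError("day is before the rotation start")
    # Count how many full shifts have passed, then wrap around the team list.
    shifts_elapsed = (day - start).days // shift_days
    return team[shifts_elapsed % len(team)]


team = ["Ana", "Ben", "Chen"]   # hypothetical team members
start = date(2023, 1, 2)        # hypothetical first day of the rotation

for day in (date(2023, 1, 2), date(2023, 1, 9), date(2023, 1, 16), date(2023, 1, 23)):
    print(day.isoformat(), on_call_for(day, team, start))
```

Because every member appears in the list exactly once, the burden of being paged is shared evenly over time.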

Co-existing methodologies
The best-performing incident teams use a collection of the right tools, practices,
and people. Each of these frameworks and best practices adds value to your
organization’s incident management process and can accelerate and improve
your teams’ incident response. These methodologies can coexist to
align teams, meet customer demands, and improve the value delivered.

No matter which methodology (or combination thereof) you choose, you’ll
have the most success if you focus on:

· A culture with a common vision and purpose

· Promoting cooperation and shared responsibility

· Making decisions and making them visible

· Defining customer-centric metrics and continuing to improve the value
that you deliver

Digital transformation is not achieved instantly across an organization, so
companies should start small with best practices and methodologies suited to
their rapidly changing needs, then learn, build expertise, and scale up.

Incident management tools
There is no single, one-size-fits-all tool for incident management.

Some tools are specific to incident management, others are more general-purpose
tools your team also uses for other tasks. And some tools might be an in-house
solution built upon layers of integrations and customization.

No matter the use case, good incident management tools have a few things in
common. The best incident management tools are open, reliable, and adaptable.

[Figure: Incident management tools. Before the incident: monitoring, service desk, alerting and on-call. During the incident: service configuration data, team communication, customer communication, incident command center. After the incident: postmortem and analysis, issue tracking.]

Open: In a high-pressure environment like an incident, it’s essential that the
right people have access to the right tools and information immediately. This
not only goes for incident responders but for company stakeholders who need
visibility into response efforts.

Reliable: There are few things worse during incident response than also having
your key response tools go down. Using cloud tools such as Jira Service
Management (JSM), Jira Software (JSW), and Slack minimizes the risk of an outage
on your infrastructure taking down your response tools.

Adaptable: Things like integrations, workflows, add-ons, customization, and APIs
all open up the possibilities behind the product. You may want to get started
with an out-of-the-box configuration, but as your practices and processes
mature, you’ll want your tools to be flexible enough to support changing needs.

Before the incident

Monitoring
Monitoring tools let DevOps and IT Ops teams collect, aggregate, and trigger alerts off
data from thousands of different services in real-time. These are critical to providing
complete visibility into the health of your services and often trigger the first alarm
bells during an incident.

BENEFITS

Monitoring tools give your team constant insight into the health of the infrastructure.
Modern monitoring tools also proactively trigger alerts during unexpected activity.

Feature set

· 24/7 coverage and analytics
· Integrates with alerting tools

Questions to ask

· Does the tool have visibility into all my servers and infrastructures?
· Can my team see real-time analytics and dashboards and set alerting thresholds?
· Does the product integrate with my alerting and on-call tool?

Service desk
Service desk software gives customers and employees a place to report incidents and
potential incidents.

BENEFITS

Along with their many other use cases (service requests, IT help desk), service desks
empower your team to quickly learn about incidents from the people who matter
most: your users and customers.

Feature set

· Enable self-serve

Questions to ask

· Can customers quickly file tickets through a service portal?
· Can customers find the help they need with automated knowledge-based suggestions?

Alerting and on-call
Prompt and reliable alerting is a critical step in incident response. This procedure is
how teams ensure the right people are aware of an incident.

BENEFITS

Alerting tools notify designated on-call responders through a sophisticated
combination of scheduling, escalation paths, and notifications.

Feature set

· Works globally
· Multiple notification methods

Questions to ask

· Can I send notifications (SMS, voice, email) to almost anywhere in the world?
· Can I send notifications using multiple notification methods like email, SMS,
phone, mobile app push, and try them multiple times?
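
To make "multiple notification methods, tried multiple times" concrete, here is a minimal sketch in Python. It does not call any real alerting product: the `notify` function, the `Sender` callback type, and the fake email and SMS senders are hypothetical stand-ins for whatever delivery channels your alerting tool provides.

```python
from typing import Callable, Dict, Optional

# A sender takes (recipient, message) and returns True if delivery succeeded.
Sender = Callable[[str, str], bool]


def notify(recipient: str, message: str, senders: Dict[str, Sender],
           attempts: int = 3) -> Optional[str]:
    """Try each method in order, for up to `attempts` rounds.

    Returns the name of the first method that succeeds, or None if
    every method failed in every round.
    """
    for _ in range(attempts):
        for method, send in senders.items():
            if send(recipient, message):
                return method
    return None


# Hypothetical senders that simulate delivery for the example.
calls = {"sms": 0}


def fake_email(recipient: str, message: str) -> bool:
    return False  # simulate an undeliverable email


def fake_sms(recipient: str, message: str) -> bool:
    calls["sms"] += 1
    return calls["sms"] >= 2  # simulate success on the second try


result = notify(
    "on-call engineer",
    "SEV2: checkout errors",
    {"email": fake_email, "sms": fake_sms},
)
print(result)
```

A real tool would typically also wait between attempts and escalate to the next responder when nobody acknowledges the alert; those steps are left out to keep the sketch short.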

During the incident
Asset and service configuration information
Understanding the interdependencies within your infrastructure is key to determining
the full impact of the incident and reaching a faster resolution.

BENEFITS

A configuration management database (CMDB) helps you understand the
relationships and dependencies within your IT infrastructure. If something goes down,
this map lets you rapidly find:

· Potential causes of the incident. For example, determining which host a
service is running on at the click of a button.

· Trickle-down effects of the incident. For example, discovering other
services that are running on the same, troublesome host.

This means you can quickly investigate and communicate all aspects of the incident.
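
The "trickle-down effects" lookup amounts to walking a dependency graph. The following is a minimal sketch in Python, assuming the CMDB data has already been exported into a plain dictionary; the service and host names, the `depends_on` structure, and the `affected_by` function are hypothetical and do not represent any CMDB product's API.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical CMDB export: each service maps to the items it depends on.
depends_on: Dict[str, List[str]] = {
    "checkout": ["payments-api", "host-a"],
    "payments-api": ["host-b"],
    "search": ["host-a"],
    "reports": ["search"],
}


def affected_by(failed: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return every service that directly or indirectly depends on `failed`."""
    # Invert the map: item -> services that depend on it.
    dependents: Dict[str, List[str]] = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    # Breadth-first walk outward from the failed item.
    affected: Set[str] = set()
    queue = deque([failed])
    while queue:
        for service in dependents.get(queue.popleft(), []):
            if service not in affected:
                affected.add(service)
                queue.append(service)
    return affected


print(sorted(affected_by("host-a", depends_on)))
```

Here a failure of the hypothetical `host-a` reaches `checkout` and `search` directly and `reports` indirectly through `search`, which is the kind of answer responders need when scoping an incident.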

Feature set

· Multiple channels
· Integrations

Questions to ask

· How flexible is the CMDB? Can I store any CI or asset?
· Can I visualize my infrastructure graphically?
· Can I link CIs/assets with my service desk issues?
· Can I link CIs/assets to change requests?

Team communication
Clear and reliable communication is undeniably critical during incident management.

BENEFITS

A solid communication platform helps teams communicate and share observations, links,
and screenshots in a way that’s timestamped and preserved. This brings the right
information and people together during an incident, and creates a rich record to learn
from after the incident.

Feature set

· Multiple channels
· Integrations

Questions to ask

· Can my incident response team quickly spin up a dedicated channel for an incident?
· Can other tools in my incident toolchain post into my team’s communication channel?

Customer communication
Customer communication tools help keep customers informed during an incident.

BENEFITS

There’s no getting around it: incidents are typically a bad experience for your
customers. Keeping customers informed builds trust and speeds up response efforts.
Communicating with customers lets them know you’re aware of the incident and
working on a fix.

Feature set

· Off of my infrastructure
· Subscribers and notifications

Questions to ask

· Will my communication tool be operational and accessible even if my internal
infrastructure is down?
· Can customers opt in to get notifications when I post about an incident?

Incident command center
An incident command center is where your established record of the incident and its
key details live.

BENEFITS

After an incident is resolved, teams still often don’t know the root causes and are at
risk of the same incident happening again. Post-incident reviews help to prevent that
by bringing the team together for an after-action analysis.

Feature set

· Source of truth
· Timeline

Questions to ask

· Can team members and stakeholders quickly get up to speed on the incident?
· Can team members and stakeholders use this record to locate all the other
details of the incident and response activities?

After the incident
Post-incident review and documentation
A post-incident review (also called a postmortem) is a written record of what happened
during the incident and any follow-up actions taken to prevent it from happening again.

BENEFITS

After an incident is resolved, teams still often don’t know the root causes and are at
risk of the same incident happening again. Post-incident reviews help to prevent that
by bringing the team together for an after-action analysis.

Feature set

· Templates
· Map out next actions

Questions to ask

· Can my team use a template to fill out a post-incident review/postmortem?
· Can my team plan out next actions and remediation work during a post-incident
review/postmortem?

Federated issue tracking


An issue tracking tool helps the team map out future remediation work that needs to
be done.

BENEFITS

In many cases, resolving the incident brings the service back online without addressing
the root cause. Typically there is more engineering work that needs to be done in
order to remediate root causes and make sure the incident doesn’t repeat itself.
Issue and work tracking tools — which your team is hopefully already using for other
development work — help make sure this work is prioritized and doesn’t fall through
the cracks.

Feature set

· Shared workflow pipeline
· Integrations

Questions to ask

· Can my team plan any incident remediation work alongside their other work
and priorities?
· Can my team pull in data and content from my other incident tools?

02
Incident response
Overview
Incident response is an organization’s process of reacting to IT threats such as
cyberattacks, security breaches, and server downtime.

The following sections describe an incident response process – what to do
between realizing a service is down and getting it up and running again. The
process includes seven fundamental stages:

1. Detect the incident

2. Set up team communication channels

3. Assess the impact and apply a severity level

4. Communicate with customers

5. Escalate to the right responders

6. Delegate incident response roles

7. Resolve the incident

[Figure: Incident response. Incident status: Open, Work in progress, Completed. IM activity: Detect, Open comms, Assess, Communicate, Escalate, Delegate, Resolve.]

Detect the incident
Ideally, monitoring and alerting tools will detect and inform your team about
an incident before your customers even notice. Sometimes, though, you’ll first
learn about an incident from social media or customer support tickets. No
matter how the incident is detected, your first step should be to record that a
new incident is open in a tool for tracking incidents.

The Jira Service Management customer portal captures user-reported incidents in
a complete and consistent manner, with all of the necessary information the
support team needs to evaluate the incident.

When it comes to detecting incidents and outages early, effective monitoring
is the eyes and ears for IT Operations. For system-detected incidents, Jira
Service Management easily integrates with over 200 app and web services,
such as Slack, Datadog, Sumo Logic, and Nagios, to sync alert data and
streamline your workflow.

After an incident is identified, incident managers add information about the
affected services and systems configuration information from Assets for Jira
Service Management. This capability is helpful because you can pinpoint
recent changes on related systems and use the graph to traverse through
dependencies to understand where things have gone wrong.

Additionally, incident managers can review recent changes to the affected
services as well as similar incidents to gather more information as to the
cause of the ongoing incident.

Set up team communication channels
One of the first things the incident manager (IM) does when they come online
is set up the incident team’s communication channels. The goal at this point
is to establish and focus all incident team communications in well-known
places, such as:

· Chat room in Slack or another messaging service.

· Video chat in a conferencing app like Zoom or Microsoft Teams (or if you’re
all in the same place, gather the team in a physical room).

At Atlassian, we prefer using both video chat and text chat tools during
incidents because both excel at different things. Video chat is great for quickly
creating a shared mental picture of the incident through group discussion.
And Slack helps generate a timestamped record of the incident, along with
collected links to screenshots, runbooks, and dashboards.

With Jira Service Management, teams have a centralized place to collaborate,
share real-time information, and fast-track resolution with the incident
command center. Instead of navigating fragmented one-on-one chat updates
or scrolling through long conversation histories, pre-define a video conference
room for teams to chat dynamically, assign roles, and even take decisive
actions right in the interface. By attaching runbooks to alerts, teams can
quickly launch standard remediation tasks, either automatically or on-demand.

Runbooks are great for documenting common troubleshooting methods to
address alerts and resolve outages. By using Confluence for runbooks, your
IT staff has all the information they need to quickly triage an incident, right at
their fingertips.

Assess the impact
After the incident team’s communication channels are set up, it’s time to
assess the incident so the team can decide what to tell people about it and
who needs to fix it.

We have the following set of questions that incident managers ask their teams:

· What is the impact on customers (internal or external)?

· What are customers seeing?

· How many customers are affected (some, all)?

· When did it start?

· How many support cases have customers opened?

· Are there other factors, e.g., social media posts, security, or data loss?

The next step typically is to assign an impact level or value to the incident. For
some teams, usually those involved with DevOps or SRE, the incident impact
is tracked as a severity level.

Incident response severity levels

Severity: SEV1
Description: A critical incident with very high impact
Examples:
· A customer-facing service is unavailable for all users.
· Confidentiality or privacy is breached.
· Customer data loss.

Severity: SEV2
Description: A major incident with significant impact
Examples:
· A customer-facing service is unavailable for some, but not all, customers.
· Core functionality is significantly impacted.

Severity: SEV3
Description: A minor incident with low impact
Examples:
· A minor inconvenience to customers, workaround available.
· Usable performance degradation.

Using a numbering system for severity levels helps define and communicate
the incident quickly. All someone has to say is “We might have a Sev 1
happening,” and the right people can immediately understand the seriousness
of the matter even before getting additional information. This is an example
of Atlassian’s severity definitions for service outages and performance
degradation.

Total service outage (hard down)

Duration      Tier 3            Tier 2            Tier 1    Tier 0
< 5 mins      Not an incident   Not an incident   SEV3      SEV2
5-60 mins     Not an incident   SEV3              SEV2      SEV1
> 60 mins     SEV3              SEV2              SEV1      SEV1

Degraded service

Capability                    Tier 3            Tier 2            Tier 1    Tier 0
Core capability broken        Not an incident   SEV3              SEV2      SEV1
Non-core capability broken    Not an incident   Not an incident   SEV3      SEV2

Core capabilities:
· Log in
· View, create and update widgets

Non-core capabilities:
· Anything besides “core”

Severity levels can also help build guidelines for response expectations. At
some companies, for example, Severity 3 incidents can be addressed during
business hours, while Severity 1 and 2 require paging team members for
an immediate fix. Incident severity definitions should be documented and
consistent throughout the organization.
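
A documented severity definition like the total-outage matrix above can also be encoded as a lookup so that tooling applies it consistently. The following minimal Python sketch is one reading of that matrix, not Atlassian’s implementation: the names `OUTAGE_MATRIX` and `outage_severity` are hypothetical, and `None` stands for "not an incident".

```python
from typing import Optional

# Rows are outage-duration bands; keys inside each row are service tiers
# (Tier 0 is the most critical). None means "not an incident".
OUTAGE_MATRIX = {
    "under_5_min": {3: None, 2: None, 1: "SEV3", 0: "SEV2"},
    "5_to_60_min": {3: None, 2: "SEV3", 1: "SEV2", 0: "SEV1"},
    "over_60_min": {3: "SEV3", 2: "SEV2", 1: "SEV1", 0: "SEV1"},
}


def outage_severity(tier: int, minutes_down: float) -> Optional[str]:
    """Return the severity for a total outage, or None if it is not an incident."""
    if tier not in (0, 1, 2, 3):
        raise ValueError("tier must be 0, 1, 2, or 3")
    if minutes_down < 5:
        band = "under_5_min"
    elif minutes_down <= 60:
        band = "5_to_60_min"
    else:
        band = "over_60_min"
    return OUTAGE_MATRIX[band][tier]


for tier, minutes in [(0, 3), (2, 30), (3, 30), (1, 90)]:
    print(f"Tier {tier}, {minutes} min down:", outage_severity(tier, minutes))
```

Keeping the matrix as data rather than nested conditionals makes it easy to review against the written definition and to update when the definition changes.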

Other teams, typically ITOps, classify incidents in terms of urgency and impact,
and these values are used to calculate an overall priority for the incident.

· Impact measures the effect of an incident on a business’ processes. The
impact is generally based on how your quality of service is affected.

· Urgency is a measure of the time for an incident to significantly impact
your business. For example, a high-impact incident may have low urgency
if the impact will not affect the business until the end of the financial year.

· Priority conveys the severity of an issue so that agents can react
accordingly; it identifies the relative importance of an incident and is
usually based on the impact and urgency of an issue. It helps your agents
prioritize issues and identifies the required time for actions to be taken to
resolve them. You can manually assign priority levels, or create an
impact-urgency priority matrix and use automation to assign priorities
for you.
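
To make the idea of an impact-urgency priority matrix concrete, here is a minimal
Python sketch. The three levels, the priority labels, and the mapping itself are
assumptions chosen for illustration; they are not Jira Service Management defaults,
and your organization should define its own.

```python
# Hypothetical impact-urgency matrix: (impact, urgency) -> priority.
PRIORITY_MATRIX = {
    ("high", "high"): "critical",
    ("high", "medium"): "high",
    ("high", "low"): "medium",
    ("medium", "high"): "high",
    ("medium", "medium"): "medium",
    ("medium", "low"): "low",
    ("low", "high"): "medium",
    ("low", "medium"): "low",
    ("low", "low"): "low",
}


def priority_for(impact: str, urgency: str) -> str:
    """Look up the priority for an incident from its impact and urgency."""
    try:
        return PRIORITY_MATRIX[(impact, urgency)]
    except KeyError:
        raise ValueError(f"unknown impact/urgency pair: {impact!r}, {urgency!r}")
```

With this mapping, the high-impact, low-urgency example from the text comes out as a
medium priority rather than a critical one.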

In Jira Service Management, the severity and priority levels can be associated
with various service level agreements (SLAs). These values, as well as a major
incident indicator, are tracked for all incidents.

Communications with customers
Once a team establishes that the incident is real, it’s best to communicate
to internal and external stakeholders as soon as possible. The goal of
internal communication is to focus the incident response in one place and
reduce confusion. The purpose of external communication is to tell customers
that the team knows something is broken and is looking into it. Communicating
quickly and accurately helps build trust with customers and the rest of the
organization.

Many teams use a centralized dashboard, like Statuspage, to report on the
status of critical services. Statuspage works as a single channel for clear and
proactive mass communication to both internal and external users, along with
automated notifications and updates.

Statuspage keeps internal teams informed of both scheduled and unplanned
downtime as well. Customers and employees can subscribe to updates, which
promotes consistent communication and reduces manual updates.

Tip

In Statuspage, you can create incident communication templates to use as
a starting point. Fields like the incident name and message will be pre-filled
and ready for your quick review before you send the information off to your
customers. This saves time and relieves some of the stress involved when in
the midst of an incident.

Here are two simple templates for updating an internal or external page:

Template        Internal Statuspage                    External Statuspage

Incident name   <Incident issue key> - <Severity> -    Investigating issues with
                <Incident summary>                     <product>

Message         We are investigating an incident       We are investigating issues
                affecting <product x>, <product y>     with <product> and will provide
                and <product z>. We will provide       updates here soon.
                updates via email and Statuspage
                shortly.

To learn more about incident communication templates, visit our tutorial.
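
If you keep templates like these in a script or chat bot, the placeholders can be
filled in programmatically. The following is a minimal Python sketch using the
standard library’s string.Template. It does not call Statuspage; the placeholder
names and the join_products helper are hypothetical.

```python
from string import Template

# Placeholders mirror the angle-bracket fields in the templates above.
INTERNAL_NAME = Template("$issue_key - $severity - $summary")
INTERNAL_MESSAGE = Template(
    "We are investigating an incident affecting $products. "
    "We will provide updates via email and Statuspage shortly."
)
EXTERNAL_NAME = Template("Investigating issues with $product")
EXTERNAL_MESSAGE = Template(
    "We are investigating issues with $product and will provide updates here soon."
)


def join_products(products):
    """Join product names as 'x, y and z' to match the internal message wording."""
    if not products:
        raise ValueError("at least one product is required")
    if len(products) == 1:
        return products[0]
    return ", ".join(products[:-1]) + " and " + products[-1]


internal_name = INTERNAL_NAME.substitute(
    issue_key="INC-123", severity="SEV2", summary="Login errors"
)
internal_message = INTERNAL_MESSAGE.substitute(
    products=join_products(["Product X", "Product Y", "Product Z"])
)
```

Template.substitute raises a KeyError when a placeholder is left unfilled, which
helps catch an incomplete message before it is sent.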

How well are you communicating with customers during an outage?

Are you being transparent with your customers? Are your support teams
starting to see queues fill up with tickets and social media messages?

Good incident response isn’t just about getting services back up quickly –
it’s about being upfront and frequently updating your customers.

If you want to look at how well you’re doing with communications and learn
where you can improve, visit Atlassian Team Playbook – Incident response
communications for free workshop resources.

Escalate to the right responders
Sometimes, the initial responders are the ones who resolve the incident.
More often than not, those responders need to bring other teams into the
incident by paging them using an alerting tool. With Jira Service Management,
responders can take their pick as to what alerting method they use, or even
use them all in one central location.

Alerting tools allow teams to define on-call rosters to create a rotation of
staff who are expected to be reachable during an incident. This is better than
relying on a specific person every time there’s an incident.
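
The core of an on-call rotation is simple enough to sketch. The Python example
below assumes a fixed-length shift (seven days by default) and a roster that
repeats in order; the function name and parameters are hypothetical, and real
alerting tools also handle overrides, time zones, and escalation rules.

```python
from datetime import date


def on_call(roster, rotation_start, day, shift_days=7):
    """Return who is on call on `day` for a simple repeating rotation."""
    if not roster:
        raise ValueError("roster must not be empty")
    if day < rotation_start:
        raise ValueError("day is before the rotation started")
    shifts_elapsed = (day - rotation_start).days // shift_days
    # Wrap around to the start of the roster once everyone has had a shift.
    return roster[shifts_elapsed % len(roster)]


roster = ["ana", "ben", "chi"]  # hypothetical responders
start = date(2024, 1, 1)
```

For example, on_call(roster, start, date(2024, 1, 8)) returns the second person on
the roster, because exactly one seven-day shift has elapsed.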

Delegate incident response roles
After a new incident responder is paged and comes online, the incident
manager delegates a role to them. It’s important that everyone working on an
incident understands what’s required of their role and how to contribute to the
incident team quickly and effectively.

Another advantage to defining roles is that it allows more adaptability and
flexibility. As long as a person knows how to perform a certain role, they can
take that role in any incident. At Atlassian, we train our responders on what
the incident roles are and what they do using a combination of online training,
face-to-face training, documentation, and hands-on “shadowing” experience.

Three fundamental incident response roles

· Incident manager
Each incident is driven by the incident manager, who has overall
responsibility and authority for the incident. The incident manager has the
authority to take any action necessary to resolve the incident, including
paging anyone in the organization and keeping those involved in an
incident focused on restoring service as quickly as possible.

· Tech lead
The tech lead is a senior technical responder. The tech lead’s responsibilities
are to develop theories about what’s broken and why, decide on changes,
and run the technical team. This person works closely with the incident
manager.

· Communications manager
The communications manager is a person familiar with public communications,
possibly from the customer support team or public relations. They are
responsible for writing and sending both internal and external
communications about the incident.

Send follow-up communications
You already sent out initial communications. Once the incident team is rolling,
you have to update staff and customers on the incident, and as the incident
progresses you need to keep them looped in.

Updating customers and staff regularly creates consistent, shared information
about the incident. When something goes wrong, information is often
scarce, especially during the early stages. If you establish a reliable source of
information about what’s happened and how you’re responding, then people
tend to be more objective, which reduces confusion.

Using a communication template makes it easier and more reliable to
produce consumable information during incidents. In an emergency, easy-to-
understand information is important.

Before sending, we review the communications for completeness using
this checklist:

· Have we described the actual impact on customers?

· Did we say how many internal and external customers are affected?

· If the root cause is known, what is it?

· If there is an ETA for restoration, what is it?

· When and where will the next update be?
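
The checklist above can also be expressed as a small pre-send check. This is a
minimal Python sketch under our own assumptions: the IncidentUpdate class, its
field names, and the missing_items function are hypothetical, and the root cause
and ETA are treated as optional because the checklist only asks for them when
they are known.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class IncidentUpdate:
    customer_impact: str = ""           # actual impact on customers
    customers_affected: str = ""        # how many internal and external customers
    next_update: str = ""               # when and where the next update will be
    root_cause: Optional[str] = None    # only if known
    restoration_eta: Optional[str] = None  # only if there is an ETA


def missing_items(update: IncidentUpdate) -> List[str]:
    """Return the always-required checklist items that are still empty."""
    required = {
        "customer impact": update.customer_impact,
        "customers affected": update.customers_affected,
        "next update": update.next_update,
    }
    return [label for label, value in required.items() if not value.strip()]


draft = IncidentUpdate(customer_impact="Users cannot log in")
```

Calling missing_items(draft) lists what still needs to be written before the update
goes out.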

Jira Service Management gives you the ability to add people as stakeholders
and update them by sending email messages.

Iterate incident analysis and recovery
There’s no single prescriptive process that will resolve all incidents – if there
were, we’d simply automate that and be done with it. Instead, we iterate on
the following process to quickly adapt to various incident response scenarios:

· Observe what’s going on. Share and confirm observations.

· Develop theories about why it’s happening.

· Develop experiments that prove or disprove those theories. Carry those out.

· Repeat

For example, you might observe a high error rate in a service corresponding
with a fault that your regional infrastructure provider has posted on their
Statuspage. You might theorize that the fault is isolated to this region, decide
to fail over to another region, and observe the results. Process aficionados will
recognize this as a generalization of the Deming “Plan-Do-Check-Act” cycle,
the “Observe-Orient-Decide-Act” cycle, or simply the scientific method.

The biggest challenges for the incident manager at this point are around
maintaining the team’s discipline:

· Is the team communicating effectively?

· What are the current observations, theories, and streams of work?

· Are we making decisions effectively?

· Are we making changes intentionally and carefully? Do we know what
changes we’re making?

· Are roles clear? Are people doing their jobs? Do we need to escalate to
more teams?

In any case, don’t panic – it doesn’t help. Stay calm, and the rest of the team
will take that cue.

The incident manager has to keep an eye on team fatigue and plan team
handovers. A dedicated team can risk burning themselves out when resolving
complex incidents – incident managers should look out for how long members
have been awake and how long they’ve been working on the incident,
and decide who’s going to fill their roles next.

Resolve the incident
An incident is resolved when the current or imminent business impact has
ended. At that point, the emergency response process ends and the team
moves on to any cleanup tasks and the post-incident review.

We send final internal and external communications when the incident is
resolved. The internal communications have a recap of the incident’s impact
and duration, such as how many support cases were raised and other
important incident dimensions. They should also clearly state that the incident
is resolved and there will be no further communications about it. The external
communications are usually brief, telling customers that service has been
restored and the team will follow up with a post-incident review.

There are many moving parts to the incident response process. Keeping
track of each step with seamless communication is easy with an incident
management tool like Jira Service Management. Centralized alerts, flexible
communications channels, and unified work tracking are central to resolving
incidents quickly.

03
Post-incident reviews
The importance of a post-incident review
Incidents happen. They just do. As our systems grow in scale and complexity,
failures are inevitable.

Incidents are also a learning opportunity. As Dave Rensin, Senior Director at
Google, framed it, “Incidents are unplanned investments in the reliability of
your service.”

A chance to uncover vulnerabilities in your system. An opportunity to mitigate
repeat incidents and decrease time to resolution. A time to bring your teams
together and plan for how they can be even better next time.

The best way to work through what happened during an incident and capture
any lessons learned is by conducting a post-incident review, also known as an
incident postmortem.

A post-incident review (PIR) brings people together to discuss the details of an
incident: why it happened, its impact, what actions were taken to mitigate it
and resolve it, and what should be done to prevent it from happening again.

Thanks to tools like version control, feature flags, and continuous delivery, a lot
of incidents can be quickly “undone.” Many incidents are caused by some bug
in a change pushed to production, and rolling back that change can get the app
up and running again. This benefits everyone because it gets the service
working again quickly. But it often doesn’t help you understand what failed and
why. This is where post-incident reviews come in.

A post-incident review is a framework for learning from incidents and turning
problems into progress. It also builds trust with customers, colleagues, and
end users (basically the folks affected by the incident) and lets them know your
team is working to minimize future incidents and impact.

[Figure: lifecycle loop with the stages Plan, Build, Deploy, Incident, and
Post-incident review]

A post-incident review is an essential step in the lifecycle of an always-on
service. The findings from your review should feed right back into your
planning process. This closed process ensures that the critical remediation
work identified in the post-incident review finds a place in upcoming work
and is balanced against other upcoming work and priorities.

WHAT IS A POST-INCIDENT REVIEW?

A post-incident review is a written record of an incident that describes:

· The incident’s impact.

· The actions taken to mitigate or resolve the incident.

· The incident’s causes.

· Follow-up actions taken to prevent the incident from happening again.

In Jira Service Management, PIRs are a work category that you can link to
the primary incident, subtasks, Jira Software issues, and so on, so that all
critical actions that occurred during the incident are documented in a timeline
and related Jira tickets are included for reference and remediation.

Best practices for a post-incident review
How you approach your post-incident review is just as crucial as the checklist
of steps you take. Tensions can run high in the wake of an incident. The key to
getting people to come to the process engaged and ready to tackle a complex
problem is to give them a sense of psychological safety.

Establish a blameless culture
Former Etsy CTO John Allspaw wrote a seminal piece on “blameless
postmortems.” This approach to the investigation of an incident allows the
people involved in an incident to account for all their actions, their impact, and
what they knew and when without fear of punishment or retribution.

This approach is essential to ensuring your teams openly share information
and get to the root cause of an incident. If anyone fears rebuke, they may hold
back information or try to redirect blame. When this happens, people lose trust
in each other. And the organization loses the opportunity to build resiliency in
its teams and systems. Many teams, including here at Atlassian and at Google,
have adopted the tenets of the blameless postmortem in order to avoid
those pitfalls.

Avoid pointing fingers, and keep critiques constructive
In your review meeting – and in the subsequent write-up of the findings –
avoid language that singles out individuals as personally responsible for the
incident. Instead, focus on actions, results, and impact.

While it’s important to keep the conversation safe and objective, getting to
the root cause of the incident is critical to resolving it. You can use a technique
in your meeting called “The Five Whys.” Start by making sure everyone agrees
on what the problem is. Then, ask why this happened, and then ask “why” to
the answer to that question. Repeat this process at least five times to ensure
you uncover all the deep factors contributing to the problem. Make sure the
room doesn’t try to steer away from an uncomfortable truth or try to reach
an easy consensus. You can learn more about “The Five Whys” approach with
Atlassian’s playbook.

Review every single post-incident report, and ingrain
lessons learned into your process
An unreviewed post-incident report might as well never have been written.
Once a post-incident report is drafted, it’s important to review it to close out
any unresolved issues, capture ideas to consider in the future, and finalize
the report. You may even say that the incident isn’t truly resolved until this
analysis has taken place.

How do you make this happen? Schedule a recurring meeting with engineering
(and anyone else who may have an interest, like customer support or account
managers), at least monthly, to review post-incident reports. You can choose
to review recent reports or perhaps review older reports and share lessons
that are still relevant today.

Usually, the team that delivers the service that caused the incident is
responsible for completing the associated post-incident review. They nominate
one person to be accountable for completing the review, and the issue is
assigned to them. They are the “PIR owner” and they drive the review through
drafting and approval, all the way until it’s published.

Infrastructure and platform-level incidents often impact a cross-section of
the company, making their post-incident reviews more complicated and
effort-intensive. For this reason, we sometimes assign a dedicated program
manager to own infrastructure or platform-level PIRs because program
managers are better suited to working across groups and are able to commit
the requisite level of effort.

INCIDENT CAUSE CATEGORIES

At Atlassian, we find it helpful to group the causes of incidents into
categories. This classification helps point us in the right direction when
deciding on mitigations and lets us analyze incident trends. For example,
seeing many scale-related incidents in one group might prompt us to
compare scaling strategies with other groups.

Category       Definition                   What should you do about it?

Bug            A change to code made by     Test. Canary. Do incremental rollouts
               Atlassian (this is a         and watch them. Use feature flags.
               specific type of change)

Change         A change made by Atlassian   Improve the way you make changes,
               (other than changes to       for example, your change reviews
               code, which are bugs)        or change management processes.
                                            Everything next to “bug” also
                                            applies here.

Architecture   Design misalignment with     Review your design. Do you need to
               operational conditions       change platforms?

Scale          Failure to scale (e.g.,      What are your service’s resource
               blind to resource            constraints? Are they monitored and
               constraints, or lack of      alerted? If you don’t have a capacity
               capacity planning)           plan, make one. If you do have one,
                                            what new constraint do you need to
                                            factor in?

Unknown        Indeterminable (action is    Improve your system’s observability
               to increase the ability to   by adding logging, monitoring,
               diagnose)                    debugging, and similar things.

The categories we use are tailored to our own business as a software company.
You may find that different categories work better for your business.
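
Once causes are categorized, spotting trends such as “many scale-related incidents
in one group” is a matter of counting. The following minimal Python sketch shows
one way to do it; the record layout, group names, and sample data are hypothetical
and would normally come from your incident tracker.

```python
from collections import Counter, defaultdict


def causes_by_group(incidents):
    """Count incident cause categories per owning group."""
    counts = defaultdict(Counter)
    for incident in incidents:
        counts[incident["group"]][incident["cause"]] += 1
    return counts


# Hypothetical sample records.
incidents = [
    {"group": "payments", "cause": "Scale"},
    {"group": "payments", "cause": "Scale"},
    {"group": "payments", "cause": "Bug"},
    {"group": "search", "cause": "Change"},
]

trend = causes_by_group(incidents)
top_cause, top_count = trend["payments"].most_common(1)[0]
```

Here the most common cause for the payments group is “Scale”, which is the kind
of signal that might prompt a comparison of scaling strategies with other groups.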

POST-INCIDENT REVIEW ACTIONS

Sue Lueder and Betsy Beyer from Google have an excellent presentation
and article on postmortem action items, which we use at Atlassian to
prompt the team. This section references their suggestions. Work through
the questions below to help ensure the postmortem covers both short- and
long-term fixes.

Category            Question to ask               Examples

Investigate this    “What happened to cause       logs analysis, diagramming
incident            this incident and why?”       the request path, reviewing
                    Determining the root          heap dumps
                    causes is your ultimate
                    goal.

Mitigate this       “What immediate actions       rolling back, cherry-picking,
incident            did we take to resolve        pushing configs,
                    and manage this specific      communicating with
                    event?”                       affected users

Repair damage       “How did we resolve           restoring data, fixing
from this           immediate or collateral       machines, removing traffic
incident            damage from this              re-routes
                    incident?”

Detect future       “How can we decrease the      monitoring, alerting,
incidents           time to accurately detect     plausibility checks on
                    a similar failure?”           input/output

“Mitigate future incidents” and “Prevent future incidents” are your most likely
sources of actions that address the root cause. Be sure to get at least one of
these.

We also use Lueder and Beyer’s advice on wording for our post-incident
review actions:

The right wording for a PIR action can make the difference between an easy
completion and an indefinite delay due to infeasibility or procrastination.
A well-crafted PIR action should have these properties:

· Actionable – Phrase each action as a sentence starting with a verb. The
action should result in a useful outcome, not a process. For example,
“Enumerate the list of critical dependencies” is a good action, while
“Investigate dependencies” is not.

· Specific – Define each action’s scope as narrowly as possible, clarifying
what is and what is not included in the work.

· Bounded – Word each action to indicate how to tell when it is finished, as
opposed to leaving the action open-ended or ongoing.

From...                                 To...

Investigate monitoring for this         (Actionable) Add alerting for all cases
scenario.                               where this service returns >1% errors.

Fix the issue that caused the outage.   (Specific) Handle invalid postal code in
                                        user address form input safely.

Make sure engineer checks that          (Bounded) Add automated pre-submit
database schema can be parsed           check for schema changes.
before updating.

Visit Postmortem Action Items: Plan the Work and Work the Plan for the full
source material.

An effective post-incident review plan
For post-incident reviews to be effective – and allow you to build a culture of
continuous improvement – you want to implement a simple, repeatable process
in which everyone can participate. How you do this will depend on your culture
and your team. At Atlassian, we’ve developed the following approach that
works for us.

Here are some tips to get started.

Set a threshold
Incidents in your organization should have clear and measurable severity levels.
These severity levels can trigger the post-incident review process. For example,
any incident Sev-1 or higher triggers the PIR process, while the review can be
optional for less severe incidents. Consider allowing team leads or management
to request a review for any incident that doesn’t meet the threshold.

TIP

Post-incident reviews can also be created using Jira Service Management’s
native automation engine. For example, you can set an automation rule to
create a post-incident review each time a major or critical priority incident
is resolved by your team.

To learn more about automation in Jira Service Management, visit our key
concepts guide.

Don’t procrastinate
It’s important to take a break and get some rest after an incident. But don’t
delay writing the post-incident review. Wait too long, and important details
might be lost or forgotten. Ideally, it’s drafted immediately after a post-incident
review meeting. If possible, the team should hold a review meeting within
24-48 hours of the incident resolution, and no later than five business days
after it.

Assign roles and owners
A post-incident review meeting is where you’ll hash out the details that will
be recorded in the PIR report. It’s good to delegate the PIR draft to a specific
person, preferably someone familiar with the incident, and who has the
required level of technical and organizational knowledge to understand the
causes and mitigations.

Work from a template
A template can keep you from leaving out fundamental details. And it’s a great
way to build consistency throughout your post-incident review process.

Include a timeline
A timeline is a very helpful aid in incident documentation. Often it’s the
first place your readers’ eyes jump to when trying to quickly size up what
happened. Try to be as clear and specific as possible. For example, “11:14 am
Pacific Standard Time,” not “around 11.” Being specific with timestamps allows
you to map out a high-fidelity chain of events, which is useful to identify areas
of improvement. For example, you might identify that the interval between
when impact started and when customers were notified was too long.

Important times to include:

· First alert or ticket

· First comms announcement (internal and external)

· Times of incident status page/notification updates

· Time of any remediation attempts (code rollbacks, etc.)

· Time of resolution
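
Precise, time-zone-aware timestamps make it straightforward to compute the
intervals a timeline is meant to expose. Below is a minimal Python sketch; the
event names and times are invented for illustration, and PST is modeled as a
fixed UTC-8 offset.

```python
from datetime import datetime, timedelta, timezone

# Fixed offset for Pacific Standard Time (UTC-8), for illustration only.
PST = timezone(timedelta(hours=-8), "PST")

# Hypothetical timeline for a single incident.
timeline = {
    "impact_started": datetime(2024, 1, 15, 11, 2, tzinfo=PST),
    "first_alert": datetime(2024, 1, 15, 11, 14, tzinfo=PST),
    "customers_notified": datetime(2024, 1, 15, 11, 41, tzinfo=PST),
    "resolved": datetime(2024, 1, 15, 12, 30, tzinfo=PST),
}


def minutes_between(timeline, start_event, end_event):
    """Return the number of minutes between two named timeline events."""
    delta = timeline[end_event] - timeline[start_event]
    return delta.total_seconds() / 60


notify_delay = minutes_between(timeline, "impact_started", "customers_notified")
```

In this invented example, customers were notified 39 minutes after impact started,
which is the kind of gap a review might flag as too long.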

Details, details, details
Skimping on details is a quick path to writing post-incident reviews that are
unhelpful and unclear. Add as many details as possible about what happened
and what was done during the incident. Instead of “then public comms went
out,” say “We sent the initial public comms announcing the incident on our
public status page and social platform account.”

Wherever possible, include links and names, links to tickets and status
updates, links to incident state documents, and monitoring charts. Don’t be
afraid to add screenshots of relevant graphics or dashboards. A graph from
your monitoring system that clearly shows the incident’s start and end times
(for example, a drop in request rate followed by a return to normal) is very
valuable because it’s unambiguous. It becomes even more powerful when
combined with graphs that show what was happening behind the scenes
during that time, for example, database connections, network link state, or
CPU/memory/IO/bandwidth consumption over the same timeframe.

Capture incident metrics
When you capture metrics in your post-incident review, you apply hard data
to the issues and their impact. Having these data points helps you determine
if your team is heading in the right direction and reducing the number of
incidents, their severity, and downtime. With consistent metrics being
measured, you can take a step back and look at incident trends over time.

Some metrics to consider in your post-incident review tracking:

· The number of minutes of downtime, so you can track if this number is
going up or down.

· The severity of the incident so that you can determine the relative
reliability of your systems.

· Mean Time to Resolution (MTTR) measures the average time it takes to
resolve an incident from when it was initially reported.
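
As a small worked example, total downtime and MTTR can be computed from
reported and resolved timestamps. This Python sketch uses invented data and
assumes, for simplicity, that downtime runs from the time an incident was
reported until it was resolved; your own definition of downtime may differ.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incidents as (reported, resolved) pairs.
incidents = [
    (datetime(2024, 3, 1, 9, 0), datetime(2024, 3, 1, 9, 30)),
    (datetime(2024, 3, 8, 14, 0), datetime(2024, 3, 8, 15, 30)),
    (datetime(2024, 3, 20, 22, 15), datetime(2024, 3, 20, 23, 0)),
]


def minutes_to_resolve(reported, resolved):
    """Minutes from when an incident was reported until it was resolved."""
    return (resolved - reported).total_seconds() / 60


durations = [minutes_to_resolve(reported, resolved) for reported, resolved in incidents]
total_downtime_minutes = sum(durations)   # 30 + 90 + 45
mttr_minutes = mean(durations)            # average time to resolution
```

Tracking both values per month or quarter shows whether downtime and resolution
times are trending up or down.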

Use checklists and templates to streamline the process
To ensure that your team develops a culture around post-incident reviews,
make it easy to capture information, schedule meetings, and publish the final
report with reusable checklists and templates. A repeatable process provides
consistency for teams, helps people know what to expect, and encourages
participants to engage in the PIR with a productive mindset.

Below are typical checklist items for a PIR process.

Meetings that need to be held:

· Information gathering meeting

· Review of report

· Presentation of report

Information that needs to be gathered ahead of time:

· Standard agendas for each meeting

· Participants, stakeholders, reviewers

· A standard template for writing the PIR report

The most important tip? Don’t skip any steps. The key to conducting post-
incident reviews that help you improve your team and systems is to have a
process and stick to it.

04
Incident management analytics
How to choose incident management
key performance indicators and metrics
In today’s always-on world, tech incidents come with significant consequences.

A recent industry report found that downtime costs continue to rise, with
the majority of outages costing at least $100,000, and the price is increasing
(39% from 2019 to 2022). But the monetary impact is far from the only cost to
businesses. CIOs indicate that incidents result in lower customer satisfaction,
data loss, loss of reputation, and SLA payouts. Additional research found that
the number of outages is increasing, and it is taking longer for businesses to
recover from them, with MTTR (Mean Time To Repair) rising by almost two
hours since 2020.

[Callouts: ≥$100K cost of most major outages · 39% increase in the cost of
outages in recent years · +2 hrs increase in MTTR since 2020]

Evidence points to incidents causing significant financial and reputational
impact across all business sizes (enterprises as well as SMBs), industries, and
global regions.

With so much at stake, it’s more important than ever for teams to track
incident management data and use their findings to detect, diagnose, fix, and –
ultimately – prevent incidents.

The good news is that with web and software incidents (unlike mechanical
and offline systems), teams usually are able to capture a lot more data to help
them understand and improve.

The bad news? Sometimes too much data can obscure issues instead of
illuminating them.



A WORD OF CAUTION ABOUT INCIDENT ANALYTICS

The downside to KPIs is that it’s easy to become too reliant on shallow
data. Knowing that your team isn’t resolving incidents fast enough won’t get
you to a fix. Because you still need to know how and why the team is or isn’t
resolving issues. And you still need to know if the issues you’re comparing
are actually comparable.

KPIs can’t tell you how your teams approach tricky problems. They can’t
explain why your time between incidents has been getting shorter instead
of longer. They don’t know why Incident A took three times as long as
Incident B.

For that, you need insights. And while the data can be a starting point on
the way to those insights, it can also be a stumbling block. It can make us
feel like we’re doing enough even if our metrics aren’t improving. It can lump
together incidents that are actually dramatically different and should be
approached differently. It can discount the experience of your teams and the
underlying complication of incidents themselves.

The point isn’t that KPIs are bad. We don’t think you should throw the baby
out with the bathwater. The point is that KPIs aren’t enough. They’re a
starting point. They’re a diagnostic tool. They’re the first step down a more
complex path to actual improvement.

“Incidents are much more unique than conventional wisdom would have you
believe. Two incidents of the same length can have dramatically different
levels of surprise and uncertainty in how people came to understand what was
happening. They can also contain wildly different risks with respect to taking
actions that are meant to mitigate or improve the situation. Incidents are not
widgets being manufactured, where limited variation in physical dimensions is
seen as key markers of quality.”

JOHN ALLSPAW, MOVING PAST SHALLOW INCIDENT DATA



The value of incident key performance
indicators, metrics, and analytics
KPIs (Key Performance Indicators) are metrics that help businesses determine
whether they’re meeting specific goals. For incident management, these
metrics could be number of incidents, average time to resolve, or average time
between incidents.

Tracking KPIs for incident management can help identify and diagnose
problems with processes and systems, set benchmarks and realistic goals for
the team to work toward, and provide a jumping-off point for larger questions.

For example, let’s say the business’ goal is to resolve all incidents within
30 minutes, but your team is currently averaging 45 minutes. Without specific
metrics, it’s hard to know what’s going wrong. Is your alert system taking too
long? Is your process broken? Do your diagnostic tools need to be updated?
Is it a team problem or a tech problem?

Now, add some metrics: If you know exactly how long the alert system is
taking, you can identify it as a problem or rule it out. If you see that diagnostics
are taking up more than 50% of the time, you can focus your troubleshooting
there. If you see that Team B is taking 25% more time than Teams A, C, and D,
you can start to dig into why.

KPIs won’t automatically fix your problems, but they will help you understand
where the problem lies and focus your energy on digging deeper in the right
places.
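
To show how metrics like these narrow down a problem, here is a minimal Python
sketch that breaks resolution time into phases for each team. All numbers, phase
names, and function names are invented for illustration; they were chosen so that
Team B spends most of its time in diagnosis and takes 25% longer than its peers.

```python
from statistics import mean

# Hypothetical average minutes per phase for each team.
team_minutes = {
    "A": {"alerting": 5, "diagnosis": 15, "repairs": 20},
    "B": {"alerting": 5, "diagnosis": 30, "repairs": 15},
    "C": {"alerting": 5, "diagnosis": 15, "repairs": 20},
    "D": {"alerting": 5, "diagnosis": 15, "repairs": 20},
}


def phase_share(phases):
    """Return each phase's share of the total time, as a fraction of 1."""
    total = sum(phases.values())
    return {phase: minutes / total for phase, minutes in phases.items()}


def ratio_to_peers(team_minutes, team):
    """Return a team's total time divided by the average total of the other teams."""
    own_total = sum(team_minutes[team].values())
    peer_totals = [
        sum(phases.values()) for name, phases in team_minutes.items() if name != team
    ]
    return own_total / mean(peer_totals)


team_b_diagnosis_share = phase_share(team_minutes["B"])["diagnosis"]
team_b_ratio = ratio_to_peers(team_minutes, "B")
```

A diagnosis share above 0.5 or a ratio well above 1.0 does not explain the cause,
but it tells you where to start asking questions.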

[Figure: Mean time to respond. Timelines from “Alert!” to “System back up!” for
Teams A, B, C, and D, each split into DIAGNOSIS and REPAIRS. Team B takes
significantly longer. Time to dig into why.]



Useful incident key performance indicators
and metrics
Alerts created
If you’re using an alerting tool, it’s helpful to know how many alerts are
generated in a given time period. Using a solution like Jira Service Management,
you can send alerts and spin up reports and dashboards to track them.

Watch for periods with significant, uncharacteristic increases or decreases or
upward-trending numbers, and when you see them, dig deeper into why those
changes are happening and how your teams are addressing them.

Incidents over time
Tracking incidents over time means looking at the average number of
incidents per period. The period can be weekly, monthly, quarterly,
yearly, or even daily.

Are incidents happening more or less frequently over time? Is the number of
incidents acceptable, or could it be lower? Once you identify a problem with
the number of incidents, you can ask why that number is trending upward or
staying high and what the team can do to resolve the issue.
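
Counting incidents per period is easy to automate. The sketch below groups
invented incident dates by calendar month using Python’s standard library; the
function name and sample dates are hypothetical, and the same approach works
for weeks or quarters with a different grouping key.

```python
from collections import Counter
from datetime import date


def incidents_per_month(incident_dates):
    """Count incidents per calendar month, keyed as 'YYYY-MM'."""
    return Counter(d.strftime("%Y-%m") for d in incident_dates)


# Hypothetical incident start dates.
incident_dates = [
    date(2024, 1, 3),
    date(2024, 1, 19),
    date(2024, 2, 7),
    date(2024, 3, 2),
    date(2024, 3, 11),
    date(2024, 3, 28),
]

monthly = incidents_per_month(incident_dates)
average_per_month = sum(monthly.values()) / len(monthly)
```

In this sample the count rises from two incidents in January to three in March,
which would be a prompt to ask why the number is trending upward.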

MTBF
MTBF (mean time between failures) is the average time between repairable
failures of a tech product. It can help you track availability and reliability across
products. The higher the time between failures, the more reliable the system.

As with other metrics, it’s a good jumping-off point for more extensive
questions. If your MTBF is lower than you want, it’s time to ask why the
systems are failing so often and how you can reduce or prevent future failures.
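
One common way to estimate MTBF is to divide the total operating time (the
observation window minus time spent down) by the number of failures in that
window. The Python sketch below follows that approach; the function name and
sample numbers are hypothetical.

```python
def mtbf_hours(window_hours, outage_hours):
    """Estimate MTBF as operating time divided by the number of failures.

    window_hours: length of the observation window, in hours.
    outage_hours: duration of each failure in that window, in hours.
    """
    if not outage_hours:
        raise ValueError("MTBF is undefined when there were no failures")
    operating_hours = window_hours - sum(outage_hours)
    return operating_hours / len(outage_hours)


# Hypothetical 30-day window (720 hours) with three outages.
example_mtbf = mtbf_hours(720, [2, 4, 6])
```

Here the system ran for 708 of 720 hours and failed three times, giving an MTBF
of 236 hours.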

MTTA
MTTA (mean time to acknowledge) is the average time it takes between
a system alert and when a team member acknowledges the incident
and begins working to resolve it. The value here is in understanding how
responsive your team is to issues.



Once you know there’s a responsiveness problem, you can again start to dig
deeper. Why is your MTTA high? Are teams overburdened? Distracted? Is it
unclear whose responsibility an alert is? MTTA can help you identify a problem,
and questions like these can help you get to the heart of it.
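
To make the definition concrete, here is a small sketch that averages the gap between alert creation and acknowledgement. The input shape is an assumption: a list of (created_at, acknowledged_at) pairs, with `None` for alerts nobody has acknowledged yet.

```python
from datetime import datetime, timedelta


def mean_time_to_acknowledge(alerts):
    """Average time from alert creation to acknowledgement.

    Alerts that were never acknowledged are skipped here, so track them
    separately. Returns None if no alert has been acknowledged.
    """
    waits = [acked - created for created, acked in alerts if acked is not None]
    if not waits:
        return None
    return sum(waits, timedelta()) / len(waits)
```

Because unacknowledged alerts are left out of the average, a low MTTA alongside many unacknowledged alerts still points to a responsiveness problem.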

MTTD
MTTD (mean time to detect) is the average time it takes your team to discover
an issue. This term is often used in cybersecurity when teams are focused on
detecting attacks and breaches.

If this metric changes drastically or isn’t quite hitting the mark, it’s, yet again,
time to ask why.

MTTR
MTTR can stand for mean time to repair, mean time to resolve, mean time to
respond, or mean time to recovery. Arguably, the most useful of these metrics
is the mean time to resolve, which tracks the time spent diagnosing and fixing
an immediate problem and the time spent ensuring the issue doesn’t happen
again. Mean time to recovery is a primary DevOps metric that DevOps Research
and Assessment (DORA) notes is key to measuring the stability of a DevOps team.

Again, this metric is best when used diagnostically. Are your resolution times as
quick and efficient as you want them to be? If not, it’s time to ask more profound
questions about how and why said resolution time is missing the mark.



MTBF VS. MTTR VS. MTTF VS. MTTA

So, which measurement is better when it comes to tracking and improving
incident management? The answer is all of them.

Although they are sometimes used interchangeably, each metric provides a
different insight. When used together, they can tell a more complete story
about how successful your team is with incident management and where
the team can improve.

[Figure: Timeline covering an outage incident, a period when all is well, and
a product failure. Events along the time axis: outage begins; devs see an
alert; repair begins; repair complete and system is restored; team finds a
fix to prevent future outages; product fails completely and needs to be
replaced. Spans marked on the timeline: MTTRespond, MTTRepair, MTTRecovery,
MTTResolve, MTBF, and MTTF.]

· Mean time to recovery tells you how quickly you can get your systems back
  up and running.
· Layer in mean time to respond and you get a sense for how much of the
  recovery time belongs to the team and how much is your alert system.
· Further layer in mean time to repair and you start to see how much time
  the team is spending on repairs vs. diagnostics.
· Add mean time to resolve to the mix and you start to understand the full
  scope of fixing and resolving issues beyond the actual downtime they cause.
· Fold in mean time between failures, and the picture gets even more
  extensive, showing you how successful your team is at preventing or
  reducing future issues.

And then add mean time to failure to understand the entire lifecycle of a
product or system.
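
The layering described above can be sketched by recording a few timestamps per incident and averaging different spans between them. The field names below are hypothetical, and the start and end points chosen for each variant are one reading of the definitions in this chapter, so adjust them to match how your team defines each metric.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    outage_began: datetime
    alert_seen: datetime
    repair_began: datetime
    system_restored: datetime
    prevention_fix_done: datetime


def mean_span(incidents, start_field, end_field):
    """Average time between two recorded timestamps across incidents."""
    spans = [getattr(i, end_field) - getattr(i, start_field) for i in incidents]
    return sum(spans, timedelta()) / len(spans)


def mttr_breakdown(incidents):
    """Assumed spans: recovery = outage to restore, respond = alert to
    restore, repair = repair start to restore, resolve = outage to
    preventive fix."""
    return {
        "recovery": mean_span(incidents, "outage_began", "system_restored"),
        "respond": mean_span(incidents, "alert_seen", "system_restored"),
        "repair": mean_span(incidents, "repair_began", "system_restored"),
        "resolve": mean_span(incidents, "outage_began", "prevention_fix_done"),
    }
```

Under these assumptions, recovery minus respond approximates the lag in your alert system, and respond minus repair approximates the time spent on diagnosis.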



On-call time
If you have an on-call rotation, tracking how much time employees and
contractors spend on-call can be helpful. This metric can help you make sure
no one employee or team is overburdened.

Using Jira Service Management, you can generate comprehensive reports to
see these figures at a glance.
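
If you keep shift data outside a reporting tool, a sketch like the following totals on-call hours per person and flags anyone well above the team average. The (person, start, end) input shape and the 1.5x ratio are assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime


def on_call_hours(shifts):
    """Total on-call hours per person from (person, start, end) shifts."""
    totals = defaultdict(float)
    for person, start, end in shifts:
        totals[person] += (end - start).total_seconds() / 3600
    return dict(totals)


def overburdened(totals, ratio=1.5):
    """People whose hours exceed `ratio` times the team average
    (a hypothetical threshold)."""
    average = sum(totals.values()) / len(totals)
    return sorted(p for p, hours in totals.items() if hours > ratio * average)
```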

SLA
An SLA (service level agreement) is an agreement between a provider
and client about measurable metrics like uptime, responsiveness, and
responsibilities.

The promises made in SLAs (about uptime, mean time to recovery, etc.) are
one of the reasons incident management teams need to track these metrics.
If and when things like average response time or mean time between failures
change, contracts need to be updated, and fixes need to happen – quickly.

SLO
An SLO (service level objective) is an agreement within an SLA about a specific
metric like uptime. As with the SLA, SLOs are essential metrics to track to
ensure the company upholds its end of the bargain regarding customer service.
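
As a sketch of how tracking objectives might look in practice, the snippet below compares measured values against a set of targets and lists the ones that were missed. The metric names and targets are invented for illustration; real objectives come from your own agreements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Objective:
    metric: str
    target: float
    higher_is_better: bool


def missed_objectives(objectives, measured):
    """Return the names of metrics whose measured value misses its target."""
    missed = []
    for objective in objectives:
        value = measured[objective.metric]
        if objective.higher_is_better:
            met = value >= objective.target
        else:
            met = value <= objective.target
        if not met:
            missed.append(objective.metric)
    return missed
```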

Timestamps (or timeline)


A timestamp is encoded information about what happened at specific times
during, before, or after the incident. This information isn’t typically considered
a metric, but it’s essential data when assessing your incident management
health and developing strategies to improve.

Timestamps help teams build out timelines of the incident, along with the
lead-up and response efforts. A clear, shared timeline is one of the most
valuable artifacts during a post-incident review.
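
A shared timeline can be assembled from timestamped events gathered from different sources. The sketch below is one hypothetical way to do it: sort the events and label each with its offset from the incident start, so the lead-up shows as negative minutes.

```python
from datetime import datetime


def build_timeline(events, incident_start):
    """Order (timestamp, description) events by time and label each one
    with its offset in minutes from the start of the incident."""
    lines = []
    for timestamp, description in sorted(events, key=lambda event: event[0]):
        offset = int((timestamp - incident_start).total_seconds() // 60)
        lines.append(f"{timestamp:%H:%M} (T{offset:+d} min) {description}")
    return lines
```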

Uptime
Uptime is the amount of time (represented as a percentage) that your
systems are available and functional.

The increasing connectivity of online services and increasing complexity
of the systems themselves means there’s typically no such thing as 100%
guaranteed uptime. The goal for most products is high availability – having
a system or product that’s operational without interruption for long periods
of time. Industry standard says 99.9% uptime is very good and 99.99% is
excellent.

Tracking your success against this metric is all about making and keeping
customer promises. And, as with other metrics, it’s just a starting point. If
your uptime isn’t at 99.99%, the question of why will require more research,
conversations with your team, and investigation into process, structure,
access, or technology.
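
To see what these percentages mean in practice, the following sketch converts between downtime and uptime percentage. The function names are hypothetical; the arithmetic is simply the share of a period during which the system was available.

```python
from datetime import timedelta


def uptime_percent(period: timedelta, downtime: timedelta) -> float:
    """Share of the period, as a percentage, during which the system was up."""
    return 100 * (1 - downtime / period)


def allowed_downtime(period: timedelta, target_percent: float) -> timedelta:
    """Downtime that a target such as 99.9 leaves room for over the period."""
    return period * (1 - target_percent / 100)
```

Over a 30-day period, a 99.9% target leaves room for roughly 43 minutes of downtime, while 99.99% leaves a little over 4 minutes.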

Jira Service Management offers report and dashboard features so your team
can track KPIs and monitor and optimize your incident management practice.



05
Good practices for modern
incident management

GOOD PRACTICES FOR MODERN INCIDENT MANAGEMENT 61


The challenges facing modern IT incident
management
Disconnected processes and technologies
A side effect of 40 years of computing innovation is that many companies now
operate an eclectic mix of applications and systems. Some applications live in
their own data centers where they can be intimately controlled, while others
are delivered on the cloud and managed by third-party providers.

This collection of applications, services, and systems often results in a
tenuously connected patchwork of solutions and processes for logging,
monitoring, and alerting. It’s not uncommon for enterprises to use dozens of
monitoring tools to track thousands of application events or alerts each day.

This patchwork approach can lead to an overwhelming volume of alerts, a
breakdown in communication, a lack of clear priority for on-call employees,
and a situation where a failure in one stage of this patchwork process can take
down the whole thing.

An overwhelming volume of alerts/incidents


Many IT operations departments funnel alerts into email boxes to counteract
their volume problem. But this just makes matters worse, creating a situation
where email requires 24/7 monitoring by senior-level staff responsible for
prioritizing incidents and escalating critical messages.

This never-ending stream of alerts can be overwhelming; 40% of IT
organizations are flooded by more than one million event alerts each day,
while 11% are swamped by more than ten million alerts. This constant barrage
of noise can lead to alert fatigue, burnout, work dissatisfaction, anxiety, and
longer response times. It affects employee well-being and productivity in the
workplace, which directly affects the business’s bottom line.

Rising operations costs


While infrastructure costs have declined, operations costs have risen – driven,
in part, by the complexity of debugging issues when you don’t control the
entire system. And as software continues to become more complex in the
years ahead, challenges and costs associated with debugging applications will
only continue to grow.



Measuring the wrong success metrics
Service desk operations success has often been measured with metrics like
call throughput and mean call time – neither of which contributes to or
directly measures the effectiveness of incident management.

Even useful metrics like MTTR and MTBF alone aren’t enough to improve
incident management performance. They are there to help us identify an issue,
but they can’t answer the stickier, more qualitative questions of why and how
incidents occur and are resolved and how to improve those metrics.

Incident response team structures


Organizations vary in their approaches and maturity levels in supporting their
services.

· Ad hoc: Smaller and newly-formed companies often lack a formal incident
  management process. Instead, they rely on customer-reported outages to
  become aware of incidents. Spreadsheets are used for on-call scheduling,
  and email is used for communication and collaboration.

· Traditional: This approach combines service desks with strict processes
  to triage, escalate, and resolve incidents. This process prioritizes cost
  savings at the expense of agility. The slower response time of a team that
  starts incidents with entry-level employees and requires multiple levels of
  escalation can have an immediate impact on incident resolution timelines.

· Modern: This approach combines manual and automated processes to
  resolve incidents. It employs issue-tracking tools, monitoring platforms,
  and chat applications to respond to alerts and notifications. Individual
  teams are tasked with detecting and resolving incidents for the specific
  services they manage. Still, some challenges persist; for example, the need
  for coordination between individual service teams and centralized incident
  management groups.

Critical challenges are associated with all of these approaches, but everyone
agrees there is great value in resolving incidents faster to provide an
outstanding customer experience. As General Stanley McChrystal illustrates in
his book, Team of Teams, we want both efficiency and adaptability in incident
management. However, in the earlier stages of response, adaptability is more
important than efficiency because we must respond to a volatile environment
in the best ways possible.



Optimizing incident management practices
across teams
Outages impact the bottom line
Downtime often means not only lost revenue, but also compliance and
regulatory penalties, lost customers, and a climb in operational costs and
delays as IT professionals are pulled off other projects to resolve incidents.

A recent Information Technology Intelligence Consulting Corp (ITIC) report
emphasizes “…hourly downtime costs are on the upswing for all businesses
irrespective of size or vertical market. ITIC’s latest research indicates the
average cost of a single hour of downtime now exceeds $300,000 for 91% of
small and mid-size (SMEs) and large enterprises, with 500+ employees.”

Additionally, industry surveys indicate IT failures lead to significant brand
reputation damage that can last for months or years, which limits future
revenue. In fact, in a study conducted by Trustpilot, a positive online
reputation was the #1 most important factor when consumers were
considering whether or not to frequent a business, while having the best
quality product or service offering was the fourth most important factor.

Figures like these make it clear that lost revenue isn’t the only – or even the
most important – cost that incident management has to account for. An
optimized incident management process also needs to address the very real,
very expensive challenges of the people, processes, and technology behind
incident management.



How to optimize your IT incident management process
It’s clear it’s time to refocus our incident management efforts with processes,
team structures, and practices that reflect the new business realities of today.
But what does that refocusing process look like?

Prioritize and consolidate alerts


The primary culprit in alert fatigue and a significant contributor to lost
productivity is a surplus of meaningless, non-actionable alerts. The simplest
fix? Identify critical systems, de-duplicate redundant notifications, and create
a clear prioritization hierarchy for alerts.
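
The de-duplication and prioritization steps can be sketched as follows. The example assumes alerts carry a service name, a check name, and a priority label; the `Alert` fields and the three-level hierarchy are hypothetical, not a description of any particular alerting tool.

```python
from dataclasses import dataclass

# Hypothetical prioritization hierarchy: lower number is handled first.
PRIORITY_ORDER = {"critical": 0, "high": 1, "low": 2}


@dataclass(frozen=True)
class Alert:
    service: str
    check: str
    priority: str
    message: str


def consolidate(alerts):
    """Keep the first alert per (service, check), count its duplicates,
    and return (alert, count) pairs ordered by priority."""
    grouped = {}
    for alert in alerts:
        key = (alert.service, alert.check)
        if key in grouped:
            grouped[key][1] += 1
        else:
            grouped[key] = [alert, 1]
    ordered = sorted(grouped.values(),
                     key=lambda pair: PRIORITY_ORDER[pair[0].priority])
    return [(alert, count) for alert, count in ordered]
```

Keeping the duplicate count alongside the surviving alert preserves a signal of how noisy a check is without sending each copy to the on-call responder.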

Create an on-call schedule that works for your teams


Avoiding alert fatigue, burnout, and inefficiencies also means creating
an on-call schedule that works for your teams. This approach means not
overburdening any one person or team, providing backup support where
needed, and reevaluating the effectiveness of your schedule regularly.
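
A simple starting point for a fair schedule is a weekly round-robin in which the next person in line acts as backup. This is a minimal sketch under that assumption; real schedules usually also need to handle time zones, holidays, and swaps.

```python
from datetime import date, timedelta


def build_rotation(people, start, weeks):
    """Weekly round-robin: returns (week_start, primary, backup) tuples.

    The backup is the next person in the list, so at least two people
    are needed for the primary and backup to differ.
    """
    if len(people) < 2:
        raise ValueError("need at least two people for primary and backup")
    schedule = []
    for week in range(weeks):
        primary = people[week % len(people)]
        backup = people[(week + 1) % len(people)]
        schedule.append((start + timedelta(weeks=week), primary, backup))
    return schedule
```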

Automate where you can


Automation remains a priority in the incident management process, and
recent research on the state of incident management shows that
organizations continue automating various procedures.

[Figure: Automated incident management processes, with bars for 2020, 2021,
and 2023. Values shown for each process:
· Incident communication (Statuspage, email, etc.): 67%, 63%, 50%
· Ticket creation (Jira, Jira Service Management, etc.): 66%, 52%, 48%
· Chat channel creation (Slack, Microsoft Teams): 53%, 59%, 43%
· On-call notifications from monitoring tools: 58%, 57%, 42%
· Visibility into recent deployments: 37%
· Change record creation for standard changes: 46%, 36%
· Postmortem creation: 33%, 28%, 29%
· Other: 3%, 1%, 1%]



It’s easy to lose focus when you’re manually sifting through dozens of reports
to identify and escalate the ones that matter. The good news is that this is no
longer something that has to be done manually by a team member, and you
can avoid lost productivity and alert fatigue by removing it from the task list
through automation.

Alert routing, notification, deduplication, message workflows, conference
bridge creation, status page updates, on-call scheduling, escalation processes,
and KPI tracking can also be wholly or partially automated to save the team
time and reduce human error in set, repetitive tasks. Automation also saves
the company money over time.
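
As one example of a repetitive task that lends itself to automation, an escalation process can be expressed as data plus a small rule. The policy below is entirely hypothetical (the delays and role names are placeholders) and only illustrates the idea of notifying more people the longer an alert goes unacknowledged.

```python
# Hypothetical policy: (minutes without acknowledgement, who to notify).
ESCALATION_POLICY = [
    (0, "primary on-call"),
    (10, "secondary on-call"),
    (30, "engineering manager"),
]


def who_to_notify(minutes_unacknowledged, policy=ESCALATION_POLICY):
    """Everyone who should have been notified by now for an alert that
    has gone unacknowledged for the given number of minutes."""
    return [target for delay, target in policy
            if minutes_unacknowledged >= delay]
```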

To learn more about Jira Service Management low code/no code automation
capabilities, visit our ITSM automation template library.

Communicate effectively across channels and stakeholders


Incidents impact various stakeholders – often both internal and external – and
those stakeholders need to be informed. Studies show that 87% of business
stakeholders want updates on incidents (and 56% are more frustrated by lack
of communication than by the incident itself). And customers feel the same.

At a time when always-on is the expectation, a solid incident communication
plan is vital to the optimization puzzle.



Make it easy to track the right metrics
The easier it is to track success metrics and review them, the more likely your
team is to keep up with them. Automate reporting where possible and get
clear up front on which metrics matter for your team and why.

Conduct blameless post-incident reviews


An incident isn’t over just because the app or database is back online.
To prevent incidents, reduce time spent on future incidents, and better
understand how your processes, teams, and policies impact your incident
management, you must conduct post-incident reviews.

At Atlassian, our PIRs are blameless, which means they focus on improving
performance and moving forward – not finding someone to blame.

Choose technology that supports your processes and needs


Automation. Alert prioritization. Service configuration management. KPI
tracking. To be effective, each of these essential processes needs technology
that supports it. Before choosing your technology, ensure you understand
your goals, processes, and team needs. As software and services become
increasingly complex, Atlassian has found that high-performing teams adopt
a collaborative and proactive approach to plan for, respond to, and learn from
every incident. Jira Service Management provides an incident management
solution to help IT teams escalate, bring in the right responders, swarm, and
ultimately minimize downtime.



Conclusion
Incident management practices are changing. While
incident management was once a well-defined process in
the IT organization focusing primarily on service availability,
it has evolved to embrace DevOps and SRE practices, with
an added focus on service performance. Because business
success is now driven by customer experience, diverse
teams must work together to increase engagement,
reduce churn, and deliver the digital services people rely
on. Organizations practicing traditional – and even modern
– incident management must evolve their approach to
address issues systematically. Process advancements must
include increasingly leveraging data, automation, and AI to
enable human collaboration and continuous improvement.

Organizations need a platform that integrates a variety
of cultures, toolsets, architectures, and methodologies
to provide consistent, efficient, and measurable incident
management. Many Atlassian customers have achieved
significant business process improvements and cost
benefits when they implement Jira Service Management.



The business impact of Jira Service Management
According to Forrester Consulting’s Total Economic Impact™ report,
enterprises that replace their existing ITSM systems with Jira Service
Management realize the following three-year financial impact:

· 277% ROI
· 155 hrs recovered per month for IT operations teams
· $1.4M improvement in service desk productivity
· $2.0M saved by switching from a traditional ITSM product

Whether you’re already in the Atlassian ecosystem or
you’re making a switch from a legacy ITSM application, Jira
Service Management can help you modernize your incident
management practices.

To take the next steps in your modernization journey,
visit our website and start a free trial of Jira Service
Management.



©2023 Atlassian. All Rights Reserved. CSD-5786_DRD-07/23
