Incident Management Handbook JSM
When teams face an incident, they need a plan that helps them:
· Align teams as to what attitude they should bring to each part of incident
identification, resolution, and reflection.

RESPOND
· Initial diagnosis: Ideally, your front-line support team can see an incident
from diagnosis through close, but if they can’t, the next step is to log all
the pertinent information and escalate to the next-tier team.
· Escalate: The next team takes the logged data and continues with the
diagnosis process, and if this next team can’t diagnose the incident, it
escalates to the next team.
Increasingly the software you rely on for life and work is not being hosted on a server
in the same physical location as you. It’s likely a web-accessed application deployed
in a data center for thousands or millions of users around the globe. For teams tasked
with running these services, agility and speed are paramount. Any downtime has the
potential to affect thousands of organizations, not just one.
An advantage of the “you build it, you run it” approach is that it offers the flexibility
agile teams need, but it can also obscure who is responsible for what and when.
DevOps teams can be comfortable – and successful – with less structured development
processes. But it’s best to standardize on a core set of procedures for incident
management so there is no question about how to respond in the heat of an incident,
and you can track issues and report how they’re resolved.
· The engineer who built it is the best person to fix it: The central idea of
the “you build it, you run it” ethos is that the people most familiar with
the service (the builders) are the best equipped to fix an outage.
· Build with speed, but practice accountability: When engineers know that
they and their teammates are on the hook during outages, there’s added
incentive to ensure you’re deploying quality code.
This approach ensures fast response times and quick feedback to the teams
that need to build a reliable service.
Co-existing methodologies
The best-performing incident teams use a collection of the right tools, practices,
and people. Each of these frameworks and best practices add value to your
organization’s incident management process and can accelerate and improve
your teams’ incident response. These methodologies can coexist together to
align teams, meet customer demands, and improve the value delivered.
Some tools are specific to incident management, others are more general-
purpose tools your team also uses for other tasks. And some tools might
be an in-house solution built upon layers of integrations and customization.

[Figure: incident management tooling across the incident lifecycle. Before
the incident: monitoring, alerting, on-call, service desk, and service
configuration data. During the incident: team communication, customer
communication, issue tracking, and the incident command center. After the
incident: postmortem and analysis.]

No matter the use case, good incident management tools have a few things
in common. The best incident management tools are open, reliable, and
adaptable.

Open: In a high-pressure environment like an incident, it’s essential that
the right people have access to the right tools and information immediately.
This goes not only for incident responders but for company stakeholders who
need visibility into response efforts.
Monitoring
Monitoring tools let DevOps and IT Ops teams collect, aggregate, and trigger alerts off
data from thousands of different services in real-time. These are critical to providing
complete visibility into the health of your services and often trigger the first alarm
bells during an incident.
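The aggregate-and-alert idea can be sketched in a few lines. This is a minimal illustration in Python; the rolling-window rule, metric, and threshold are assumptions for the example, not the behavior of any specific monitoring tool:

```python
from statistics import mean

def should_alert(samples, threshold, window=5):
    """Fire an alert when the rolling average of the last `window`
    samples crosses the threshold (an illustrative rule)."""
    if len(samples) < window:
        return False  # not enough data aggregated yet
    return mean(samples[-window:]) > threshold

# e.g. p99 latency samples (ms) streamed from a service
latency_ms = [120, 135, 128, 560, 610, 590, 640, 655]
print(should_alert(latency_ms, threshold=500))  # True: the window is slow
```

Real monitoring systems add deduplication, recovery notifications, and routing on top of a check like this, but the core loop is the same: collect, aggregate, compare against a threshold.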
BENEFITS
Monitoring tools give your team constant insight into the health of the infrastructure.
Modern monitoring tools also proactively trigger alerts during unexpected activity.
· 24/7 coverage and analytics: Does the tool have visibility into all my
servers and infrastructure? Can my team see real-time analytics and
dashboards and set alerting thresholds?
· Integrates with alerting tools: Does the product integrate with my
alerting and on-call tool?
Service desk
Service desk software gives customers and employees a place to report incidents and
potential incidents.
BENEFITS
Along with their many other use cases (service requests, IT help desk), service desks
empower your team to quickly learn about incidents from the people who matter
most: your users and customers.
· Enable self-serve: Can customers quickly file tickets through a service
portal? Can customers find the help they need with automated knowledge
base suggestions?
Service configuration data
BENEFITS
This means you can quickly investigate and communicate all aspects of the
incident.
· Flexibility: How flexible is the CMDB? Can I store any CI or asset?
· Integrations: Can I visualize my infrastructure graphically? Can I link
CIs/assets with my service desk issues? Can I link CIs/assets to change
requests?
Customer communication
Customer communication tools help keep customers informed during an incident.
BENEFITS
There’s no getting around it, incidents are typically a bad experience for your
customers. Keeping customers informed builds trust and speeds up response efforts.
Communicating with customers lets them know you’re aware of the incident and
working on a fix.
Postmortem and analysis
BENEFITS
After an incident is resolved, teams still often don’t know the root causes and are at
risk of the same incident happening again. Post-incident reviews help to prevent that
by bringing the team together for an after-action analysis.
Issue tracking
BENEFITS
In many cases, resolving the incident brings the service back online without addressing
the root cause. Typically there is more engineering work that needs to be done in
order to remediate root causes and make sure the incident doesn’t repeat itself.
Issue and work tracking tools — which your team is hopefully already using for other
development work — help make sure this work is prioritized and doesn’t fall through
the cracks.
02
Incident response
Detect the incident
Ideally, monitoring and alerting tools will detect and inform your team about
an incident before your customers even notice. Though sometimes you’ll first
learn about an incident from social media or customer support tickets. No
matter how the incident is detected, your first step should be to record that a
new incident is open in a tool for tracking incidents.
When it comes to detecting incidents and outages early, effective monitoring
is the eyes and ears for IT Operations. For system-detected incidents, Jira
Service Management easily integrates with over 200 app and web services,
such as Slack, Datadog, Sumo Logic, and Nagios, to sync alert data and
streamline your workflow.
Additionally, incident managers can review recent changes to the affected
services as well as similar incidents to gather more information as to the
cause of the ongoing incident.
Set up team communication channels
One of the first things the incident manager (IM) does when they come online
is set up the incident team’s communication channels. The goal at this point
is to establish and focus all incident team communications in well-known
places, such as:
· Video chat in a conferencing app like Zoom or Microsoft Teams (or if you’re
all in the same place, gather the team in a physical room).
At Atlassian, we prefer using both video chat and text chat tools during
incidents because both excel at different things. Video chat is great for quickly
creating a shared mental picture of the incident through group discussion.
And Slack helps generate a timestamped record of the incident, along with
collected links to screenshots, runbooks, and dashboards.
Runbooks are great for documenting common troubleshooting methods to
address alerts and resolve outages. By using Confluence for runbooks, your
IT staff has all the information they need to quickly triage an incident, right at
their fingertips.
Assess the impact
After the incident team’s communication channels are set up, it’s time to
assess the incident so the team can decide what to tell people about it and
who needs to fix it.
We have the following set of questions that incident managers ask their teams:
· Are there other factors, e.g., social media posts, security, or data loss?
The next step typically is to assign an impact level or value to the incident.
For some teams, usually those involved with DevOps or SRE, the incident
impact is tracked as a severity level.
Using a numbering system for severity levels helps define and communicate
the incident quickly. All someone has to say is “We might have a Sev 1
happening,” and the right people can immediately understand the seriousness
of the matter even before getting additional information. This is an example
of Atlassian’s severity definitions for service outages and performance
degradation.
Duration     Degraded service (least → most severe degradation)
< 5 mins     Not an Incident | Not an Incident | SEV3 | SEV2
5-60 mins    Not an Incident | SEV3 | SEV2 | SEV1
Severity levels can also help build guidelines for response expectations. At
some companies, for example, Severity 3 incidents can be addressed during
business hours, while Severity 1 and 2 require paging team members for
an immediate fix. Incident severity definitions should be documented and
consistent throughout the organization.
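Definitions like the matrix above can be encoded directly so that tooling applies them consistently. Here is a minimal sketch in Python; the duration buckets follow the example matrix, while the degradation labels are assumptions chosen for illustration:

```python
# Rows: duration bucket; columns: degradation level (labels assumed).
SEVERITY_MATRIX = {
    "under_5_min": ["not_an_incident", "not_an_incident", "SEV3", "SEV2"],
    "5_to_60_min": ["not_an_incident", "SEV3", "SEV2", "SEV1"],
}
# Assumed ordering from least to most severe degradation.
DEGRADATION_LEVELS = ["minor", "moderate", "major", "full_outage"]

def classify(duration_bucket: str, degradation: str) -> str:
    """Look up the severity for a duration bucket and degradation level."""
    return SEVERITY_MATRIX[duration_bucket][DEGRADATION_LEVELS.index(degradation)]

print(classify("5_to_60_min", "full_outage"))  # SEV1
```

Keeping the matrix in one shared place (rather than in each responder's head) is what makes "We might have a Sev 1 happening" mean the same thing to everyone.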
Other teams, typically ITOps, classify incidents in terms of urgency and impact,
and these values are used to calculate an overall priority for the incident.
In Jira Service Management, the severity and priority levels can be associated
with various service level agreements (SLAs). These values, as well as a major
incident indicator, are tracked for all incidents.
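For the urgency-and-impact approach, priority is typically derived from a small lookup matrix. The mapping below is an illustrative ITIL-style example sketched in Python, not Jira Service Management's built-in scheme:

```python
# Priority = f(impact, urgency); 1 = highest. Illustrative ITIL-style matrix.
PRIORITY = {
    ("high", "high"): 1,   ("high", "medium"): 2,   ("high", "low"): 3,
    ("medium", "high"): 2, ("medium", "medium"): 3, ("medium", "low"): 4,
    ("low", "high"): 3,    ("low", "medium"): 4,    ("low", "low"): 5,
}

def priority(impact: str, urgency: str) -> int:
    """Combine impact and urgency into an overall priority."""
    return PRIORITY[(impact, urgency)]

print(priority("high", "high"))  # 1
```

The exact cell values matter less than agreeing on them once, documenting them, and wiring them to SLAs so response expectations follow automatically.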
Communications with customers
Once a team establishes that the incident is real, it’s best to communicate
to internal and external stakeholders as soon as possible. Internal
communication’s goal is to focus the incident response in one place and
reduce confusion. External communication’s purpose is to tell customers the
team is aware something’s broken and you’re looking into it. Communicating
quickly and accurately helps build trust with customers and the rest of the
organization.
Tip
In Statuspage, you can create incident communication templates to use as
a starting point. Fields like the incident name and message will be pre-filled
and ready for your quick review before you send the information off to your
customers. This saves time and relieves some of the stress involved when in
the midst of an incident.
Here are two simple templates for updating an internal or external page:
But are you being transparent with your customers? Are your support teams
starting to see queues fill up with tickets and social media messages?
Good incident response isn’t just about getting services back up quickly –
it’s about being upfront and frequently updating your customers.
If you want to look at how well you’re doing with communications and learn
where you can improve, visit Atlassian Team Playbook – Incident response
communications for free workshop resources.
Escalate to the right responders
Sometimes, the initial responders are the ones who resolve the incident.
More often than not, those responders need to bring other teams into the
incident by paging them using an alerting tool. With Jira Service Management,
responders can take their pick as to what alerting method they use, or even
use them all in one central location.
Delegate incident response roles
After a new incident responder is paged and comes online, the incident
manager delegates a role to them. It’s important that everyone working on an
incident understands what’s required of their role and how to contribute to the
incident team quickly and effectively.
· Tech lead
The tech lead is a senior technical responder. The tech lead responsibilities
are to develop theories about what’s broken and why, decide on changes,
and run the technical team. This person works closely with the incident
manager.
· Communications manager
The person familiar with public communications, possibly from the
customer support team or public relations. They are responsible for
writing and sending both internal and external communications about
the incident.
Send follow-up communications
You already sent out initial communications. Once the incident team is rolling,
you have to update staff and customers on the incident, and as the incident
progresses you need to keep them looped in.
Did we say how many internal and external customers are affected?
Jira Service Management gives you the ability to add people as stakeholders
and update them by sending email messages.
Iterate incident analysis and recovery
There’s no single prescriptive process that will resolve all incidents – if there
were, we’d simply automate that and be done with it. Instead, we iterate on
the following process to quickly adapt to various incident response scenarios:
· Observe what’s going on and share those observations with the team.
· Develop theories about why it’s happening.
· Develop experiments that prove or disprove those theories. Carry those out.
· Repeat
For example, you might observe a high error rate in a service corresponding
with a fault that your regional infrastructure provider has posted on their
Statuspage. You might theorize that the fault is isolated to this region, decide
to fail over to another region, and observe the results. Process aficionados will
recognize this as a generalization of the Deming “Plan-Do-Check-Act” cycle,
the “Observe-Orient-Decide-Act” cycle, or simply the scientific method.
The biggest challenges for the incident manager at this point are around
maintaining the team’s discipline:
· Are roles clear? Are people doing their jobs? Do we need to escalate to
more teams?
In any case, don’t panic – it doesn’t help. Stay calm, and the rest of the team
will take that cue.
The incident manager has to keep an eye on team fatigue and plan team
handovers. A dedicated team can risk burning themselves out when resolving
complex incidents – incident managers should look out for how long members
have been awake and how long they’ve been working on the incident, and
decide who’s going to fill their roles next.
Resolve the incident
An incident is resolved when the current or imminent business impact has
ended. At that point, the emergency response process ends and the team
transitions onto any cleanup tasks and the post-incident review.
There are many moving parts to the incident response process. Keeping
track of each step with seamless communication is easy with an incident
management tool like Jira Service Management. Centralized alerts, flexible
communications channels, and unified work tracking are central to resolving
incidents quickly.
03
Post-incident reviews
The importance of a post-incident review
Incidents happen. They just do. As our systems grow in scale and complexity,
failures are inevitable.
The best way to work through what happened during an incident and capture
any lessons learned is by conducting a post-incident review, also known as an
incident postmortem.
Thanks to tools like version control, feature flags, and continuous delivery, a lot
of incidents can be quickly “undone.” Many incidents are caused by some bug
in a change pushed to production, and rolling back that change can get the app
up and running again. This is really beneficial for everyone: it gets the
service working again quickly. But it often doesn’t help you understand what
failed and
why. This is where post-incident reviews come in.
[Figure: the delivery loop – plan, build, deploy, incident, post-incident
review, and back to plan.]
In Jira Service Management, PIRs are a work category that you can link to the
primary incident, subtasks, Jira Software issues, etc., so all critical actions
that occurred during the incident are documented in a timeline and related
Jira tickets are included for reference and remediation.
Best practices for a post-incident review
How you approach your post-incident review is just as crucial as the checklist
of steps you take. Tensions can run high in the wake of an incident. The key to
getting people to come to the process engaged and ready to tackle a complex
problem is to give them a sense of psychological safety.
While it’s important to keep the conversation safe and objective, getting to
the root cause of the incident is critical to resolving it. You can use a technique
in your meeting called “The Five Whys.” Start by making sure everyone agrees
on what the problem is. Then, ask why this happened, and then ask “why” to
the answer to that question. Repeat this process at least five times to ensure
you uncover all the deep factors contributing to the problem. Make sure the
room doesn’t try to steer away from an uncomfortable truth or try to reach
an easy consensus. You can learn more about “The Five Whys” approach with
Atlassian’s playbook.
Review every single post-incident report, and ingrain
lessons learned into your process
An unreviewed post-incident report might as well never have been written.
Once a post-incident report is drafted, it’s important to review it to close out
any unresolved issues, capture ideas to consider in the future, and finalize
the report. You may even say that the incident isn’t truly resolved until this
analysis has taken place.
How do you make this happen? Schedule a recurring meeting with engineering
(and anyone else who may have an interest, like customer support or account
managers), at least monthly, to review post-incident reports. You can choose
to review recent reports or perhaps review older reports and share lessons
that are still relevant today.
Usually, the team that delivers the service that caused the incident is
responsible for completing the associated post-incident review. They nominate
one person to be accountable for completing the review, and the issue is
assigned to them. They are the “PIR owner” and they drive the review through
drafting and approval, all the way until it’s published.
INCIDENT CAUSE CATEGORIES
The categories we use are tailored to our own business as a software company.
You may find that different categories work better for your business.
POST-INCIDENT REVIEW ACTIONS
Sue Lueder and Betsy Beyer from Google have an excellent presentation
and article on postmortem action items, which we use at Atlassian to
prompt the team. This section references their suggestions. Work through
the questions below to help ensure the postmortem covers both short- and
long-term fixes.
“Mitigate future incidents” and “Prevent future incidents” are your most likely
sources of actions that address the root cause. Be sure to get at least one of
these.
We also use Lueder and Beyer’s advice on wording for our post-incident
review actions:
The right wording for a PIR action can make the difference between an easy
completion and an indefinite delay due to infeasibility or procrastination.
A well-crafted PIR action should have these properties:
From: Investigate monitoring for this scenario.
To: (Actionable) Add alerting for all cases where this service returns
>1% errors.

From: Fix the issue that caused the outage.
To: (Specific) Handle invalid postal code in user address form input safely.
Visit Postmortem Action Items: Plan the Work and Work the Plan for the full
source material.
An effective post-incident review plan
For post-incident reviews to be effective–and allow you to build a culture of
continuous improvement–you want to implement a simple, repeatable process
in which everyone can participate. How you do this will depend on your culture
and your team. At Atlassian, we’ve developed the following approach that
works for us.
Set a threshold
Incidents in your organization should have clear and measurable severity levels.
These severity levels can trigger the post-incident review process. For example,
any incident Sev-1 or higher triggers the PIR process, while the review can be
optional for less severe incidents. Consider allowing team leads or management
to request a review for any incident that doesn’t meet the threshold.
TIP
Post-incident reviews can also be created using Jira Service Management’s
native automation engine. For example, you can set an automation rule to
create a post-incident review each time a major or critical priority incident
is resolved by your team.
Don’t procrastinate
It’s important to take a break and get some rest after an incident. But don’t
delay writing the post-incident review. Wait too long, and important details
might be lost or forgotten. Ideally, it’s drafted immediately after a post-incident
review meeting. If possible, the team should hold a review meeting within
24-48 hours of the incident resolution, and no more than five business
days after.
Include a timeline
A timeline is a very helpful aid in incident documentation. Often it’s the
first place your readers’ eyes jump to when trying to quickly size up what
happened. Try to be as clear and specific as possible. For example, “11:14 am
Pacific Standard Time,” not “around 11.” Being specific with timestamps allows
you to map out a high-fidelity chain of events, which is useful to identify areas
of improvement. For example, you might identify that the interval between
when impact started and when customers were notified was too long.
· Time of resolution
Details, details, details
Skimping on details is a quick path to writing post-incident reviews that are
unhelpful and unclear. Add as many details as possible about what happened
and what was done during the incident. Instead of “then public comms went
out,” say “We sent the initial public comms announcing the incident on our
public status page and social media accounts.”
Wherever possible, include links and names, links to tickets and status
updates, links to incident state documents, and monitoring charts. Don’t be
afraid to add screenshots of relevant graphics or dashboards. A graph from
your monitoring system that clearly shows the incident’s start and end times
(for example, a drop in request rate followed by a return to normal) is very
valuable because it’s unambiguous. It becomes even more powerful when
combined with graphs that show what was happening behind the scenes
during that time, for example, database connections, network link state, or
CPU/memory/IO/bandwidth consumption over the same timeframe.
· The severity of the incident so that you can determine the relative
reliability of your systems.
Use checklists and templates to streamline the process
To ensure that your team develops a culture around post-incident reviews,
make it easy to capture information, schedule meetings, and publish the final
report with reusable checklists and templates. A repeatable process provides
consistency for teams, helps people know what to expect, and encourages
participants to engage in the PIR with a productive mindset.
· Review of report
· Presentation of report
The most important tip? Don’t skip any steps. The key to conducting post-
incident reviews that help you improve your team and systems is to have a
process and stick to it.
04
Incident management analytics
How to choose incident management
key performance indicators and metrics
In today’s always-on world, tech incidents come with significant consequences.
Downtime costs continue to rise, with the majority of outages costing at least
$100,000, and the price is increasing (39% from 2019 to 2022). But the
monetary impact is far from the only cost to businesses. CIOs indicate that
incidents result in lower customer satisfaction, data loss, loss of reputation,
and SLA payouts.

Additional research found that the number of outages is increasing, and it is
taking longer for businesses to recover from them, with MTTR (Mean Time To
Repair) ramping up by almost two hours since 2020.

[Callouts: ≥$100K – cost of most major outages; 39% – increase in the cost of
outages in recent years; +2 hrs – increase in MTTR since 2020.]
With so much at stake, it’s more important than ever for teams to track
incident management data and use their findings to detect, diagnose, fix, and –
ultimately – prevent incidents.
The good news is that with web and software incidents (unlike mechanical
and offline systems), teams usually are able to capture a lot more data to help
them understand and improve.
The bad news? Sometimes too much data can obscure issues instead of
illuminating them.
The downside to KPIs is that it’s easy to become too reliant on shallow
data. Knowing that your team isn’t resolving incidents fast enough won’t get
you to a fix, because you still need to know how and why the team is or isn’t
resolving issues. And you still need to know if the issues you’re comparing
are actually comparable.
KPIs can’t tell you how your teams approach tricky problems. They can’t
explain why your time between incidents has been getting shorter instead
of longer. They don’t know why Incident A took three times as long as
Incident B.
For that, you need insights. And while the data can be a starting point on
the way to those insights, it can also be a stumbling block. It can make us
feel like we’re doing enough even if our metrics aren’t improving. It can lump
together incidents that are actually dramatically different and should be
approached differently. It can discount the experience of your teams and the
underlying complication of incidents themselves.
The point isn’t that KPIs are bad. We don’t think you should throw the baby
out with the bathwater. The point is that KPIs aren’t enough. They’re a
starting point. They’re a diagnostic tool. They’re the first step down a more
complex path to actual improvement.
“ Incidents are much more unique than conventional wisdom would have you
believe. Two incidents of the same length can have dramatically different
levels of surprise and uncertainty in how people came to understand what was
happening. They can also contain wildly different risks with respect to taking
actions that are meant to mitigate or improve the situation. Incidents are not
widgets being manufactured, where limited variation in physical dimensions is
seen as key markers of quality.”
Tracking KPIs for incident management can help identify and diagnose
problems with processes and systems, set benchmarks and realistic goals for
the team to work toward, and provide a jumping-off point for larger questions.
For example, let’s say the business’ goal is to resolve all incidents within
30 minutes, but your team is currently averaging 45 minutes. Without specific
metrics, it’s hard to know what’s going wrong. Is your alert system taking too
long? Is your process broken? Do your diagnostic tools need to be updated?
Is it a team problem or a tech problem?
Now, add some metrics: If you know exactly how long the alert system is
taking, you can identify it as a problem or rule it out. If you see that diagnostics
are taking up more than 50% of the time, you can focus your troubleshooting
there. If you see that Team B is taking 25% more time than Teams A, C, and D,
you can start to dig into why.
KPIs won’t automatically fix your problems, but they will help you understand
where the problem lies and focus your energy on digging deeper in the right
places.
[Chart: time spent on diagnosis vs. repairs for Teams A through D.]
Are incidents happening more or less frequently over time? Is the number of
incidents acceptable, or could it be lower? Once you identify a problem with
the number of incidents, you can ask why that number is trending upward or
staying high and what the team can do to resolve the issue.
MTBF
MTBF (mean time between failures) is the average time between repairable
failures of a tech product. It can help you track availability and reliability across
products. The higher the time between failures, the more reliable the system.
As with other metrics, it’s a good jumping-off point for more extensive
questions. If your MTBF is lower than you want, it’s time to ask why the
systems are failing so often and how you can reduce or prevent future failures.
MTTA
MTTA (mean time to acknowledge) is the average time it takes between
a system alert and when a team member acknowledges the incident
and begins working to resolve it. The value here is in understanding how
responsive your team is to issues.
MTTD
MTTD (mean time to detect) is the average time it takes your team to discover
an issue. This term is often used in cybersecurity when teams are focused on
detecting attacks and breaches.
If this metric changes drastically or isn’t quite hitting the mark, it’s, yet again,
time to ask why.
MTTR
MTTR can stand for mean time to repair, resolve, respond, or recovery.
Arguably, the most useful of these metrics is the mean time to resolve, which
tracks the time spent diagnosing and fixing an immediate problem and the
time spent ensuring the issue doesn’t happen again. Recovery is a primary
DevOps metric that DevOps Research and Assessment (DORA) notes is key to
measuring the stability of a DevOps team.
Again, this metric is best when used diagnostically. Are your resolution times as
quick and efficient as you want them to be? If not, it’s time to ask more profound
questions about how and why said resolution time is missing the mark.
[Diagram: an incident timeline – outage begins; devs see an alert; repair
begins; repair is complete and the system is restored; the team finds a fix
to prevent future outages; the product fails completely and needs to be
replaced – with MTTRespond, MTTRepair, MTTResolve, MTTRecovery, and MTBF
spanning the corresponding intervals.]

· Mean time to recovery tells you how quickly you can get your systems back
up and running.
· Layer in mean time to respond and you get a sense for how much of the
recovery time belongs to the team and how much is your alert system.
· Further layer in mean time to repair and you start to see how much time
the team is spending on repairs vs. diagnostics.
· Add mean time to resolve to the mix and you start to understand the full
scope of fixing and resolving issues beyond the actual downtime they cause.
· Fold in mean time between failures, and the picture gets even more
extensive, showing you how successful your team is at preventing or reducing
future issues.

And then add mean time to failure to understand the entire lifecycle of a
product or system.
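These layered means all reduce to simple averages over per-incident timestamps. A sketch in Python, with hypothetical field names and data chosen for illustration:

```python
from datetime import datetime as dt

incidents = [  # hypothetical incident records
    {"alerted": dt(2024, 1, 3, 9, 0), "acknowledged": dt(2024, 1, 3, 9, 4),
     "resolved": dt(2024, 1, 3, 10, 0)},
    {"alerted": dt(2024, 2, 10, 14, 0), "acknowledged": dt(2024, 2, 10, 14, 10),
     "resolved": dt(2024, 2, 10, 16, 0)},
]

def mean_minutes(pairs):
    """Average the (start, end) intervals, in minutes."""
    deltas = [(end - start).total_seconds() / 60 for start, end in pairs]
    return sum(deltas) / len(deltas)

# MTTA: alert -> acknowledgement; MTTResolve: alert -> resolution.
mtta = mean_minutes([(i["alerted"], i["acknowledged"]) for i in incidents])
mttr = mean_minutes([(i["alerted"], i["resolved"]) for i in incidents])
# MTBF: time between the end of one failure and the start of the next.
mtbf = mean_minutes([(incidents[0]["resolved"], incidents[1]["alerted"])])

print(f"MTTA {mtta:.0f} min, MTTResolve {mttr:.0f} min, MTBF {mtbf/60:.0f} h")
```

The point of computing these from raw timestamps, rather than reporting them in isolation, is that the same records let you break the intervals down further when a number looks wrong.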
SLA
An SLA (service level agreement) is an agreement between a provider
and client about measurable metrics like uptime, responsiveness, and
responsibilities.
The promises made in SLAs (about uptime, mean time to recovery, etc.) are
one of the reasons incident management teams need to track these metrics.
If and when things like average response time or mean time between failures
change, contracts need to be updated, and fixes need to happen – quickly.
SLO
An SLO (service level objective) is an agreement within an SLA about a specific
metric like uptime. As with the SLA, SLOs are essential metrics to track to
ensure the company upholds its end of the bargain regarding customer service.
Timestamps help teams build out timelines of the incident, along with the
lead-up and response efforts. A clear, shared timeline is one of the most
valuable artifacts during a post-incident review.
Uptime
Uptime is the amount of time (represented as a percentage) that your
systems are available and functional.
Tracking your success against this metric is all about making and keeping
customer promises. And, as with other metrics, it’s just a starting point. If
your uptime isn’t at 99.99%, the question of why will require more research,
conversations with your team, and investigation into process, structure,
access, or technology.
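Uptime itself is a single ratio over a reporting window. A minimal sketch in Python; the downtime figure is an assumption chosen to illustrate "four nines":

```python
def uptime_percent(total_minutes: float, downtime_minutes: float) -> float:
    """Share of the window during which the system was available."""
    return 100 * (total_minutes - downtime_minutes) / total_minutes

# A 30-day month is 43,200 minutes; 99.99% uptime allows ~4.3 of them down.
month = 30 * 24 * 60
print(round(uptime_percent(month, 4.32), 4))  # 99.99
```

Working backwards from a target percentage to an allowed-downtime budget, as in the comment above, is the usual way teams turn an uptime promise into something operationally actionable.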
Jira Service Management offers report and dashboard features so your team
can track KPIs and monitor and optimize your incident management practice.
Even useful metrics like MTTR and MTBF alone aren’t enough to improve
incident management performance. They are there to help us identify an issue,
but they can’t answer the stickier, more qualitative questions of why and how
incidents occur and are resolved and how to improve those metrics.
Critical challenges are associated with all of these approaches, but everyone
agrees there is great value in resolving incidents faster to provide an
outstanding customer experience. As General Stanley McChrystal illustrates in
his book, Team of Teams, we want both efficiency and adaptability in incident
management. However, in the earlier stages of response, adaptability is more
important than efficiency because we must respond to a volatile environment
in the best ways possible.
Figures like these make it clear that lost revenue isn’t the only – or even the
most important – priority for incident management. An optimized incident
management process also needs to address the very real, very expensive
challenges of the people, processes, and technology behind incident
management.
[Chart: automated incident management tasks, share of respondents by year]

                                         2020   2021   2023
Incident communication                    67%    63%    50%
(Statuspage, email, etc.)
Ticket creation                           66%    52%    48%
(Jira, Jira Service Management, etc.)
Chat channel creation                     53%    59%    43%
(Slack, Microsoft Teams)
On-call notifications                     58%    57%    42%
from monitoring tools
Postmortem creation                       33%    28%    29%
Other                                      3%     1%     1%
To learn more about Jira Service Management low code/no code automation
capabilities, visit our ITSM automation template library.
At Atlassian, our PIRs are blameless, which means they focus on improving
performance and moving forward – not finding someone to blame.