Post-Incident Reviews
Learning From Failure for Improved
Incident Response
Jason Hand
978-1-491-98693-6
Table of Contents
Foreword. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
2. Old-View Thinking. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
What's Broken? 8
The Way We've Always Done It 8
Change 11
5. Continuous Improvement. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Creating Flow 26
Eliminating Waste 26
Feedback Loops 27
6. Outage: A Case Study Examining the Unique Phases of an Incident 35
Day One 35
Day Two 38
11. Readiness. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Next Best Steps 91
Foreword
"I know we don't have tests for that, but it's a small change; it's probably fine..."
"I ran the same commands I always do, but...something just doesn't seem quite right."
"That rm -rf sure is taking a long time!"
If you've worked in software operations, you've probably heard or uttered similar phrases. They mark the beginning of the best Ops horror stories the hallway tracks of Velocity and DevOps Days the world over have to offer. We hold onto and share these stories because, back at that moment in time, what happened next to us, our teams, and the companies we work for became an epic journey.
Incidents (and managing them, or...not, as the case may be) are far from a new field: indeed, as an industry, we've experienced incidents as long as we've had to operate software. But the last decade has seen a renewed interest in digging into how we react to, remediate, and reason after-the-fact about incidents.
This increased interest has been largely driven by two tectonic shifts playing out in our industry: the first began almost two decades ago and was a consequence of a change in the types of products we build. An era of shoveling bits onto metallic dust-coated plastic and laser-etched discs that we then shipped in cardboard boxes to users to install, manage, and operate themselves has given way to a cloud-connected, service-oriented world. Now we, not our users, are on the hook to keep that software running.
The second industry shift is more recent, but just as notable: the DevOps movement has convincingly made the argument that if you build it, you should also be involved (at least in some way) in running it, a sentiment that has spurred many a lively conversation about who needs to be carrying pagers these days! This has resulted in more of us, from ops engineers to developers to security engineers, being involved in the process of operating software on a daily basis, often in the very midst of operational incidents.
I had the pleasure of meeting Jason at Velocity Santa Clara in 2014, after I'd presented "A Look at Looking in the Mirror," a talk on the very topic of operational retrospectives. Since then, we've had the opportunity to discuss, deconstruct, and debate (blamelessly, of course!) many of the ideas you're about to read. In the last three years, I've also had the honor of spending time with Jason, sharing our observations of and experiences gathered from real-world practitioners on where the industry is headed with post-incident reviews, incident management, and organizational learning.
But the report before you is more than just a collection of the whos, whats, whens, wheres, and (five) whys of approaches to post-incident reviews. Jason explains the underpinnings necessary to hold a productive post-incident review and to be able to consume those findings within your company. This is not just a postmortem "how-to" (though it has a number of examples!): this is a postmortem "why-to" that helps you to understand not only the true complexity of your technology, but also the human side that together make up the socio-technical systems that are the reality of the modern software we operate every day.
Through all of this, Jason illustrates the positive effect of taking a New View of incidents. If you're looking for ways to get better answers about the factors involved in your operational incidents, you'll learn myriad techniques that can help. But more importantly, Jason demonstrates that it's not just about getting better answers: it's about asking better questions.
No matter where you or your organization are in your journey of
tangling with incidents, you have in hand the right guide to start
improving your interactions with incidents.
And when you hear one of those hallowed phrases that you know will mark the start of a great hallway track tale, after reading this guide, you'll be confident that after you've all pulled together to fix the outage and once the dust has settled, you'll know exactly what you and your team need to do to turn that incident on its head and harness all the lessons it has to teach you.
J. Paul Reed
DevOps consultant and retrospective researcher
San Francisco, CA
July 2017
Introduction
Once the transfer was complete, I checked all of the relevant logs; I connected to the instance via SSH and stepped through my checklist of things to verify before contacting the customer and closing out the support ticket. Everything went exactly as expected. The admin login worked, data existed in the MySQL tables, and the URL was accessible.
When I reached out to the customer, I let them know everything had gone smoothly. In fact, the backup and restore took less time than I had expected. Recent changes to the process had shortened the average maintenance window considerably. I included my personal phone number in my outreach to the customer so that they could contact me if they encountered any problems, especially since they would be logging in to use the system several hours earlier than I'd be back online; they were located in Eastern Europe, so would likely be using it within the next few hours.
Incident Detection
Within an hour my phone began blowing up. First it was an email notification (that I slept through). Then it was a series of push notifications tied to our ticketing system, followed almost immediately by an SMS from the customer. There was a problem.
After a few back-and-forth messages in the middle of the night from my mobile phone, I jumped out of bed to grab my laptop and begin investigating further. It turned out that while everything looked like it had worked as expected, the truth was that nearly a month's worth of data was missing. The customer could log in and there was data, but it wasn't up to date.
Incident Response
At this point I reached out to additional resources on my team to leverage someone with more experience and knowledge about the system. Customer data was missing, and we needed to recover and restore it as quickly as possible, if that was possible at all. All the ops engineers were paged, and we began sifting through logs and data looking for ways to restore the customer's data, as well as to begin to understand what had gone wrong.
Incident Remediation
Very quickly we made the horrifying discovery that the backup data that was used in the migration was out of date by several months. The migration process relies on backup files that are generated every 24 hours when left at the default setting (users, however, could configure them to run more frequently). We also found out that for some reason current data had not been backed up during those months. That helped to explain why the migration contained only old data. Ultimately, we were able to conclude that the current data was completely gone and impossible to retrieve.
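In hindsight, a simple pre-migration guard that verifies the age of the most recent backup file would have surfaced the stale data before any customer was affected. The sketch below is a hypothetical check, not part of the system described here; the backup path, the 24-hour threshold, and the script itself are illustrative assumptions.

    import os
    import sys
    import time

    BACKUP_DIR = "/var/backups/opencrm"   # assumed location, for illustration only
    MAX_AGE_SECONDS = 24 * 60 * 60        # backups are expected at least daily

    def newest_backup(path):
        """Return the most recently modified file in the backup directory."""
        files = [os.path.join(path, f) for f in os.listdir(path)]
        return max(files, key=os.path.getmtime) if files else None

    def verify_backup_freshness():
        latest = newest_backup(BACKUP_DIR)
        if latest is None:
            sys.exit("ABORT: no backup files found; do not start the migration.")
        age = time.time() - os.path.getmtime(latest)
        if age > MAX_AGE_SECONDS:
            sys.exit(f"ABORT: newest backup {latest} is {age / 3600:.1f} hours old.")
        print(f"OK: using backup {latest} ({age / 3600:.1f} hours old).")

    if __name__ == "__main__":
        verify_backup_freshness()

A check like this costs seconds and turns a silent failure in the backup pipeline into a loud, actionable one at the moment someone is paying attention.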
Collecting my thoughts on how I was going to explain this to the customer was terrifying. Being directly responsible for losing months' worth of the data that a customer relies on for their own business is a tough pill to swallow. When you've been in IT long enough, you learn to accept failures, data loss, and unexplainable anomalies. The stakes are raised when it impacts someone else and their livelihood. We all know it could happen, but hope it won't.
We offered an explanation of everything we knew regarding what had happened, financial compensation, and the sincerest of apologies. Thankfully the customer understood that mistakes happen and that we had done the best we could to restore their data. With any new technology there is an inherent risk to being an early adopter, and this specific customer understood that. Accidents like this are part of the trade-off for relying on emerging technology and services like those our little tech startup had built.
After many hours of investigation, discussion, and back and forth with the customer, it was time to head in to the office. I hadn't slept for longer than an hour before everything transpired. The result of my actions was by all accounts a worst-case scenario. I had only been with the company for a couple of months. The probability of being fired seemed high.
Incident Analysis
Once all of the engineers, the Ops team, the Product team, and our VP of Customer Development had arrived, our CEO came to me and said, "Let's talk about last night." Anxiously, I joined him and the others in the middle of our office, huddled together in a circle. I was then prompted with, "Tell us what happened."
I began describing everything that had taken place, including when the customer requested the migration, what time I began the process, when I was done, when I reached out to them, and when they let me know about the data loss. We started painting a picture of what had happened throughout the night in a mental timeline.
To cover my butt as much as possible, I was sure to include extra assurances that I had reviewed all logs after the process and verified that every step on my migration checklist was followed, and that there were never any indications of a problem. In fact, I was surprised by how quickly the whole process went.
Having performed migrations many times before, I had a pretty good idea of how long something like this should take given the size of the MySQL data. In my head, it should have taken about 30 minutes to complete. It actually took only about 10 minutes. I mentioned that I was surprised by that but knew that we had recently rolled out a few changes to the backup and restore process, so I attributed the speediness of the migration to this new feature.
I continued to let them know what time I reached out to the Ops team. Although time wasn't necessarily a huge pressure, finding the current data and getting it restored was starting to stretch my knowledge of the system. Not only was I relatively new to the team, but much about the system (how it works, where to find data, and more) wasn't generally shared outside the Engineering team.
Most of the system was architected by only a couple of people. They didn't intentionally hoard information, but they certainly didn't have time to document or explain every detail of the system, including where to look for problems and how to access all of it.
As I continued describing what had happened, my teammates started speaking up and adding more to the story. By this point in our mental timeline we each were digging around in separate areas of the system, searching for answers to support our theories regarding what had happened and how the system behaved. We had begun to divide and conquer, with frequent check-ins over G-chat to gain a larger understanding about the situation from each other.
I was asked how the conversation went when I reached out to the
customer. We discussed how many additional customers might be
affected by this, and how to reach out to them to inform them of a
possible bug in the migration process.
Several suggestions were thrown out to the Operations team about detecting something like this sooner. The engineers discussed adding new logging or monitoring mechanisms. The Product team suggested pausing the current sprint release so that we could prioritize this new work right away. Everyone, including the CEO, saw this as a learning opportunity, and we all walked away knowing more about the system and how we respond to failure.
In fact, we all learned quite a bit about what was really going on in our system. We also gained a much clearer picture of how we would respond to something like this. Being a small team, contacting each other and collaborating on the problem was just like any other day at the office. We each knew one another's cell phone numbers, emails, and G-chat handles. Still, we discovered that in situations like this someone from the Ops team should be pulled in right away, until access can be provided to more of the team and accurate documentation is made available to everyone. We were lucky that we could coordinate and reach each other quickly to get to the bottom of the problem.
As we concluded discussing what we had learned and what action items we had as takeaways, everyone turned and headed back to their desks. It wasn't until that moment that I realized I had never once been accused of anything. No one seemed agitated with me for the decisions I'd made and the actions I took. There was no blaming, shaming, or general animosity toward me. In fact, I felt an immense amount of empathy and care from my teammates. It was as though everyone recognized that they likely would have done the exact same thing I had.
Incident Readiness
The system was flawed, and now we knew what needed to be improved. Until we did so, the exact same thing was at risk of happening again. There wasn't just one thing that needed to be fixed. There were many things we learned and began to immediately improve. I became a much better troubleshooter and gained access to parts of the system where I could make a significant positive impact on recovery efforts moving forward.
For modern IT organizations, maintaining that line of reasoning and focus on improving the system as a whole is the difference between being a high-performing organization and a low-performing one. Those with a consistent effort toward continuous improvement along many vectors come out on top. Looking for ways to improve our understanding of our systems, as well as the way in which teams respond to inevitable failure, means becoming extremely responsive and adaptable. Knowing about and remediating a problem faster moves us closer to a real understanding of the state and behavior of our systems.
What would have happened if this latent failure of the automated
backup process in the system had lain dormant for longer than just
a few months? What if this had gone on for a year? What if it was
happening to more than just Open CRM instances on AWS? What if
we had lost data that could have taken down an entire company?
In order to answer those questions better, we will leverage the use of a post-incident review. This type of analytic exercise will be explored in depth in Chapter 8; you'll see how we know what an incident is, as well as when it is appropriate to perform an analysis.
As we'll learn in the coming chapters, old-view approaches to retrospective analysis of incidents have many flaws that inherently prevent us from learning more about our systems and how we can continuously improve them.
By following a new approach to post-incident reviews, we can make our systems much more stable and highly available to the growing number of people who have come to rely on the service 24 hours a day, every day of every year.
What's Next?
This short book sets out to explore why post-incident reviews are important and how you and your team can best execute them to continuously improve many aspects of both building resilient systems and responding to failure sooner.
Chapters 1 and 2 examine the current state of addressing failure in IT organizations and how old-school approaches have done little to help provide the right scenario for building highly available and reliable IT systems.
Chapter 3 points out the roles humans play in managing IT and our shift in thinking about their accountability and responsibility with regard to failure.
In Chapters 4 and 5 we will begin to set the context of what we mean by an incident and develop a deeper understanding of cause and effect in complex systems.
Chapter 6 begins to get us thinking about why these types of exercises are important and the value they provide as we head into a case study illustrating a brief service disruption and what a simple post-incident review might look like.
The remainder of the book (Chapters 7 through 10) discusses exactly how we can approach and execute a successful post-incident review, including resources that may help you begin preparing for your next IT problem. A case study helps to frame the value of these exercises from a management or leadership point of view.
We'll conclude in Chapter 11 by revisiting a few things and leaving you with advice as you begin your own journey toward learning from failure.
Acknowledgments
I'd like to give an extra special thank you to the many folks involved in the creation of this report.
The guidance and flexibility of my editors Brian Anderson, Virginia Wilson, Susan Conant, Kristen Brown, and Rachel Head was greatly appreciated and invaluable. Thank you to Matthew Boeckman, Aaron Aldrich, and Davis Godbout for early reviews, as well as Mark Imbriaco, Courtney Kissler, Andi Mann, John Allspaw, and Dave Zwieback for their amazing and valuable feedback during the technical review process. Thanks to Erica Morrison and John Paris for your wonderful firsthand stories to share with our readers.
Thank you to J. Paul Reed, who was the first presenter I saw at Velocity Santa Clara in 2014. His presentation "A Look at Looking in the Mirror" was my first personal exposure to many of the concepts I've grown passionate about and have shared in this report.
Special thanks to my coworkers at VictorOps and Standing Cloud for the experiences and lessons learned while being part of teams tasked with maintaining high availability and reliability. To those before me who have explored and shared many of these concepts, such as Sidney Dekker, Dr. Richard Cook, Mark Burgess, Samuel Arbesman, Dave Snowden, and L. David Marquet: your work and knowledge helped shape this report in more ways than I can express. Thank you so much for opening our eyes to a new and better way of operating and improving IT services.
I'd also like to thank John Willis for encouraging me to continue spreading the message of learning from failure in the ways I've outlined in this report. Changing the hearts and minds of those set in their old way of thinking and working was a challenge I wasn't sure I wanted to continue in late 2016. This report is a direct result of your pep talk in Nashville.
Last but not least, thank you to my family, friends, and especially my partner Stephanie for enduring the many late nights and weekends spent in isolation while I juggled a busy travel schedule and deadlines for this report. I'm so grateful for your patience and understanding. Thank you for everything.
CHAPTER 1
Broken Incentives and Initiatives
Much of what has been taught or seen in practice follows a command-and-control pattern. Sometimes referred to as the leader-follower structure, this is a remnant of a labor model that was successful when mankind's primary work was physical.1
In IT, as well as many other roles and industries, the primary output, and therefore the most important work we do, is cognitive. It's no wonder the leader-follower structure L. David Marquet describes in Turn the Ship Around! has failed when imposed on modern IT organizations. It isn't optimal for our type of work. It limits decision-making authority and provides no incentive for individuals to do their best work and excel in their roles and responsibilities. Initiative is effectively removed as everyone is stripped of the opportunity to utilize their imagination and skills. In other words, nobody ever performs at their full potential.2
This old-view approach to management, rooted in physical labor models, doesn't work for IT systems, and neither does the concept of control, despite our holding on to the notion that it is possible.
Control
Varying degrees of control depend on information and the scale or
resolution at which we are able to perceive it. Predictability and
interaction are the key necessary components of control.
We sometimes think we are in control because we either don't have or choose not to see the full picture.
Mark Burgess, In Search of Certainty
Consider the company culture you operate in, and construct your own way to get the conversation started. Framing the question in a way that encourages open discussion helps teams explore alternatives to their current methods. With a growth mindset, we can explore both the negative and the positive aspects of what transpired.
We must consider as much data representing both positive and negative impacts to recovery efforts, and from as many diverse and informative perspectives, as possible. Excellent data builds a clearer picture of the specifics of what happened and drives better theories and predictions about the systems moving forward, and how we can consistently improve them.
Observations that focus less on identifying a cause and fix of a problem and more on improving our understanding of state and behavior as it relates to all three key elements (people, process, and technology) lead to improved theory models regarding the system. This results in enhanced understanding and continuous improvement of those elements across all phases of an incident: detection, response, remediation, analysis, and readiness.
Failure can never be engineered out of a system. With each new bit that is added or removed, the system is being changed. Those changes are happening in a number of ways due to the vast interconnectedness and dependencies. No two systems are the same. In fact, the properties and state of a single system now are quite different from those of the same system even moments ago. It's in constant motion.
Working in IT today means being skilled at detecting problems, solving them, and multiplying the effects by making the solutions available throughout the organization. This creates a dynamic system of learning that allows us to understand mistakes and translate that understanding into actions that prevent those mistakes from recurring in the future (or at least lessen their impact, i.e., graceful degradation).
By learning as much as possible about the system and how it behaves, IT organizations can build out theories on their "normal." Teams will be better prepared and rehearsed to deal with each new problem that occurs, and the technology, process, and people aspects will be continuously improved upon.
4 Jennifer Davis and Katherine Daniels, Effective DevOps: Building a Culture of Collaboration, Affinity, and Tooling at Scale (O'Reilly, Kindle Edition), location 1379 of 9606.
We've all seen the same problems repeat themselves. Recurring small incidents, severe outages, and even data losses are stories many in IT can commiserate over in the halls of tech conferences and forums of Reddit. It's the nature of building complex systems. Failure happens. However, in many cases we fall into the common trap of repeating the same process over and over again, expecting the results to be different. We investigate and analyze problems using techniques that have been well established as best practices. We always feel like we can do it better. We think we are smarter than others, or maybe our previous selves, yet the same problems seem to continue to occur, and as systems grow, the frequency of the problems grows as well.
Attempts at preventing problems always seem to be an exercise in
futility. Teams become used to chaotic and reactionary responses to
IT problems, unaware that the way it was done in the past may no
longer apply to modern systems.
We have to change the way we approach the work. Sadly, in many cases, we don't have the authority to bring about change in the way we do our jobs. Tools and process decisions are made from the top. Directed by senior leaders, we fall victim to routine and the fact that no one has ever stopped to ask if what we are doing is actually helping.
Traditional techniques of post-incident analysis have had minimal success in providing greater availability and reliability of IT services. In Chapter 6, we will explore a fictional case study of an incident to illustrate an example service disruption and post-incident review, one that more closely represents a systematic approach to learning from failure in order to influence the future.
What's Broken?
When previous attempts to improve our systems through retrospective analysis have not yielded IT services with higher uptime, it can be tempting to give up on the practice altogether. In fact, for many, these exercises turn out to be nothing more than boxes for someone higher up the management ladder to check. Corrective actions are identified, tickets are filed, fixes are put in place. Yet problems continue to present themselves in unique but common ways, and the disruptions or outages continue to happen more frequently and with larger impact as our systems continue to grow.
This chapter will explore one old-school approach, root cause analysis (RCA), and show why it is not the best choice for post-incident analysis.
This brings us to the second problem with this approach: our line of questioning led us to a human. By asking "why" long enough, we eventually concluded that the cause of the problem was human error and that more training or formal processes are necessary to prevent this problem from occurring again in the future.
This is a common flaw of RCAs. As operators of complex systems, it is easy for us to eventually pin failures on people. Going back to the lost data example in the Introduction, I was the person who pushed the buttons, ran the commands, and solely operated the migration for our customer. I was new to the job and my Linux skills weren't as sharp as they could have been. Obviously there were things that I needed to be trained on, and I could have been blamed for the data loss. But if we switch our perspective on the situation, I in fact discovered a major flaw in the system, preventing future similar incidents. There is always a bright side, and it's ripe with learning opportunities.
The emotional pain that came from that event was something I'll never forget. But you've likely been in this situation before as well. It is just part of the natural order of IT. We've been dealing with problems our entire professional careers. Memes circulate on the web joking about the inevitable "Have you rebooted?" response trotted out by nearly every company's help desk team. We've always accepted that random problems occur and that sometimes we think we've identified the cause, only to discover that either the fix we put in place caused trouble somewhere else or a new and more interesting problem has surfaced, rendering all the time and energy we put into our previous fix nearly useless. It's a constant cat-and-mouse game. Always reactionary. Always on-call. Always waiting for things to break, only for us to slip in a quick fix to buy us time to address our technical debt. It's a bad cycle we've gotten ourselves into.
We've been looking at post-incident analysis the wrong way for quite some time. Focusing on avoiding problems distracts us from seeking to improve the system as a whole.
Change
What makes a system a system? Change. This concept is central to
everything in this book. Our systems are constantly changing, and
we have to be adaptable and responsive to that change rather than
rigid. We have to alter the way we look at all of this.
If you're like me, it's very easy to read a case study or watch a presentation from companies such as Amazon, Netflix, or Etsy and be suspicious of the tactics they are suggesting. It's one thing to learn of an inspirational new approach to solving common IT problems and accept that it's how we should model our efforts. It's something quite different to actually implement such an approach in our own companies.
But we also recognize that if we don't change something we will be forever caught in a repeating cycle of reactionary efforts to deal with IT problems when they happen. And we all know they will happen. It's just a matter of time. So, it's fair for us to suspect that things will likely continue to get worse if some sort of change isn't effected.
You're not alone in thinking that a better way exists. While stories from unicorn companies like those mentioned above may seem unbelievably simple or too unique to their own company culture, the core of their message applies to all organizations and industries. For those of us in IT, switching up our approach and building the muscle memory to learn from failure is our way to a better world, and much of it lies in a well-executed post-incident analysis.
In the next four chapters, we'll explore some key factors that you need to consider in a post-incident analysis to make it a successful systems thinking approach.
Celebrate Discovery
When accidents and failures occur, instead of looking for human
error we should look for how we can redesign the system to prevent
these incidents from happening again.1
A company that validates and embraces the human elements and considerations when incidents and accidents occur learns more from a post-incident review than one that punishes people for actions, omissions, or decisions taken. Celebrating transparency and learning opportunities shifts the culture toward learning from the human elements. With that said, gross negligence and harmful acts must not be ignored or tolerated.
1 Gene Kim, Jez Humble, Patrick Debois, and John Willis, The DevOps Handbook (IT Revolution), 40.
Human error should never be a cause.
Transparency
Nurturing discovery through praise will encourage transparency.
Benefits begin to emerge as a result of showing the work that is
being done. A stronger sense of accountability and responsibility
starts to form.
While blaming individuals is clearly counterproductive, it is important to seek out and identify knowledge or skill gaps that may have contributed to the undesirable outcome so that they can be considered broadly within the organization.
Cynefin
The Cynefin (pronounced kun-EV-in) complexity framework is one way to describe the true nature of a system, as well as appropriate approaches to managing systems. This framework first differentiates between ordered and unordered systems. If knowledge exists from previous experience and can be leveraged, we categorize the system as "ordered." If the problem has not been experienced before, we treat it as an "unordered" system.
Cynefin, a Welsh word for "habitat," has been popularized within the DevOps community as a vehicle for helping us to analyze behavior and decide how to act or make sense of the nature of complex systems. Broad system categories and examples include:
Ordered
Complicated systems, such as a vehicle. Complicated systems
can be broken down and understood given enough time and
effort. The system is knowable.
Unordered
Complex systems, such as traffic on a busy highway. Complex
systems are unpredictable, emergent, and only understood in
retrospect. The system is unknowable.
As Figure 4-1 shows, we can then go one step further and break down the categorization of systems into five distinct domains that provide a bit more description and insight into the appropriate behaviors and methods of interaction.
We can conceptualize the domains as follows:
Simple: known knowns
Complicated: known unknowns
Complex: unknown unknowns
Chaotic: unknowable unknowns
Disorder: yet to be determined
1 Greg Brougham, The Cynefin Mini-Book: An Introduction to Complexity and the Cynefin Framework (C4Media/InfoQ), 7.
Other aspects of the Cynefin complexity framework help us to see the emergent behavior of complex systems, how known best practices apply only to simple systems, and that when dealing with a chaotic domain, your best first step is to act, then to sense, and finally to probe for the correct path out of the chaotic realm.
Evaluation Models
Choosing a model helps us to determine what to look for as we seek
to understand cause and effect and, as part of our systems thinking
approach, suggests ways to explain the relationships between the
many factors contributing to the problem. Three kinds of model are
regularly applied to post-incident reviews:3
Sequence of events model
Suggests that one event causes another, which causes another,
and so on, much like a set of dominoes where a single event
kicks off a series of events leading to the failure.
2 Sidney Dekker, The Field Guide to Understanding Human Error (CRC Press), 73.
3 Ibid., 81.
4 Ibid.
5 Ibid., 92.
You've likely picked up on a common thread regarding sequence of events modeling (a.k.a. root cause analysis) in this book. These exercises and subsequent reports are still prevalent throughout the IT world. In many cases, organizations do submit that there was more than one cause to a system failure and set out to identify the multitude of factors that played a role. Out of habit, teams often falsely identify these as root causes of the incident, correctly pointing out that many things conspired at once to cause the problem, but unintentionally sending the misleading signal that a single element was the sole true reason for the failure. Most importantly, by following this path we miss an opportunity to learn.
CHAPTER 5
Continuous Improvement
Creating Flow
Identifying what to improve can often feel like an exercise in who has the most authority. With so many areas of effort and concern in an IT organization, the decision of where to begin may simply come from those in senior positions rather than from the individuals closest to the work. This isn't wholly unjustified: practitioners of the work itself may struggle to see the forest for the trees. Without the presence of generalists or consultants trained in seeing the big picture, identifying the work that brings the most value can be challenging.
Understanding the entire process, from idea to customer usage,
requires a close examination of every step in the lifecycle. This can
enable inefficiency and friction to be identified and examined
further.
Creating a value stream map, a method for visualizing and analyzing the current state of how value moves through the system, is a common practice to assist in understanding this flow. Mapping the entire process and the time between handoffs not only helps all stakeholders build a much larger and clearer picture, but begins to surface areas of possible improvement. Finding ways to trim the time it takes to move through the value stream paves the path toward continuous delivery and highly available systems.
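As a rough illustration (the stage names and timings below are invented, not taken from any real value stream), even a crude map of stages and their elapsed times makes the biggest source of delay obvious:

    from datetime import timedelta

    # Hypothetical value stream: each stage with hands-on work time and wait time.
    stages = [
        ("Idea approved",        timedelta(hours=2),  timedelta(days=5)),
        ("Code written",         timedelta(days=2),   timedelta(days=1)),
        ("Code reviewed",        timedelta(hours=3),  timedelta(days=3)),
        ("Deployed to staging",  timedelta(hours=1),  timedelta(days=2)),
        ("Released to customer", timedelta(hours=1),  timedelta(hours=4)),
    ]

    total = sum((work + wait for _, work, wait in stages), timedelta())
    print(f"Total lead time: {total}")

    # The stage with the longest wait is the first candidate for improvement.
    name, work, wait = max(stages, key=lambda s: s[2])
    print(f"Largest delay: '{name}' spends {wait} waiting vs {work} of actual work.")

Most of the lead time in a map like this is usually waiting between handoffs rather than work, which is exactly the friction the next section describes as waste.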
Eliminating Waste
Post-incident reviews work in much the same way. When we are responsible for maintaining the availability and reliability of a service, anything that prevents us from knowing about a problem is friction in the system. Likewise, delays in efforts to remediate and resolve service disruptions are considered waste.
Identifying impediments, such as delays in contacting the right individual during the detection phase of the incident timeline, sets the stage for a clear direction on not only what needs to be improved but where to begin. Regardless of where friction exists in the overall continuous delivery lifecycle, everything that makes it into a production environment and is relied on by others started in the mind of an individual or a team. Focusing on where the most value can be gained, early and often in the entire process, shows you your path of continuous improvement.
Feedback Loops
The key to successful continuous improvement is reducing the time it takes to learn. Delaying the time it takes to observe the outcomes of efforts (or worse, not making an effort to receive feedback about the current conditions) means we don't know what's working and what's not. We don't know what direction our systems are moving in. As a result, we lose touch with the current state of our three key elements: people, process, and technology.
Think about how easy it is for assumptions to exist about things as simple as monitoring and logging, on-call rotations, or even the configuration of paging and escalation policies. The case study in Chapter 8 will illustrate a way in which we can stay in touch with those elements and more easily avoid blind spots in the bigger picture.
Post-incident reviews help to eliminate assumptions and increase
our confidence, all while lessening the time and effort involved in
obtaining feedback and accelerating the improvement efforts.
Retrospectives
Retrospective analysis is a common practice in many IT organizations. Agile best practices suggest performing retros after each development sprint to understand in detail what worked, what didn't, and what should be changed. Conditions and priorities shift; teams (and the entire organization) aren't locked into one specific way of working. Likewise, they avoid committing to projects that may turn out to be no longer useful or no longer on the path to providing value to the business.
It's important to note that these retrospectives include the things that worked, as well as what didn't quite hit the mark.
Learning Reviews
In an attempt to place emphasis on the importance of understanding the good and the bad in a post-incident analysis, many teams refer to the exercise as simply a "learning review." With the agenda spelled out in the name, it is clear right from the beginning that the purpose of taking the time to retrospectively analyze information regarding a service disruption is to learn. Understanding as much about the system as possible is the best approach to building and operating an IT system.
Regardless of what you call these reviews, it is important to clarify their intention and purpose. A lot of value comes from a well-executed post-incident review. Not only are areas of improvement identified and prioritized internally, but respect from customers and the industry as a whole emerges due to the transparency and altruism of sharing useful information publicly.
Other names you may hear for the same exercise include the incident debriefing, the after action report, and the rapid improvement event.
Objectives
It is important to clarify the objective of the analysis as an exercise in learning. It is common for many to approach these exercises much like troubleshooting failures in simple systems, such as manufacturing and assembly systems, and to focus their efforts on identifying the cause.
In linear, simple systems, cause and effect are obvious and the relationships between components of the system are clear. Failures can be linked directly back to a single point of failure. As made clear by the Cynefin complexity framework, this in no way describes the systems we build and operate in IT. Complexity forces us to take a new approach to operating and supporting systems, one that looks more toward what can be learned rather than what broke and how it can be prevented from happening ever again.
Depending on the scope and scale of issues that may contribute to IT problems, information can be gained from a variety of areas in not only the IT org, but the entire company. Learning becomes the differentiating factor between a high-performing team and a low-performing team.
Shortening the feedback loops to learn new and important things about the nature of the system and how it behaves under certain circumstances is the greatest method of improving the availability of the service.
To learn what will contribute the most to improvement efforts, we begin by asking two questions, questions designed to get right to the heart of shortening critical feedback loops. Answering the first two questions leads to the third and most important of the group.
Multiple areas of the incident lifecycle will prove to be excellent places to focus continuous improvement efforts. The return on investment for IT orgs is the ability to recover from outages much sooner. The 2017 State of DevOps Report reported that high-performing organizations had a mean time to recover (MTTR) from downtime 96 times faster than that of low performers.1 Scrutinizing the details related to recovery helps to expose methods that can help achieve that improvement.
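MTTR itself is a simple average, which makes it easy to track from your own incident records. The sketch below assumes a hypothetical list of incident start and recovery timestamps; it is not tied to any particular monitoring or reporting tool:

    from datetime import datetime
    from statistics import mean

    # Hypothetical incident records: (detected_at, recovered_at).
    incidents = [
        (datetime(2017, 5, 2, 14, 10), datetime(2017, 5, 2, 14, 52)),
        (datetime(2017, 5, 9, 3, 30),  datetime(2017, 5, 9, 5, 15)),
        (datetime(2017, 5, 20, 22, 5), datetime(2017, 5, 20, 22, 41)),
    ]

    # Duration of each incident in minutes, then the mean across all of them.
    durations = [(end - start).total_seconds() / 60 for start, end in incidents]
    mttr_minutes = mean(durations)
    print(f"MTTR over {len(incidents)} incidents: {mttr_minutes:.1f} minutes")

Tracking the same figure per phase (time to detect, time to respond, time to remediate) shows which part of the lifecycle deserves attention first.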
CHAPTER 6
Outage: A Case Study Examining
the Unique Phases of an Incident
Day One
Detection
Around 4 p.m. Gary, a member of the support team for a growing company, begins receiving notifications from Twitter that the company is being mentioned more than usual. After wrapping up responses to a few support cases, Gary logs into Twitter and sees that several users are complaining that they are not able to access the service's login page.
Gary then reaches out to Cathy, who happens to be the first engineer he sees online and logged into the company chat tool. She says she'll take a look and reach out to others on the team if she can't figure out what's going on and fix it. Gary then files a ticket in the customer support system for follow-up and reporting.
Note that Gary was the first to know of the problem internally, but that external users or customers first detected the disruption, and a sense of urgency did not set in until Twitter notifications became alarmingly frequent. Still, responding to support cases was of higher priority to Gary at that moment, extending the elapsed time of the detection phase of the incident.
How many ideas for improvement can you spot in this scenario?
Response
Cathy attempts to verify the complaint by accessing the login page herself. Sure enough, it's throwing an error. She then proceeds to figure out which systems are affected and how to get access to them. After several minutes of searching her inbox she locates a Google Document explaining methods to connect to the server hosting the site, and is then able to make progress.
Remediation
Upon logging in to the server, Cathy's first action is to view all running processes on the host. From her terminal, she types:

    cathy#: top

to display the running processes and how many resources are being used. Right away she spots that there is a service running that she isn't familiar with, and it's taking 92% of the CPU. Unfamiliar with this process, she's hesitant to terminate it.
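Before deciding whether an unfamiliar process is safe to kill, a responder can gather basic facts about it. The following sketch uses Python's psutil library to describe the most CPU-hungry process; the library choice and the script itself are illustrative assumptions, not something the team in this story actually used:

    import datetime
    import time
    import psutil

    def describe_busiest_process():
        """Print identifying details for the process currently using the most CPU."""
        procs = list(psutil.process_iter())
        for p in procs:
            try:
                p.cpu_percent(None)          # prime the per-process CPU counter
            except psutil.NoSuchProcess:
                pass
        time.sleep(1.0)                      # sample over a one-second window

        usage = {}
        for p in procs:
            try:
                usage[p] = p.cpu_percent(None)
            except psutil.NoSuchProcess:
                pass

        busiest = max(usage, key=usage.get)
        started = datetime.datetime.fromtimestamp(busiest.create_time())
        print(f"PID {busiest.pid}: {busiest.name()} using {usage[busiest]:.0f}% CPU")
        print(f"  owner:         {busiest.username()}")
        print(f"  command line:  {' '.join(busiest.cmdline())}")
        print(f"  running since: {started}")

    if __name__ == "__main__":
        describe_busiest_process()

Knowing the owner, full command line, and start time turns "something unfamiliar is at 92%" into a conversation the rest of the team can act on, even if the decision is still to leave the process alone.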
What would you have done had you discovered an unknown service running on the host? Would you kill it immediately or hesitate like Cathy?
Thankfully Greg was able to help restore service, but at what expense? A more severe problem might have forced Greg to miss his daughter's birthday party entirely. How humane are your on-call and response expectations?
How mindful were the participants of the external users or customers throughout this phase? Often, customers are constantly refreshing status pages and Twitter feeds for an update. Were transparent and frequent updates to the end users made a priority?
This is a great opportunity for a discussion around the question "What does it look like when this goes well?" as first suggested in Chapter 1.
Day Two
Analysis
Cathy, Greg, Gary, and several additional members from the engineering and support teams huddle around a conference table at 10 a.m., with a number of managers hovering near the door on their way to the next meeting.
Greg begins by asking Cathy to describe what happened. Stepping the group through exactly what transpired from her own perspective, Cathy mentions how she was first alerted to the problem by Gary in support. She then goes on to explain how it took a while for her to figure out how to access the right server. Her first step after accessing the system was to check the running processes. Upon doing so she discovered an unknown service running, but was afraid to kill it as a remediation step. She wanted a second opinion, but explains that again it took her some time to track down the phone number she needed to get in touch with Greg.
Several engineers chime in with their opinions on what the service was and whether it was safe to stop or not. Greg then adds that those were his exact first steps as well, and that he didn't hesitate to kill a process he wasn't familiar with. Cathy asks, "Did you run top to see
Learnings
1. We didn't detect this on our own. Customers detected the outage.
2. We don't have a clear path to responding to incidents. Support contacted Cathy as a result of chance, not process.
3. It's not common knowledge how to connect to the critical systems behind the service we provide.
4. Access to systems for the first responder was clumsy and confusing.
5. We aren't sure who is responsible for updating stakeholders and/or the status page.
6. A yet-to-be-identified process was found running on a critical server.
7. Pulling in other team members was difficult without instant access to their contact information.
8. We don't have a dedicated area for the conversations related to the remediation efforts. Some conversations were held over the phone and some took place in Slack.
9. Someone other than Greg should have been next on the escalation path so he could enjoy time with his family.
Armed with an extensive list of things learned about the system, the team then begins to discuss actionable tasks and next steps. Several suggestions are made and captured in the Google Doc:
Action Items
1. Add additional monitoring of the host to detect potential or imminent problems.
2. Set up an on-call rotation so everyone knows who to contact if something like this happens again.
3. Build and make widely available documentation on how to get access to systems to begin investigating.
4. Ensure that all responders have the necessary access and privileges to make an impact during remediation.
5. Establish responsibility and process surrounding who is to maintain the status page.
6. Define escalation policies and alerting methods for engineers (a minimal sketch of such a policy follows this list).
7. Build and make widely available contact information for engineers who may be called in to assist during remediation efforts.
8. Establish a specific communication client and channel for all conversations related to remediation efforts, and try to be explicit and verbose about what you are seeing and doing. Attempt to think out loud.
9. Come up with a way for engineers to communicate their availability to assist in remediation efforts to their team.
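Action items 2, 6, and 7 all amount to writing the escalation path down somewhere explicit rather than leaving it in people's heads. A minimal sketch of that idea follows; the names, contact details, and timeouts are invented for illustration and are not drawn from the case study:

    from dataclasses import dataclass

    @dataclass
    class EscalationStep:
        name: str
        contact: str          # phone, email, or chat handle
        wait_minutes: int     # how long to wait for an acknowledgment

    # Hypothetical policy for the login service.
    LOGIN_SERVICE_POLICY = [
        EscalationStep("Primary on-call (Cathy)",  "+1-555-0100", 10),
        EscalationStep("Secondary on-call (Greg)", "+1-555-0101", 10),
        EscalationStep("Engineering manager",      "+1-555-0102", 15),
    ]

    def escalate(policy, acknowledged):
        """Walk the policy in order until someone acknowledges the page."""
        for step in policy:
            print(f"Paging {step.name} at {step.contact}; "
                  f"waiting up to {step.wait_minutes} minutes.")
            if acknowledged(step):
                print(f"{step.name} acknowledged; escalation stops here.")
                return step
        print("Nobody acknowledged; notify leadership.")
        return None

    # Example run: pretend the secondary on-call is the first to respond.
    escalate(LOGIN_SERVICE_POLICY, acknowledged=lambda step: "Greg" in step.name)

Even this much structure answers the two questions that slowed the team down: who gets contacted first, and when it is appropriate to move on to the next person.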
Recap
This example illustrated a fairly brief Sev2 incident in which the time to resolve was relatively low. Lengthier outages may result in exercises that go on longer. The total time of this exercise was just under 30 minutes, and it resulted in nine extremely beneficial tasks that will make a huge impact on the uptime of the site and the overall reliability of the service the business is providing.
There are two main philosophical approaches to both what the analysis is and the value it sets out to provide organizations. For many, its purpose is to document in great detail what took place during the response to an IT problem (self-diagnosis). For others, it is a means to understand the cause of a problem so that fixes can be applied to various aspects of process, technology, and people (self-improvement).
Regardless of the approach you take, the reason we perform these exercises is to learn as much about our systems as possible and uncover areas of improvement in a variety of places. Identifying a root cause is the reason most commonly claimed for why analysis is important and helpful. However, this approach is shortsighted.
expected to, but the system as a whole did not suffer a disruption of service. All signs pointed to a healthy working system. However, as we learned from the remediation efforts and post-incident review, there were latent problems in the migration process that caused a customer to lose data. Despite not directly disturbing the availability of our system, it certainly impacted an aspect of reliability. The system wasn't reliable if it couldn't perform the tasks related to its advertised functionality, namely seamless migration of applications between cloud providers.
Another situation might be the failure of a database in a staging environment. Despite not harming the production environment or real customers, whatever problem exists in staging will likely make its way to production if not caught and addressed. Discovering and analyzing such non-production disruptions is important as well. Often an incident in a development or staging environment is a precursor to that same condition occurring in the production environment.
Priority
To categorize types of incidents, their impact on the system or service, and how they should be actively addressed, a priority level is assigned. One common categorization of those levels is:
Information
Warning
Critical
Severe incidents are assigned the critical priority, while minor problems or failures that have well-established redundancy are typically identified with a warning priority. Incidents that are unactionable or false alarms have the lowest priority and should be identified simply as information to be included in the post-incident review.
Severity
Categorization of severity levels often varies depending on industry and company culture. A common example is a numbered scale such as Sev1 through Sev3, in which Sev1 is reserved for the most severe, widest-impact outages.
Detection
Knowing about a problem is the initial step of the incident lifecycle. When it comes to maintaining service availability, being aware of problems quickly is essential. Monitoring and anomaly detection tools are the means by which service owners or IT professionals keep track of a system's health and availability. Regularly adjusting monitoring thresholds and objectives ensures that the "time to know" phase of an incident is continuously improved upon, decreasing the overall impact of a problem. A common approach to monitoring is to examine conditions and preferred states or values. When thresholds are exceeded, email alerts are triggered. However, a detection and alerting process that requires an individual to read and interpret an email to determine if action needs to be taken is not only difficult to manage, it's impossible to scale. Humans should be notified effectively, but only when they need to take action.
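The idea of notifying a human only when action is required can be expressed as a small routing rule. The following sketch is hypothetical: the metric name, the thresholds, and the page_on_call and log_event helpers are stand-ins for whatever your monitoring stack actually provides.

    WARN_THRESHOLD = 0.80      # 80% disk used: record it, nobody is woken up
    CRITICAL_THRESHOLD = 0.95  # 95% disk used: a human must act now

    def log_event(message):
        print(f"[log] {message}")            # stand-in for your event log

    def page_on_call(message):
        print(f"[page] {message}")           # stand-in for your paging tool

    def evaluate(metric_name, value):
        """Route a metric sample: ignore it, log it, or page a human."""
        if value >= CRITICAL_THRESHOLD:
            page_on_call(f"{metric_name} at {value:.0%}: action required")
        elif value >= WARN_THRESHOLD:
            log_event(f"{metric_name} at {value:.0%}: watch, no action needed yet")
        # Below the warning threshold nothing is recorded as an incident at all.

    evaluate("disk_used_fraction", 0.97)   # pages the on-call engineer
    evaluate("disk_used_fraction", 0.83)   # only logged for later review

The point is not the specific thresholds but the routing: anything that does not require action stays out of a human's inbox, and everything that pages someone carries enough context to act on.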
Response
Once a problem is known, the next critical step of an incident's lifecycle is the response phase. This phase typically accounts for nearly three-quarters of the entire lifecycle.2 Establishing the severity and priority of the problem, followed by investigation and identification, helps teams formulate the appropriate response and remediation steps. Consistent, well-defined response tactics go a long way toward reducing the impact of a service disruption.
Responding to problems of varying degree is not random or rare work. It is something that is done all the time. It is this responsiveness that is the source of reliability.
Key steps within the response phase include triage, investigation, and identification.
Remediation
We don't know what we don't know. If detection and response efforts are chaotic and undefined, the resulting remediation phase will be just as unsuccessful. In many situations, remediation may start with filing a ticket as part of established procedures. Without a sense of urgency, recovering from a service disruption as quickly as possible is not made a priority. Knowledge is not shared, improvements to the system as a whole are not made, and teams find themselves in a break/fix cycle. This chaotic firefighting is no way to approach remediation.
Readiness
Organizations have gone to great lengths to try to predict and prevent disruptions to service. In simple systems where causation and correlation are obvious, outages may be reduced by looking for known patterns that have previously led to failure. However, in complex systems such obvious relationships and patterns are only understood in hindsight. As a result, mature organizations understand that prediction and prevention of service disruptions is practically impossible. This is not to say that efforts should not be made to identify patterns and act accordingly. However, relying heavily on spotting and avoiding problems before they happen is not a viable long-term or scalable solution.
Instead, teams begin to take more of a readiness approach rather
than concentrating on prevention. Efforts become focused more
on reviewing documentation, response processes, and metrics in a
day-to-day context.
Signs that an organizational culture of continuous improvement is
gaining traction include updating out-of-date documentation or
resources as they are found. Providing the team and individuals
with the space and resources to improve methods of collaborating
over important data and conversations brings the greatest value in
terms of readiness and, as a result, leads to highly resilient systems
with greater availability.
As teams and organizations mature, crew formation and dissolution are considered the top priority in responding to, addressing, and resolving incidents. Frequent rehearsal of team formation means teams are well practiced when problems inevitably emerge. Intentionally creating service disruptions in a safe environment assists in building muscle memory around how to respond to problems. This provides great metrics to review and analyze for further improvements to the various phases of an incident's lifecycle. Creating failover systems, tuning caching services, rewriting small parts of code: all are informed by analysis and set the team up for success in bigger ways. Service disruptions cannot be prevented, so the only logical path toward providing highly available systems is for teams to collaborate and continuously learn and improve the way they respond to incidents.
CHAPTER 9
Conducting a
Post-Incident Review
Who
Having many diverse perspectives on what took place during response and remediation efforts helps to bring high-value improvements to the surface. Rather than focusing simply on identifying what went wrong and targeting that as what should be fixed, we now understand that there are many contributing factors to an incident, and avoiding opportunities to discuss and improve them all but guarantees a scenario where engineers are in a constant break/fix situation, chaotically reacting to service disruptions. Focusing the perspectives and efforts toward making slight but constant improvements to the entire incident lifecycle provides the greatest gains.
Essential participants in the post-incident review include all of the people involved in decisions that may have contributed to the problem or recovery efforts:
The Facilitator
In many cases, leveraging an objective third-party facilitator can provide several key benefits. Including someone who wasn't directly involved in the incident removes opportunities for human bias to creep into the discussion.
What
Now that we have gathered all parties to the exercise, the tone and
mission should be established. First and foremost, we are here for
one reason only: to learn.
Yes, understanding the causes of problems in our systems is important. However, focusing solely on the cause of a problem misses a large opportunity to explore ways in which systems can be designed to be more adaptable. More important than identifying what may have caused a problem is learning as much as possible about our systems and how they behave under certain conditions. In addition to that, we want to scrutinize the way in which teams form during incident response. Identifying and analyzing data points regarding these areas won't necessarily bring you to a root cause of the problem, but it will make your system much more known, and a greater and more in-depth understanding of the system as a whole is far more valuable to the business than identifying the root cause of any one problem.
In addition to establishing a space to learn, post-incident reviews should be considered an environment in which all information should be made available. No engineers should be held responsible for their role in any phase of the incident lifecycle. In fact, we should reward those who surface relevant information and flaws within our systems, incentivizing our engineers to provide as much detail as possible. When engineers make mistakes but feel safe to give exhaustive details about what took place, they prove to be an invaluable treasure chest of knowledge. Collectively, our engineers know quite a lot about many aspects of the system. When they feel safe to share that information, a deeper understanding of the system as a whole is transferred to the entire team or organization. Blaming, shaming, or demoting anyone involved in an incident is the surest way for that deeper understanding to not take place. Encourage engineers to become experts in areas where they have made mistakes previously. Educating the rest of the team or organization on how not to make similar mistakes in the future is great for team culture as well as knowledge transfer.
When
Analyses conducted too long after a service disruption are of little value to the overall mission of a post-incident review. Details of what took place, memories of actions taken, and critical elements of the conversations will be lost forever. Performing the review exercise as quickly as possible ensures that the maximum amount of relevant detail is still fresh and can be captured.
Where
Team members don't necessarily work in the same office, or even the same time zone. Meeting in person is worth taking advantage of when possible, but it is not essential. In fact, virtual conference calls and meetings can pull more diverse perspectives into the discussion remotely. This in turn helps avoid groupthink pitfalls and provides more opportunities for genuine analysis of not only what went wrong, but how well teams responded.
How
Once we have gathered the people involved and established our intent, it's time to begin discussing what took place.
Words are how we think; stories are how we link.
Christina Baldwin
Establish a Timeline
The facilitator may begin by first asking, "When did we know about this problem?" This helps us to begin constructing a timeline. Establishing when we first knew about the problem provides a starting point. From there, we can begin describing what we know.
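One lightweight way to capture the timeline as it is reconstructed is a simple structured list that can be sorted and annotated during the review. The following sketch is only an illustration of that idea, not a prescribed format; the field names and sample events are hypothetical.

from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class TimelineEntry:
    """A single event in the incident timeline (illustrative structure)."""
    timestamp: datetime    # when the event occurred
    actor: str             # person, bot, or system involved
    description: str       # what was observed or done
    phase: str             # "detection", "response", or "remediation"
    impact: str            # "positive", "negative", or "neutral" toward recovery

timeline = [
    TimelineEntry(datetime(2017, 6, 5, 3, 12, tzinfo=timezone.utc), "monitoring",
                  "Alert fired: API error rate above threshold", "detection", "positive"),
    TimelineEntry(datetime(2017, 6, 5, 3, 20, tzinfo=timezone.utc), "on-call engineer",
                  "Acknowledged the page and began triage in chat", "response", "positive"),
]

# Keep the entries in chronological order so the group can walk them in sequence.
timeline.sort(key=lambda entry: entry.timestamp)

Keeping even this minimal structure makes it easy to compute phase durations later and to mark which actions helped or hurt recovery.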
Human Interactions
As engineers step through the process of investigating, identifying, and then working to solve the problem, conversations are happening among all who are involved. Engineers sharing in great detail exactly what they were thinking and doing, and the results of those actions, helps others involved in the firefight build a shared context and awareness about the incident. Capturing the dialogue will help spot where some engineers may need to improve their communication or team skills.
Remediation Tasks
Included in those conversations should be fairly detailed documentation describing exactly what happened throughout the timeline. This may include specific commands that were run (copied and pasted into chat), or it could be as simple as engineers describing which commands they were using and on which systems. These descriptions teach others how to perform the same type of investigation. Not only will others learn something from discussing these tasks, but the engineers themselves can examine whether they took the most efficient route during their investigation. Perhaps there are better ways of querying systems. You may never know if you don't openly discuss how the work is accomplished.
ChatOps
One method teams have employed to facilitate not only the conversation and remediation efforts but also the forthcoming post-incident review is the use of ChatOps. (For more on this subject, see my report ChatOps: Managing Operations in Group Chat, from O'Reilly.)
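As a rough illustration of why ChatOps helps the later review, consider a chat command handler that runs an approved diagnostic and posts both the command and its output back into the channel; the record the post-incident review needs is created as a side effect of doing the work. The bot, command names, and channel below are hypothetical, and a real deployment would call an actual chat API rather than print.

import subprocess
from datetime import datetime, timezone

def post_to_channel(channel, message):
    # Stand-in for a real chat integration (Slack, Mattermost, etc.).
    print(f"#{channel}: {message}")

def handle_chat_command(channel, user, command):
    """Run an allowed diagnostic and echo the command plus its output
    back to the channel, leaving a timestamped record for later review."""
    allowed = {
        "disk": ["df", "-h"],
        "load": ["uptime"],
    }
    if command not in allowed:
        post_to_channel(channel, f"Unknown command: {command}")
        return
    result = subprocess.run(allowed[command], capture_output=True, text=True)
    stamp = datetime.now(timezone.utc).isoformat()
    post_to_channel(channel, f"[{stamp}] {user} ran '{command}':\n{result.stdout}")

handle_chat_command("incident-42", "alice", "disk")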
Metrics
Not all tricks and commands used during a firefight are suitable for
chat or available via chatbots and scripts, but describing what was
done from the engineer's local terminal at the very least helps to
share more with the team about how to diagnose and recover from
problems. Incidents become teaching opportunities as engineers
step through the incident lifecycle.
One example of this is the querying and retrieval of relevant metrics
during the response to an incident. First responders typically begin
their triaging process by exploring a number of key metrics such as
time-series data and dashboards. The collection of time-series data is extremely easy and cost-effective, and even if no one watches it in real time, having it available during response and again during the review pays off.
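For example, pulling the error rate for the incident window out of a metrics store is usually a single API call that can be attached to the review. The sketch below assumes a Prometheus-style query_range endpoint and a made-up query; adapt both to whatever your monitoring stack actually exposes.

import requests

METRICS_URL = "http://prometheus.example.com:9090"  # hypothetical metrics server

def error_rate_during_incident(start_epoch, end_epoch, step="60s"):
    """Fetch a time series of the HTTP 5xx rate for the incident window."""
    resp = requests.get(
        f"{METRICS_URL}/api/v1/query_range",
        params={
            "query": 'sum(rate(http_requests_total{status=~"5.."}[5m]))',
            "start": start_epoch,
            "end": end_epoch,
            "step": step,
        },
        timeout=10,
    )
    resp.raise_for_status()
    # Each series contains [timestamp, value] pairs to attach to the timeline.
    return resp.json()["data"]["result"]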
Real Numbers
Cost of downtime = Deployment frequency × Change failure rate × Mean time to recover × Hourly cost of outage
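To make the formula concrete, here is a quick calculation with entirely made-up numbers:

deploys_per_month = 40            # deployment frequency (hypothetical)
change_failure_rate = 0.05        # 5% of changes lead to an incident
mean_time_to_recover_hours = 1.5
hourly_cost_of_outage = 8000      # dollars per hour of downtime (hypothetical)

cost_of_downtime = (deploys_per_month
                    * change_failure_rate
                    * mean_time_to_recover_hours
                    * hourly_cost_of_outage)

print(f"Estimated monthly cost of downtime: ${cost_of_downtime:,.0f}")
# 40 * 0.05 * 1.5 * 8000 = $24,000 per month; halving the time to recover halves it.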
In other words, improving the time to recover lowers the real dollar cost of failure. A well-executed post-incident review examines what accounts for the time between acknowledgment and recovery, and it uncovers endless ways in which teams can improve.
Status Pages
As mentioned previously, transparency is extremely important, and not only within teams or the organization as a whole. Customers and users of our services expect transparent and timely communication about problems and remediation efforts. To provide this feedback, many employ status pages as a quick and easy way of issuing real-time updates regarding the state of a service.
Examples include:
https://status.newrelic.com/
https://status.twilio.com/
https://status.aws.amazon.com/
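Publishing those updates can itself be automated so responders are not distracted during a firefight. The sketch below posts an update to a hypothetical status-page REST endpoint; real providers each have their own API, so treat this only as the general shape of the call.

import requests

STATUS_API = "https://status.example.com/api/v1/incidents"  # hypothetical endpoint
API_TOKEN = "replace-with-a-token-from-your-secret-store"

def post_status_update(title, body, status="investigating"):
    """Publish a customer-facing update for an ongoing incident."""
    resp = requests.post(
        STATUS_API,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json={"title": title, "body": body, "status": status},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

post_status_update(
    "Elevated API error rates",
    "We are investigating elevated error rates and will post an update within 30 minutes.",
)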
SLA impact
Along with the severity of the incident, many in management roles may be very concerned about the possible impact on any service level agreements (SLAs) that are in place. In some cases, harsh penalties are incurred when SLAs are broken. The sad truth about SLAs is that while they are put in place to establish a promised level of service, such as 99.999% (five nines) uptime, they incentivize engineers to avoid innovation and change to the systems. Attempts to protect the systems hinder opportunities to explore their limits and continuously improve them. Nevertheless, discussing the potential impact to SLAs that are in place helps everyone understand the severity of the failure and any delays in restoring service. If engineers are to make the case that they should be allowed to innovate on technology and process, they will have to ensure that they can work within the constraints of the SLA or allowable downtime.
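It helps to translate an SLA percentage into the downtime budget it actually allows; "five nines" sounds abstract until it is expressed in minutes per year. A quick calculation:

MINUTES_PER_YEAR = 365 * 24 * 60

for sla in (0.99, 0.999, 0.9999, 0.99999):
    allowed_minutes = MINUTES_PER_YEAR * (1 - sla)
    print(f"{sla:.3%} uptime allows about {allowed_minutes:,.1f} minutes of downtime per year")

# 99.000% -> ~5,256 minutes (roughly 3.7 days)
# 99.900% -> ~526 minutes (roughly 8.8 hours)
# 99.990% -> ~53 minutes
# 99.999% -> ~5.3 minutes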
Customer impact
Someone in attendance at the exercise should be able to speak to questions regarding customer impact. Having a member of support, sales, or another customer-facing team provide that kind of feedback to the engineers helps to build empathy and a better understanding of the experience and its impact on end users. Shortening the feedback loop from user to engineer helps put things into perspective.
Contributing Factors
When discussing an incident, evidence will likely emerge that no single piece or component of the system can be pointed to as the clear cause of the problem. Systems are constantly changing, and their interconnectivity means that several distinct problems typically contribute, in various ways and to varying degrees, to failures. Simple, linear systems have a direct and obvious relationship between cause and effect. This is rarely the case with complex systems. Many still feel that by scrutinizing systems, breaking them down, and figuring out how they work, we should be able to understand and control them. Unfortunately, this is often impossible.
We are in an era, one in which we are building systems that can't be grasped in their totality or held in the mind of a single person; they are simply too complex.
Samuel Arbesman, Overcomplicated
Action Items
Once we have had a chance to discuss the events that took place, the conversations that were had, and how efforts helped (or didn't) to restore service, we need to make sure our learnings are applied to improving the system as quickly as possible. A list of action items should be captured whenever suggestions are made on how to better detect and recover from this type of problem. An example of an action item may be to begin monitoring time-series data for a database and to establish thresholds to alert on; shortening the time it takes to detect a problem shortens the time it takes to recover from it.
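A first pass at that example action item might look something like the polling check below. The metric name, threshold, and notification hook are placeholders, and in practice the check would more likely live as an alerting rule inside the monitoring system itself; the point is that the suggestion becomes a small, reviewable piece of work.

import time
import requests

METRICS_URL = "http://prometheus.example.com:9090/api/v1/query"   # hypothetical
QUERY = "avg(database_connections_active)"                        # placeholder metric
THRESHOLD = 400                                                    # placeholder limit

def notify_on_call(message):
    # Stand-in for a real paging integration.
    print("ALERT:", message)

def check_database_connections():
    """Poll a database metric and page when it crosses the agreed threshold."""
    resp = requests.get(METRICS_URL, params={"query": QUERY}, timeout=10)
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    if results:
        value = float(results[0]["value"][1])
        if value > THRESHOLD:
            notify_on_call(f"Active DB connections at {value:.0f}, above {THRESHOLD}")

if __name__ == "__main__":
    while True:
        check_database_connections()
        time.sleep(60)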
The Problem
Lack of involvement from Development meant Operations often band-aided problems. Automation and DevOps best practices were not typically considered key components of improvements to be made to the system. Often problems were attacked with more process and local optimization, without addressing the proximate cause. In a lot of cases, teams simply moved on without attacking the problem at all.
The Shift in Thinking
The first large post-incident review I participated in was related to a product I inherited as part of our DevOps reorganization. I was still fairly new to working with the product when we had a major failure in production. It quickly became clear this kind of review should be done for many more incidents as well.
The Plan
After we saw the real value in the exercise, we set out to hold a
team-wide post-incident review on every production outage.
We didn't have a lot of formality around this; we just wanted to ensure we were constantly learning and growing each time we had an issue.
Additionally, we beefed up the process around our leadership After
Action Summaries (AASs). We moved the tracking of these from a
SharePoint Word doc to JIRA. This allowed us to link to action
items and track them to closure, as well as providing a common
location where everyone can see the documents.
Challenges
One of the biggest challenges has been analysis overload. Similar to
agile team retros, you have to select a few key items to focus on and
not try to boil the ocean. This takes discipline and senior leadership
buy-in to limit work in process and focus on the biggest bang for
the buck items.
Sample Guide
This chapter presents a guide to help get you started. A downloadable version is available at http://postincidentreviews.com.
Begin by reflecting on your goals and noting the key metrics of the
incident (time to acknowledge, time to recover, severity level, etc.)
and the total time of each individual phase (detection, response,
remediation).
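If the timeline records when the problem was detected, acknowledged, and resolved, these key metrics fall out of simple subtraction. A small sketch with made-up timestamps:

from datetime import datetime, timezone

detected     = datetime(2017, 6, 5, 3, 12, tzinfo=timezone.utc)
acknowledged = datetime(2017, 6, 5, 3, 20, tzinfo=timezone.utc)
recovered    = datetime(2017, 6, 5, 5, 47, tzinfo=timezone.utc)

time_to_acknowledge = acknowledged - detected
time_to_recover = recovered - detected

print(f"Time to acknowledge: {time_to_acknowledge}")  # 0:08:00
print(f"Time to recover:     {time_to_recover}")      # 2:35:00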
Establish and Document the Timeline
Document the details of the following in chronological order, noting
their impact on restoring service:
Detection
Response
Remediation
By plotting the tasks unfolding during the lifecycle of the incident, as in Figure 10-1, we can visualize and measure the actual work accomplished against the time it took to recover. Because we have identified which tasks made a positive, negative, or neutral impact on the restoration of service, we can visualize the lifecycle from detection to resolution. This exposes interesting observations, particularly around the length of each phase, which tasks actually made a positive impact, and where time was either wasted or used inefficiently. The graph highlights areas we can explore further in our efforts to improve uptime.
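One way to produce a visualization like Figure 10-1 from the documented timeline is a horizontal bar per phase. The sketch below assumes matplotlib is available and that phase boundaries were captured as timestamps; the times shown are made up.

from datetime import datetime
import matplotlib.pyplot as plt

# Hypothetical phase boundaries taken from the documented timeline.
phases = {
    "Detection":   (datetime(2017, 6, 5, 3, 12), datetime(2017, 6, 5, 3, 20)),
    "Response":    (datetime(2017, 6, 5, 3, 20), datetime(2017, 6, 5, 4, 5)),
    "Remediation": (datetime(2017, 6, 5, 4, 5),  datetime(2017, 6, 5, 5, 47)),
}

incident_start = min(start for start, _ in phases.values())

fig, ax = plt.subplots(figsize=(8, 2.5))
for row, (name, (start, end)) in enumerate(phases.items()):
    offset = (start - incident_start).total_seconds() / 60
    duration = (end - start).total_seconds() / 60
    ax.barh(row, duration, left=offset)

ax.set_yticks(range(len(phases)))
ax.set_yticklabels(list(phases))
ax.set_xlabel("Minutes since detection")
ax.set_title("Incident lifecycle by phase")
plt.tight_layout()
plt.show()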
Understand How Judgments and Decisions Are Made
Throughout the discussion, it's important to probe deeply into how engineers are making decisions. Genuine inquiry allows engineers to reflect on whether their approach was the best one in each specific phase of the incident. Perhaps another engineer can suggest a quicker or safer alternative. Best of all, everyone in the company learns about it.
In Chapter 6, Gary exposed Cathy to a new tool as a result of discussing the timeline in detail. Those types of discovery may seem small and insignificant, but collectively they contribute to the organization's tribal knowledge and ensure the improvement compass is pointed in the right direction.
Engineers will forever debate and defend their toolchain decisions, but exposing alternative approaches to tooling, processes, and people management encourages scrutiny of the role each plays in the organization's ongoing continuous improvement efforts.
Learnings
The most important part of the report is contained here.
Genuine inquiry within an environment that welcomes transparency and knowledge sharing not only helps us detect and recover from incidents sooner, but builds a broader understanding about the system among a larger group.
Be sure to document as many findings as possible. If any member participating in the post-incident review learns something about the true nature of the system, that should be documented. If something wasn't known by one member of the team involved in recovery efforts, it is a fair assumption that others may be unaware of it as well.
The central goal is to help everyone understand more about what really goes on in our systems and how teams form to address problems. Observations around "work as designed" versus "work as performed," as mentioned in Chapter 7, emerge as these findings are documented.
As responders describe their efforts, explore whether each task performed moved the system closer to recovery, or further away. These are the factors to evaluate more deeply for improvement opportunities.
Action Items
Finally, action items will have surfaced throughout the discussion. Specific tasks should be identified, assigned an owner, and prioritized. Tasks without ownership and priority sit at the bottom of the backlog, providing no value to either the analysis process or system health. Countermeasures and enhancements to the system should be prioritized above all new work. Until this work is completed, we know less about our system's state and are more susceptible to repeated service disruptions. Tracking action item tasks in a ticketing system helps to ensure accountability and responsibility for the work.
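Filing each action item as a tracked ticket can be scripted so nothing gets lost after the meeting. The sketch below uses a JIRA-style REST call purely as an example of the shape; the URL, project key, and credentials are placeholders, and other ticketing systems will differ.

import requests

JIRA_URL = "https://jira.example.com"            # placeholder
AUTH = ("svc-postincident", "api-token")         # placeholder credentials

def create_action_item(summary, description, owner):
    """File a post-incident action item as a ticket with an explicit owner."""
    payload = {
        "fields": {
            "project": {"key": "OPS"},           # placeholder project
            "issuetype": {"name": "Task"},
            "summary": summary,
            "description": description,
            "assignee": {"name": owner},
            "labels": ["post-incident-review"],
        }
    }
    resp = requests.post(f"{JIRA_URL}/rest/api/2/issue",
                         json=payload, auth=AUTH, timeout=10)
    resp.raise_for_status()
    return resp.json()["key"]

create_action_item("Alert on database connection saturation",
                   "Surfaced during the post-incident review of the last outage.",
                   "cathy")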
Summaries and Public Reports
These exercises provide a great deal of value to the team or organization. However, there are likely others who would like to be informed about the incident, especially if it impacted customers. A high-level summary should be made available, typically consisting of several or all of the following sections:
Summary
Services Impacted
Duration
Severity
Customer Impact
Proximate Cause
Resolution
Countermeasures or Action Items
CHAPTER 11
Readiness
We never had a name for that huddle and discussion after I'd lost months' worth of customer data. It was just, "Let's talk about last night." That was the first time I'd ever been a part of that kind of investigation into an IT-related problem.
At my previous company, we would perform RCAs following incidents like this. I didn't know there was another way to go about it.
We were able to determine that the proximate cause was a bug in a backup script unique to Open CRM installations on AWS. However, we all walked away with much more knowledge about how the system worked, armed with new action items to help us detect and recover from future problems like this much faster. As with the list of action items in Chapter 6, we set in motion many ways to improve the system as a whole rather than focusing solely on the one distinct part of the system that failed under very specific circumstances.
It wasn't until over two years later, after completely immersing myself in the DevOps community, that I realized the exercise we had performed (intentionally or not) was my very first post-incident review. I had already read blog posts and absorbed presentation after presentation about the absence of root cause in complex systems. But it wasn't until I made the connection back to that first post-incident review that I realized it's not about the report or discovering the root cause; it's about learning more about the system and
opening new opportunities for improvement, gaining a deeper understanding of the system as a whole, and accepting that failure is a natural part of the process. Through that awareness, I finally saw the value in analyzing the unique phases of an incident's lifecycle. By setting targets for small improvements throughout detection, response, and remediation, I could make dealing with and learning from failure a natural part of the work.
Thinking back on that day now gives me a new appreciation of what we were doing at that small startup and how advanced it was in a number of ways. I also feel fortunate that I can share that story and the stories of others, and what I've learned along the way, to help reshape your view of post-incident analysis and how you can continuously improve the reliability and availability of a service.
Post-incident reviews are so much more than discussing and documenting what happened in a report. They are often seen as only a tool to explain what happened and identify a cause, severity, and corrective action. In reality, they are a process intended to improve the system as a whole. By reframing the goal of these exercises as an opportunity to learn, a wealth of areas to improve becomes clear.
As we saw in the case of CSG International, the value of a post-incident review goes well beyond the artifact produced as a summary. They were able to convert local discoveries into improvements in areas outside of their own.
They've created an environment for constant experimentation, learning, and making systems safer, all while making them highly resilient and available.
Teams and individuals are able to achieve goals much more easily
with ever-growing collective knowledge regarding how systems
work. The results include better team morale and an organizational
culture that favors continuous improvement.
The key takeaway: focus less on the end result (the cause-and-fix report) and more on the exercise itself, which reveals many areas of improvement.
When challenged to review failure in this way, we find ingenious
ways to trim seconds or minutes from each phase of the incident
lifecycle, making for much more effective incident detection,
response, and remediation efforts.
If you can honestly answer for yourself what your next best step is, and you are satisfied with that answer, my job here is done.
Wherever these suggestions and stories take you, I wish you good
luck on your journey toward learning from failure and continuous
improvement.