2019 SRE Report
Table of contents
Methodology
Executive Summary
For the second year, we surveyed site reliability engineers (SREs), or those who identify as SREs, to understand more about this emerging
role. Last year we focused on who SREs are, where they work, what they do, and how they do it. The 2018 report explored the skills,
toolset used, and the corporate culture to determine if there is a core set of principles across teams and organizations.
This year’s survey examined team structure, outages, incidents, and post-incident stress. We looked to answer the question of "What
impact do incidents have on organizations and the people responding to them?" Organizations are focused on building resilient
systems and recovering quickly, but does this focus extend to employee resilience and recovery from post-incident stress?
The 2019 report analyzed responses from 188 SREs globally across a range of industries and company sizes. This report provides a
unique view of trends and issues facing site reliability engineers and the organizations that employ them.
Acknowledgements
This report would not exist without the contribution and support of many people inside and outside of Catchpoint.
The inspiration for this year’s survey came from a talk given by Jaime Woo from Incident Labs at SREcon. Jaime was instrumental in
defining and refining the questions on post-incident stress. Thank you for the inspiration and assistance.
Not being an SRE myself, I wanted to make sure the questions made sense and weren’t missing important pieces. Seth Vargo from
Google and Liz Fong-Jones from Honeycomb provided valuable feedback on the types of questions to ask.
Phrasing survey questions to get meaningful answers is harder than it looks. Nicole Forsgren from Google gave me guidance on
avoiding loaded questions/answers, providing clarification and definitions, and using best practices for survey questions.
Thank you to the Catchpoint employees who were instrumental in the creation of this report:
• Sarah Sanders for handling the logistics, communications, and overall program management.
Finally, this report would not be possible without those who took the time to respond to and share the survey. We took a bit of a risk
asking people to share personal information on how they handle post-incident stress. Thank you for sharing your thoughts and
feelings.
Key finding 1
64%
Key finding 2
Incident management is a massive part of the job.
49% had worked on an incident in the last week.
Key finding 3
Resolving incidents is stressful.
79%
Key finding 4
A supportive team reduces post-incident stress.
67%
Survey demographics & firmographics
Geography
55% North America
34% Europe
7% Asia
3% Australia
1% South America
Industry
62% of respondents work in technology related industries
14% of respondents work in retail/consumer ecommerce
No other industry had more than 5% of respondents. As a
Role
A wide variety of titles are used to refer to people doing SRE work: 45% had a title of SRE, but the remainder self-identified as doing
SRE work. When including management with an SRE title (SRE Manager, SRE Director, etc.), the percentage increased to 49%.
45% SRE
15% Management
12% Engineer/Developer/Programmer
11% DevOps
Size of organization
Key finding 1
with fewer than 10 SREs. 6% are the sole SRE
51% 2-10
5% 50-99
22% 100+
The SRE concept, while in existence for over
26% Within the last 12 months
6% 6-10 years
31%
How was the SRE team built?
renaming a team is not the same as evolving a team or it growing organically.
29% Evolved from operations/systems administration
13% We renamed an operations/engineering/system administration team the SRE team
Impact of toil
Toil is manual, repetitive, automatable, tactical work that scales linearly and is the main source of concern for SREs. 59% believe there
is too much toil in their organization and not enough has been automated to reduce that toil. Nobody strongly agreed with the
statement “We have used automation to reduce toil” while 48.5% disagreed or strongly disagreed. The two main sources of toil for
SREs are in maintenance tasks and non-urgent, service-related messages. Those maintenance tasks are an automation opportunity to
30% Maintenance tasks
27% Non-urgent service related messages
16% Releases
15% On-call notifications
7% Non-service related messages
Service level objectives
Setting and monitoring service level objectives is a key aspect of the SRE role. The most widely tracked SLO is availability. Considering
that 27% of respondents indicated they do not have SLOs in their organization, every SRE who has SLOs tracks availability.
world turning into a mess.” Not all SREs work on external-facing applications; some SREs support internal applications, which is why
we asked about a drop in employee productivity. The drop can be related to employees not being able to access systems, or from
86%
70% Lost revenue
Catchpoint’s take on Finding 1
SRE disciplines are still nascent. SREs ensure applications and services are reliable. This includes defining what reliable means in terms
of service levels. If an API is available, but it takes 5 seconds to respond to a request, will that meet users’ expectations? Before
deciding that your organization is ready to take on SRE work (or if you already have), consider what acceptable service level
objectives are. Establish benchmarks of current application and service performance from multiple perspectives and use these to guide
For companies that are well-entrenched with SRE practices, find areas for improvement. What additional SLOs should be added? What
toil currently exists in the organization? Are there things you can automate to reduce that toil? What new toil may be created when
Consider the tools used by SREs. Are these adding to toil? Do they help you track service level objectives accurately?
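To make the slow-API question above concrete, here is a minimal Python sketch that checks a batch of request records against two objectives: availability and latency. The names (`Request`, `slo_report`) and the targets (99.9% availability, 95% of requests answered within 1 second) are illustrative assumptions, not figures from the survey. It shows how an API that is always up but takes 5 seconds to respond can meet an availability objective while missing a latency objective.

```python
from dataclasses import dataclass


@dataclass
class Request:
    ok: bool           # True if the request succeeded
    latency_s: float   # observed response time in seconds


def slo_report(requests, availability_target=0.999,
               latency_threshold_s=1.0, latency_target=0.95):
    """Compare observed behavior against hypothetical SLO targets."""
    total = len(requests)
    if total == 0:
        raise ValueError("no requests to evaluate")

    succeeded = sum(1 for r in requests if r.ok)
    # A request counts as "fast" only if it succeeded within the threshold.
    fast = sum(1 for r in requests
               if r.ok and r.latency_s <= latency_threshold_s)

    availability = succeeded / total
    fast_ratio = fast / total
    return {
        "availability": availability,
        "availability_met": availability >= availability_target,
        "fast_ratio": fast_ratio,
        "latency_met": fast_ratio >= latency_target,
    }


# Every request succeeds, but each one takes 5 seconds.
slow_but_up = [Request(ok=True, latency_s=5.0) for _ in range(100)]
report = slo_report(slow_but_up)
# report["availability_met"] is True, report["latency_met"] is False
```

Tracking both objectives side by side is one way to catch the case where a service is technically available but still fails users’ expectations.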
Key finding 2
Incident management is a massive part of the job
For the survey, we defined an incident as an unplanned interruption to an application or service that reduces the quality of the service.
Incidents are assigned priorities based on the scope, impact, complexity, and urgency of the failure or interruption.
88% of SREs receive notifications about incidents via alerting and notification tools, but a handful are still being notified by coworkers
Incidents are an unknown and can be difficult to prepare for; some are easy, some are not. Almost 50% of respondents have worked on
outages lasting more than a day at some point in their career.
49% Within the last week
34% Within the last month
10% I’m working on one now
4% I don’t work on incidents
4% I don’t remember
How many service incidents do you work on in a week?
92% of respondents work on 5 or fewer incidents
48% 1 or fewer
44% Up to 5
4% Six to 10
4% Over 10
two people while other values have fairly even representation. One in 100 responses had 300 people in the on-call rotation, and another
had 150. One of the respondents who indicated zero people in the on-call rotation explained they are in pre-production, so the on-call rotation
[Chart: number of people in the on-call rotation, grouped as 0, 1, 2-5, 6-10, 10+, 100+]
Catchpoint’s take on Finding 2
Consider how many SREs your team will have and whether they will be able to support the applications and services adequately.
Include the appropriate people in the on-call rotation and ensure they have access to the alerting and notification systems. If you use
Slack, integrate alerts into the appropriate Slack channels to reduce the number of times people find out about an incident from a
Examine if there is a pattern when incidents occur. Do more incidents happen after code deploys? If so, consider if additional
Key finding 3
Resolving incidents is stressful: Understanding post-incident stress
The survey defined post-incident stress as changes to physical and psychological well-being experienced up to two days after an
incident occurs. Post-incident stress can last for a few minutes or up to two days.
How often do you experience post-incident stress?
67% of those who report post-incident stress worked on an incident in the last week; 14% indicated they were currently working on an
incident.
One way to not experience post-incident stress is to not work on incidents. 18% of those that never experience post-incident stress
don’t remember the last time they worked on an incident or reported that they don’t work on incidents.
On the flip side, those who are the only SRE in an organization will always experience some stress. Post-incident stress happens after
some or all incidents for the 12 people who are the only SRE in their organization. There is never an incident where they don’t feel
On average, after an incident how would you rate your stress level?
Stress level is subjective. One person may classify their stress level as low while another as moderate or high. What matters here is
this is an individual’s perception of their stress level. Just because something isn’t stressful for you, doesn’t mean it
post-incident stress levels. I am fortunate enough to have a good culture and long tenure. Also, it is rarely my service that is the
root cause of our incidents (knock wood, no brag, thug life).” Analysis of stress level based on seniority or title did not reveal any
noticeable difference in the respondents.
2% Very High
49% Moderate
29% Low
Do you experience higher levels of stress during more severe incidents
(i.e., widespread outage vs minor incident)?
68% of those reporting low stress report the level varies based on the severity of the incident compared to 85-100% reporting
52%
Even those who reported never experiencing post-incident stress
48% Concentration
9% Appetite
1% None
61%
There are many things people do to relieve stress.
52% Spend time on a hobby
Catchpoint’s take on Finding 3
The SRE role is stressful. There are steps organizations can take from a process and technology perspective to reduce that stress. One
way to reduce the amount of stress employees feel is to deploy alerting and notification solutions. Of the respondents that always
experience post-incident stress, 20% discover incidents from users contacting the help desk compared to 2-3% for those who never or
sometimes experience post-incident stress. Stress levels may decrease if employees receive notifications before users start
complaining.
If you already have an alerting and notification solution in place, but the stress levels are still high, explore whether this is due to alert
fatigue. Are there too many false positives and false negatives occurring? Are you missing critical notifications because you aren’t
Conducting game-day drills can help prepare the team for live incidents.
If all these are in place, the solution may be people-related and not solvable with technology; read on to learn more.
Key finding 4
A supportive team reduces post-incident stress
Employees who feel their employer cares about their well-being experience less stress. 76% of the people who experience stress after
every incident either feel neutral or disagree that their company cares about their well-being. SREs feel their teams care more about
[Two charts, each on a scale of Strongly disagree, Disagree, Neutral, Agree, Strongly agree]
61% of respondents said their company reinforces a just/blameless culture
We opted not to include an option for “nothing”; however, 9% of respondents wrote that in. 10% of those who report stress after every
incident indicate their company does nothing to help alleviate the stress. 16% of the people that reported high levels of stress after
How supported an SRE feels by their team during and after an
stress after some, and 43% feeling stress after every incident. 20% of those feeling stress after every incident feel little to no support after
an incident compared to 14% of those experiencing some stress and
     During incident   Post incident
3    13%               22%
4    31%               23%
Catchpoint’s take on Finding 4
Stress is considered “part of the job” but ignoring the stress is not healthy.
The notion of blameless post-mortems is good, but this doesn’t eliminate the stress experienced when resolving incidents.
Organizations need to deploy more concrete ways of reducing stress. Recognize that failure is stressful for many. Finding ways to
reduce the number of incidents, and most importantly the number of high priority incidents, will go a long way towards reducing
stress.
When people are on call, or when incidents occur, compensate those who worked on the incident. A few suggestions from
respondents include paying people to be on-call or offering the option for extra time off. One company referred to this policy as
"surge protection."
Conduct regular post-incident reviews. Document what went wrong and identify whether additional investments are needed to fix a
problem. Share knowledge and information across the organization and teams.
Methodology
In January 2019 Catchpoint conducted an SRE survey promoted via email lists and social media. The survey questioned technical
professionals from across a variety of industries about their role as a site reliability engineer, how they manage incidents, and the
About the Author
Dawn is a Director at Catchpoint where she uses her storytelling prowess to write and speak
about the intersection of technology and psychology. She makes technical information
accessible, avoiding buzzwords and jargon whenever possible. Dawn has spoken at
DevOpsDays, Velocity, Interop, and Monitorama. Her articles have appeared in numerous
technical publications. She uses her non-existent spare time to serve as a chapter organizer
About Catchpoint
Catchpoint is revolutionizing end-user experience monitoring to help companies deliver amazing digital experiences. Our platform
provides complete visibility into your users’ experiences from anywhere – and real-time intelligence into your applications and services
to detect and fix issues faster. We are proud to partner with digital innovators like L'Oréal, Verizon, Oracle, LinkedIn, Honeywell,
Priceline, and Qualtrics, who trust Catchpoint to improve their brand experience and drive their business success. See how Catchpoint