
2019 SRE Report


Table of contents

Acknowledgements / Executive Summary
Key Findings Summary
Survey demographics and firmographics
Key Finding 1: SRE still emerging practice
Key Finding 2: Incident Resolution massive part of job
Key Finding 3: Resolving incidents is stressful
Key Finding 4: Team support reduces stress
Methodology

Executive Summary
For the second year, we surveyed Site Reliability Engineers, or those who identify as SREs, to understand more about this emerging role. Last year we focused on who SREs are, where they work, what they do, and how they do it. The 2018 report explored the skills, the toolset used, and the corporate culture to determine if there is a core set of principles across teams and organizations.

This year’s survey examined team structure, outages, incidents, and post-incident stress. We looked to answer the question "What impact do incidents have on organizations and the people responding to them?" Organizations are focused on building resilient systems and recovering quickly, but does this focus extend to employee resilience and recovery from post-incident stress?

The 2019 report analyzed responses from 188 SREs globally across a range of industries and company sizes. This report provides a unique view of trends and issues facing site reliability engineers and the organizations that employ them.

Acknowledgements
This report would not exist without the contribution and support of many people inside and outside of Catchpoint.

The inspiration for this year’s survey came from a talk given by Jaime Woo from Incident Labs at SREcon. Jaime was instrumental in defining and refining the questions on post-incident stress. Thank you for the inspiration and assistance.

Not being an SRE myself, I wanted to make sure the questions made sense and weren’t missing important pieces. Seth Vargo from Google and Liz Fong-Jones from Honeycomb provided valuable feedback on the types of questions to ask.

Phrasing survey questions to get meaningful answers is harder than it looks. Nicole Forsgren from Google gave me guidance on avoiding loaded questions/answers, providing clarification and definitions, and using best practices for survey questions.

Thank you to the Catchpoint employees who were instrumental in the creation of this report:

• Taylor Meluch for her design wizardry.
• Kayla Lee for her detailed editorial eye.
• Peter Saulitis for his guidance and insights on branding.
• Sarah Sanders for handling the logistics, communications, and overall program management.

Finally, this report would not be possible without those who took the time to respond to and share the survey. We took a bit of a risk asking people to share personal information on how they handle post-incident stress. Thank you for sharing your thoughts and feelings.

Key finding 1

Site Reliability Engineering is still emerging as a practice. 64% of respondents indicate the SRE role or team has been in existence for three years or less.

Key finding 2

Incident resolution is a massive part of the job. 49% of respondents indicated they had worked on an incident in the last week.

Key finding 3

Resolving incidents is stressful. 79% of respondents experience post-incident stress.

Key finding 4

A supportive team reduces post-incident stress. 67% of SREs who feel stress after every incident do not believe their company cares about their well-being.

Survey demographics & firmographics

Industry

62% of respondents work in technology related industries.
14% of respondents work in retail/consumer ecommerce.

No other industry had more than 5% of respondents. As a result, we did not analyze whether industry impacts an answer.

Geography

55% North America
34% Europe
7% Asia
3% Australia
1% South America

Role

A wide variety of titles are used to refer to people doing SRE work: 45% had a title of SRE, but the remainder self-identified as doing SRE work. When including management with an SRE title (SRE Manager, SRE Director, etc.), the percentage increased to 49%.

29% held senior positions (this includes people with the word lead, architect, or senior in their title). 16% are in leadership positions (manager, director, VP, or executive). The remaining are junior or mid-level.

45% SRE
15% Management
12% Engineer/Developer/Programmer
11% DevOps
8.5% Infrastructure & operations
3% Other
2% Architect
2% Executive

Size of organization

14% Less than 50
36% 50-999
19% 1000-4999
31% 5000+

Key finding 1

SRE is new and still being defined

How many SREs are in your organization?

The majority of SREs work in organizations with 10 or fewer SREs. 6% are the sole SRE in the organization.

6% I’m the only one
51% 2-10
16% 11-49
5% 50-99
22% 100+

How long has the SRE team been in existence?

The SRE concept, while in existence for over 15 years, is still in its infancy.

26% Within the last 12 months
38% 1-3 years
13% 3-6 years
6% 6-10 years
15% 10+ years

How was the SRE team built?

Given that SRE is relatively new, we were interested in how the team/role came into being across organizations. Some answers sounded similar, but there were nuances to them. For example, renaming a team is not the same as evolving a team or it growing organically.

31% It grew organically
29% Evolved from operations/systems administration
13% We renamed an operations/engineering/system administration team the SRE team
13% Select people were chosen for a team
9% Executive sponsor said “we are now doing SRE”
2% We hired junior level people and trained them

Impact of toil

Toil is manual, repetitive, automatable, tactical work that scales linearly and is the main source of concern for SREs. 59% believe there is too much toil in their organization and not enough has been automated to reduce that toil. Nobody strongly agreed with the statement “We have used automation to reduce toil” while 48.5% disagreed or strongly disagreed. The two main sources of toil for SREs are in maintenance tasks and non-urgent, service-related messages. Those maintenance tasks are an automation opportunity to help reduce toil.

There is too much toil in the organization

3% Strongly disagree
10% Disagree
27% Neutral
32% Agree
27% Strongly agree

What is your top source of toil?

30% Maintenance tasks
27% Non-urgent service related messages
16% Releases
15% On-call notifications
7% Non-service related messages

We have used automation to reduce toil

40% Strongly disagree
8.5% Disagree
13% Neutral
38% Agree
0% Strongly agree

Service level objectives
Setting and monitoring service level objectives is a key aspect of the SRE role. The most widely tracked SLO is availability. Considering that 27% of respondents indicated they do not have SLOs in their organization, every SRE that has SLOs tracks availability.

We have defined SLOs for all essential services

26% Strongly disagree
22% Disagree
22% Neutral
20% Agree
10% Strongly agree

Our service level objectives cover:

72% Availability
47% Response time
46% Latency
27% We don’t have SLOs

Business impact of incidents

A missed SLO can have a noticeable impact on the business. One SRE rightfully indicated that a consequence of an incident is “the world turning into a mess.” Not all SREs work on external-facing applications; some SREs support internal applications, which is why we asked about a drop in employee productivity. The drop can be related to employees not being able to access systems, or from employees having to resolve the incident.

86% Drop in customer satisfaction
70% Lost revenue
57% Drop in employee productivity
49% Lost customers
36% Social media backlash

Catchpoint’s take on Finding 1

SRE disciplines are still nascent. SREs ensure applications and services are reliable. This includes defining what reliable means in terms of service levels. If an API is available, but it takes 5 seconds to respond to a request, will that meet users’ expectations? Before deciding that your organization is ready to take on SRE work (or if you already have), consider what acceptable service level objectives are. Establish benchmarks of current application and service performance from multiple perspectives and use these to guide the creation of your service level objectives.
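
The difference between "available" and "meets expectations" can be made concrete with a small calculation. The sketch below is illustrative only: the `RequestSample` type, the one-second threshold, and the sample values are assumptions made for this example and do not come from the survey. It compares plain availability with the share of requests that both succeed and respond quickly enough.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestSample:
    """One observed request (hypothetical benchmark data)."""
    succeeded: bool
    latency_seconds: float


def availability(samples: List[RequestSample]) -> float:
    """Fraction of requests that succeeded, regardless of speed."""
    if not samples:
        raise ValueError("at least one sample is required")
    return sum(s.succeeded for s in samples) / len(samples)


def good_request_fraction(samples: List[RequestSample],
                          max_latency_seconds: float) -> float:
    """Fraction of requests that succeeded AND met the latency threshold."""
    if not samples:
        raise ValueError("at least one sample is required")
    good = sum(
        s.succeeded and s.latency_seconds <= max_latency_seconds
        for s in samples
    )
    return good / len(samples)


# Made-up samples: one request succeeds but takes 5 seconds.
samples = [
    RequestSample(True, 0.2),
    RequestSample(True, 5.0),
    RequestSample(True, 0.3),
    RequestSample(False, 0.1),
]

avail = availability(samples)                # 3 of 4 succeeded
good = good_request_fraction(samples, 1.0)   # only 2 of 4 were also fast
print(f"availability={avail:.2f}, good requests={good:.2f}")
```

In this sketch the 5-second response counts toward availability but not toward the latency-aware figure, which is the gap an availability-only SLO would hide.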

For companies that are well-entrenched with SRE practices, find areas for improvement. What additional SLOs should be added? What toil currently exists in the organization? Are there things you can automate to reduce that toil? What new toil may be created when new SLOs are implemented, or new services are launched?

Consider the tools used by SREs. Are these adding to toil? Do they help you track service level objectives accurately?

Key finding 2

Incident management is a massive part of the job

For the survey, we defined an incident as an unplanned interruption to an application or service that reduces the quality of the service. Incidents are assigned priorities based on the scope, impact, complexity, and urgency of the failure or interruption.

88% of SREs receive notifications about incidents via alerting and notification tools, but a handful are still being notified by coworkers or users contacting the helpdesk.


When was the last time you worked on a service incident?

Incidents are an unknown and can be difficult to prepare for; some are easy, some are not. Almost 50% of respondents have worked on outages lasting more than a day at some point in their career.

49% Within the last week
34% Within the last month
10% I’m working on one now
4% I don’t work on incidents
4% I don’t remember

How many service incidents do you work on in a week?

92% of respondents work on 5 or fewer incidents.

48% 1 or fewer
44% Up to 5
4% Six to 10
4% Over 10

How many people are in your on-call rotation?

On-call rotation can vary. In some instances, there are rotations of rotations. Even companies with fewer than 50 people have varied sizes in their on-call rotation. 30% of respondents working at companies with fewer than 50 people report an on-call rotation of two people while other values have fairly even representation. One in 100 responses had 300 people in the on-call rotation, and another had 150.

One of the respondents who indicated zero people in the on-call rotation explained they are in pre-production, so the on-call rotation and responsibilities do not yet exist.

[Chart: percent of respondents by number of people on-call (0, 1, 2-5, 6-10, 10+, 100+). Bar values: 44, 32, 14, 4, 4, 2.]

Catchpoint’s take on Finding 2

Consider how many SREs your team will have and whether they will be able to support the applications and services adequately. Include the appropriate people in the on-call rotation and ensure they have access to the alerting and notification systems. If you use Slack, integrate alerts into the appropriate Slack channels to reduce the number of times people find out about an incident from a coworker or users opening a support ticket.
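
One common way to route alerts into a Slack channel is an incoming webhook, which accepts a JSON body with a `text` field sent by HTTP POST. The sketch below is a minimal illustration under assumptions: the helper name, the message wording, the service name, and the placeholder webhook URL are all hypothetical, and a real deployment would more likely rely on the integration built into its alerting tool.

```python
import json
import urllib.request


def build_slack_alert_request(webhook_url: str,
                              service: str,
                              summary: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for a Slack incoming webhook.

    The webhook URL is issued by Slack when the webhook is created; the one
    used in the example below is a placeholder, not a working endpoint.
    """
    payload = {"text": f"Incident on {service}: {summary}"}
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical values for illustration only.
request = build_slack_alert_request(
    "https://hooks.slack.com/services/T000/B000/XXXX",
    "checkout-api",
    "error rate above threshold",
)
print(request.get_method(), request.data.decode("utf-8"))

# To actually deliver the alert, the request would be sent with:
# urllib.request.urlopen(request, timeout=5)
```

Building the request separately from sending it keeps the message format easy to test without touching the network.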

Examine if there is a pattern when incidents occur. Do more incidents happen after code deploys? If so, consider if additional monitoring or testing in pre-production or development can reduce that.
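
One rough way to look for a deploy-related pattern is to count how many incidents started shortly after a deploy. The sketch below assumes you can export deploy and incident start times from your own tooling; the function name, the two-hour window, and the timestamps are made up for illustration and are not survey data.

```python
from datetime import datetime, timedelta
from typing import List


def share_of_incidents_after_deploys(deploys: List[datetime],
                                     incidents: List[datetime],
                                     window: timedelta) -> float:
    """Fraction of incidents that began within `window` after any deploy."""
    if not incidents:
        return 0.0
    linked = 0
    for incident in incidents:
        # An incident counts if it started at or after a deploy,
        # but no later than `window` after that deploy.
        if any(timedelta(0) <= incident - deploy <= window
               for deploy in deploys):
            linked += 1
    return linked / len(incidents)


# Illustrative timestamps only.
deploys = [
    datetime(2019, 1, 7, 10, 0),
    datetime(2019, 1, 9, 15, 0),
]
incidents = [
    datetime(2019, 1, 7, 10, 30),   # 30 minutes after the first deploy
    datetime(2019, 1, 8, 3, 0),     # not close to any deploy
    datetime(2019, 1, 9, 16, 45),   # 1 hour 45 minutes after the second deploy
    datetime(2019, 1, 11, 9, 0),    # not close to any deploy
]

share = share_of_incidents_after_deploys(deploys, incidents, timedelta(hours=2))
print(f"{share:.0%} of incidents began within 2 hours of a deploy")
```

A high share is only a hint that deploys deserve a closer look; it does not show that the deploy caused the incident.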

Key finding 3

Resolving incidents is stressful: Understanding post-incident stress

The survey defined post-incident stress as changes to physical and psychological well-being experienced up to two days after an incident occurs. Post-incident stress can last for a few minutes or up to two days.

How often do you experience post-incident stress?

11% After every incident
68% After some incidents
21% Never

67% of those who report post-incident stress worked on an incident in the last week, and 14% indicated they were currently working on an incident.

One way to not experience post-incident stress is to not work on incidents. 18% of those who never experience post-incident stress don’t remember the last time they worked on an incident or reported that they don’t work on incidents.

On the flip side, those who are the only SRE in an organization will always experience some stress. Post-incident stress happens after some or all incidents for the 12 people who are the only SRE in their organization. There is never an incident where they don’t feel some level of stress.

On average, after an incident how would you rate your stress level?

Stress level is subjective. One person may classify their stress level as low while another as moderate or high. What matters here is this is an individual’s perception of their stress level. Just because something isn’t stressful for you, doesn’t mean it isn’t stressful for somebody else.

One respondent commented, “Culture and tenure have a lot to do with post-incident stress levels. I am fortunate enough to have a good culture and long tenure. Also, it is rarely my service that is the root cause of our incidents (knock wood, no brag, thug life).” Analysis of stress level based on seniority or title did not reveal any noticeable difference in the respondents.

2% Very high
20% High
49% Moderate
29% Low

Do you experience higher levels of stress during more severe incidents (i.e., widespread outage vs minor incident)?

82% Yes
18% No

68% of those reporting low stress report the level varies based on the severity of the incident, compared to 85-100% of those reporting moderate to very high stress.

After recent incidents, do you notice a change in any of the following?

52% Mood
48% Concentration
38% Ability to sleep
38% Desire to be social
32% Ability to enjoy things
9% Appetite
1% None

Even those who reported never experiencing post-incident stress identified a change in one of the above after working on incidents. While they may not classify this as stress, they do experience a physical or psychological reaction to managing incidents.

Which of the following do you engage in to alleviate the effects of post-incident stress?

There are many things people do to relieve stress.

61% Exercise / take a walk
52% Spend time on a hobby
48% Get a good night’s sleep
43% Spend time with people
35% Drink alcohol

Catchpoint’s take on Finding 3

The SRE role is stressful. There are steps organizations can take from a process and technology perspective to reduce that stress. One way to reduce the amount of stress employees feel is to deploy alerting and notification solutions. Of the respondents that always experience post-incident stress, 20% discover incidents from users contacting the help desk compared to 2-3% for those who never or sometimes experience post-incident stress. Stress levels may decrease if employees receive notifications before users start complaining.

If you already have an alerting and notification solution in place, but the stress levels are still high, explore whether this is due to alert fatigue. Are there too many false positives and false negatives occurring? Are you missing critical notifications because you aren’t monitoring all the critical elements of the application or service?
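
The false positive and false negative questions above can be put into numbers with two ratios: precision (how many alerts pointed at a real problem) and recall (how many real problems produced an alert). The sketch below is a minimal illustration; the function name and the counts are hypothetical and would in practice come from reviewing your own alert and incident history.

```python
from typing import Tuple


def alert_quality(true_positives: int,
                  false_positives: int,
                  false_negatives: int) -> Tuple[float, float]:
    """Return (precision, recall) for a set of reviewed alerts.

    true_positives:  alerts that fired for a real problem
    false_positives: alerts that fired when nothing was wrong
    false_negatives: real problems for which no alert fired
    """
    fired = true_positives + false_positives
    real_problems = true_positives + false_negatives
    precision = true_positives / fired if fired else 0.0
    recall = true_positives / real_problems if real_problems else 0.0
    return precision, recall


# Made-up counts: 100 alerts fired, 30 were real, and 10 real problems
# produced no alert at all.
precision, recall = alert_quality(true_positives=30,
                                  false_positives=70,
                                  false_negatives=10)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Low precision suggests noisy alerts that can feed alert fatigue, while low recall suggests gaps in what is being monitored.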

Conducting game-day drills can help prepare the team for live incidents.

If all these are in place, the solution may be people related and not solvable with technology. Read on to learn more.

Key finding 4

A supportive team reduces post-incident stress

Employees that feel their employer cares about their well-being experience less stress. 76% of the people who experience stress after every incident either feel neutral or disagree that their company cares about their well-being. SREs feel their teams care more about their well-being than their companies do.

My company cares about my physical and mental well being.

9% Strongly disagree
15% Disagree
26% Neutral
32% Agree
18% Strongly agree

My team cares about my physical and mental well being.

5% Strongly disagree
6% Disagree
20% Neutral
29% Agree
40% Strongly agree

What does your organization do to alleviate post-incident stress?

We opted not to include an option for “nothing”; however, 9% of respondents wrote that in. 10% that report stress after every incident indicate their company does nothing to help alleviate the stress. 16% of the people that reported high levels of stress after an incident say their company does nothing to help alleviate the impact of post-incident stress.

61% Reinforces a just/blameless culture
40% Provide extra time off
38% Checks in to see how you’re doing
7% Offers free massages
7% Reduces on-call rotation

How supported do you feel by your team during/after an incident?

How supported an SRE feels by their team during and after an incident influences their stress levels. Overall 80% feel supported by their team after an incident. These numbers drop for those who experience stress after some or all incidents: 64% for those feeling stress after some, and 43% feeling stress after every incident. 20% of those feeling stress after every incident feel little to no support after an incident compared to 14% of those experiencing some stress and 5% who never experience stress.

Rating               During incident   Post incident
1 (Not supported)    1%                5%
2                    4%                7%
3                    13%               22%
4                    31%               23%
5 (Supported)        49%               41%

Catchpoint’s take on Finding 4

Stress is considered “part of the job” but ignoring the stress is not healthy.

The notion of blameless post-mortems is good, but this doesn’t eliminate the stress experienced when resolving incidents. Organizations need to deploy more concrete ways of reducing stress. Recognize that failure is stressful for many. Finding ways to reduce the number of incidents, and most importantly the number of high priority incidents, will go a long way towards reducing stress.

When people are on call, or when incidents occur, compensate those that worked on the incident. A few suggestions from respondents include paying people to be on-call or offering the option for extra time off. One company referred to this policy as "surge protection."

Conduct regular post-incident reviews. Document what went wrong and identify whether additional investments are needed to fix a problem. Share knowledge and information across the organization and teams.

Methodology

In January 2019, Catchpoint conducted an SRE survey promoted via email lists and social media. The survey questioned technical professionals from across a variety of industries about their role as a site reliability engineer, how they manage incidents, and the prevalence of post-incident stress. 188 people responded to the survey.

About the Author

Dawn is a Director at Catchpoint where she uses her storytelling prowess to write and speak about the intersection of technology and psychology. She makes technical information accessible, avoiding buzzwords and jargon whenever possible. Dawn has spoken at DevOpsDays, Velocity, Interop, and Monitorama. Her articles have appeared in numerous technical publications. She uses her non-existent spare time to serve as a chapter organizer for Write/Speak/Code, a non-profit organization to empower women and non-binary coders to become speakers, writers, and leaders.

About Catchpoint

Catchpoint is revolutionizing end-user experience monitoring to help companies deliver amazing digital experiences. Our platform provides complete visibility into your users’ experiences from anywhere, and real-time intelligence into your applications and services to detect and fix issues faster. We are proud to partner with digital innovators like L'Oréal, Verizon, Oracle, LinkedIn, Honeywell, Priceline, and Qualtrics, who trust Catchpoint to improve their brand experience and drive their business success. See how Catchpoint can reduce your Mean Time to Detect at www.catchpoint.com/freetrial.