SRE Foundation V1 - 0 - Value Added Resources 11 - 2019

SRE Foundation Course: Value Added Resources 

This document provides links to articles and videos related to the Site Reliability Engineering 
(SRE) course from the DevOps Institute. This information is provided to enhance your 
understanding of SRE-related concepts and terms and is not examinable. Of course, there 
is a wealth of other videos, blogs and case studies on the web. 
We welcome suggestions for additions. 

Videos Featured in the Course 

Module  Title & Description  Link
1: SRE Principles & Practices  ‘What's the Difference Between DevOps and SRE?’ with Seth Vargo and Liz Fong-Jones of Google (05:10)  https://youtu.be/uTEL8Ff1Zvk
2: Service Level Objectives & Error Budgets  ‘Risk and Error Budgets’ with Seth Vargo and Liz Fong-Jones of Google (06:17)  https://youtu.be/y2ILKr8kCJU
3: Reducing Toil  ‘Pragmatic Automation’ with Max Luebbe of GCP (04:45)  https://www.youtube.com/watch?v=oDcjAcFTFC0&t=0m56s
4: Monitoring & Service Level Indicators  ‘SLI & Reliability Deep-Dive’ with David N. Blank-Edelman of Microsoft (08:35)  https://www.youtube.com/watch?v=1iMo3SkdQqQ
5: SRE Tools & Automation  ‘Ironies of Automation: A Comedy in Three Parts’ with Tanner Lund of Microsoft (18:32)  https://www.youtube.com/watch?v=U3ubcoNzx9k
6: Anti-Fragility & Learning from Failure  ‘Sloth, a Tool for Inducing Network Failures’ with Preetha Appan of Indeed.com (04:45)  https://www.usenix.org/conference/srecon17americas/program/presentation/appan
7: Organizational Impact of SRE  ‘A History of SRE at Uber’ with Rick Boone of Uber (06:24)  https://www.youtube.com/watch?v=qJnS-EfIIIE
8: SRE, Other Frameworks, Trends  ‘A Look at ITIL4 & SRE’ with Jayne Groll of DevOps Institute  https://dev.tube/video/vFyPXIsUEhE

SRE Reports 
Report Name  Writers/Publishers  Link 
2019 SRE Report  Catchpoint  http://pages.catchpoint.com/
What is SRE?  Kurt Andersen & Craig Sebenik from O’Reilly Media  https://www.oreilly.com/library/view/what-is-sre/9781492054

SRE Articles 
Article Title & Author  Relevant Module  Link 
‘Which Factors Affect Software Projects Maintenance Cost More?’ by Sayed Mehdi Hejazi Dehaghani and Nafiseh  1: SRE Principles & Practices  https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3610582/
‘Measuring and Evaluating Service Level Objectives (SLOs)’ by Serhat Can  1: SRE Principles & Practices  https://medium.com/@serhatcan/measuring-and-evaluating-service-level-objectives-slos-84b0dc740a0a
‘Bloomberg Bets Big on SREs’ by Michael Rembetsy  1: SRE Principles & Practices  https://www.techatbloomberg.com/blog/bloomberg-bets-big-on-sres/
‘Site Reliability Engineering at Bloomberg’ by Stig Sorensen  1: SRE Principles & Practices  https://player.fm/series/devops-chat/site-reliability-engineering-sre-bloomber
‘What It Means To Be A Site Reliability Engineer’ by Molly Struve  1: SRE Principles & Practices  https://dev.to/molly_struve/what-it-means-to-be-a-site-reliability-engineer-32ki

‘Error Budgets – Practical Implementation’ by Yaroslav Molochko  2: SLOs & Error Budgets  https://www.slideshare.net/yaroslavmolochko/implementing-error-budgets-125400822
‘How to Avoid the 5 SRE Implementation Traps that Catch Even the Best Teams’ by Lyon Wong  2: SLOs & Error Budgets  https://thenewstack.io/how-to-avoid-the-5-sre-implementation-traps-that-catch-even-the-best-teams/
‘Site Reliability Engineering: DevOps 2.0’ by Saba Anees  2: SLOs & Error Budgets  https://www.appdynamics.com/blog/engineering/site-reliability-engineering-
‘Getting Started with Site Reliability Engineering’ by Jennifer Petoff  2: SLOs & Error Budgets  https://www.devops.talksplus.com/wp-content/themes/dotc/2019_Melbourne/presentations/Getting%20Started%2

‘Invent More, Toil Less’ by Betsy Beyer, Brendan Gleason, Dave O’Connor and Vivek  3: Reducing Toil  https://storage.googleapis.com/pub-tools-public-publication-data/pdf/45765.pdf
‘SRE Lessons: Continuously Optimize to Reduce Toil’ by Damon Edwards  3: Reducing Toil  https://www.rundeck.com/blog/sre-lessons-continuously-optimize-to-reduce-toil
‘Toil: Finally a Name For a Problem We've All Felt’ by Damon Edwards  3: Reducing Toil  https://www.rundeck.com/blog/toil-finally-a-name-for-a-problem

‘Site Reliability Engineering (SRE): A Simple Overview’ by Mac Slocum  3: Reducing Toil  https://www.oreilly.com/ideas/site-reliability-engineering-sre-a-simple-overview
‘What Is SRE?’ by Craig Sebenik & Kurt Andersen  3: Reducing Toil  https://www.oreilly.com/library/view/what-is-sre/9781492054429/
‘Is It Worth the Time?’ by xkcd  3: Reducing Toil  https://imgs.xkcd.com/comics/is_it_wo


‘An Engineer’s Guide To SLA, SLO, and SLI’ by Ram Iyengar  4: Monitoring & Service Level Indicators  https://plumbr.io/blog/monitoring/an-engineers-guide-to-sla-slo-and-sli
‘Service Level Indicators in Practice’ by Stephen Thorne  4: Monitoring & Service Level Indicators  https://medium.com/@jerub/service-level-indicators-in-practice-6a1125e24b
‘Stop Using Nagios (So It Can Die Peacefully)’ by Andy Sykes  4: Monitoring & Service Level Indicators  http://www.slideshare.net/superdupersheep/stop-using-nagios-so-it-can-die-peacefully
‘Why Does (My) Monitoring Suck?’ by Todd Palino  4: Monitoring & Service Level Indicators  https://www.usenix.org/conference/srecon19asia/presentation/palino-monit
‘Observability — A 3-Year Retrospective’ by Charity Majors  4: Monitoring & Service Level Indicators  https://thenewstack.io/observability-a-3-year-retrospective/
‘Monitoring and Observability — What’s the Difference and Why Does It Matter?’ by Peter  4: Monitoring & Service Level Indicators  https://thenewstack.io/monitoring-and-observability-whats-the-difference-and-why-does-it-matter/
‘3 Ways to Reduce Alert Noise in Monitoring’ by Christina  4: Monitoring & Service Level Indicators  https://www.metricly.com/3-ways-reduce-alert-noise/
‘Observability and Understanding the Operational Ramifications of a System’ by Charity Majors  4: Monitoring & Service Level Indicators  https://www.infoq.com/articles/charity-majors-observability-failure/
‘Run a Service Level Indicator (SLI) workshop’ by GDS  4: Monitoring & Service Level Indicators  https://gds-way.cloudapps.digital/standards/slis.html

‘The Evolution of Automation at Google’ by Niall Murphy  5: SRE Tools & Automation  https://landing.google.com/sre/sre-book/chapters/automation-at-google/
‘SRE at the Department for Work and Pensions’ by various  5: SRE Tools & Automation  https://dwpdigital.blog.gov.uk/category/site-reliability-engineering-sre/
‘Measuring and Evaluating Service Level Objectives (SLOs)’ by Serhat Can  5: SRE Tools & Automation  https://www.atlassian.com/blog/opsgenie/measuring-and-evaluating-service-level-objectives
‘Best NoSQL Databases 2019’  5: SRE Tools & Automation  https://www.improgrammer.net/most-popular-nosql-database/
‘On-Call Tools to Support a DevOps Culture’ by Dan  5: SRE Tools & Automation  https://victorops.com/blog/devops-on-call-tools-to-support-culture
‘Awesome Site Reliability Engineering Tools’ by Raghu  5: SRE Tools & Automation  https://github.com/squadcastHQ/awesome-sre-tools
‘Security & Compliance’ by Ansible  5: SRE Tools & Automation  https://www.ansible.com/use-cases/security-and-compliance
‘Secure Coding Best Practices’ by OWASP  5: SRE Tools & Automation  https://www.owasp.org/images/0/08/OWASP_SCP_Quick_Reference_Guide
‘Testing in Production, the safe way’ by Cindy Sridharan  5: SRE Tools & Automation  https://medium.com/@copyconstruct/testing-in-production-the-safe-way-18c
‘Amazon Andon Cord: What it is and how to react’ by Velentin Bayard  5: SRE Tools & Automation  https://blueboard.io/resources/amazon-andon-cord/
‘DevOps Tools Landscape’ by GitLab  5: SRE Tools & Automation  https://about.gitlab.com/devops-tools/

‘Measure Efficiency, Effectiveness, and Culture to Optimize DevOps Transformations’ by IT  6: Antifragility & Learning from Failure  http://devopsenterprise.io/media/DOES_forum_metrics_102015.pdf
‘Tracking Every Release’ by Mike Brittain  6: Antifragility & Learning from Failure  https://codeascraft.com/2010/12/08/track-every-release/
‘A recovery point objective (RPO)’ by Margaret Rouse  6: Antifragility & Learning from Failure  https://whatis.techtarget.com/definition/recovery-point-objective-RPO
‘The Learning Organization’ by Andrew Shafer  6: Antifragility & Learning from Failure  https://www.slideshare.net/littleidea/the-learning-organization-modev
‘The Three Ways: The Principles Underpinning DevOps’ by Gene Kim  6: Antifragility & Learning from Failure  https://itrevolution.com/the-three-ways-principles-underpinning-devops/
‘A Typology of Organizational Cultures’ by R Westrum  6: Antifragility & Learning from Failure  http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1765804/pdf/v013p0ii22.pdf
‘Do You Want Your Cloud Solutions to Succeed? Start with Embracing Failures!’ by Arman Kamran  6: Antifragility & Learning from Failure  https://medium.com/@armankamran/do-you-want-your-cloud-solutions-to-succeed-start-with-embracing-failures-8f5f40b57a64

‘The Cost of IT Downtime’ by Michael Copeland  7: Organizational Impact of SRE  https://www.the20.com/blog/the-cost-of-it-downtime/
‘How SRE teams are organized, and how to get started’ by Matt Brown  7: Organizational Impact of SRE  https://cloud.google.com/blog/products/devops-sre/how-sre-teams-are-organized-and-how-to-get-started
‘Kubernetes Up & Running’ by Brendan Burns, Joe Beda & Kelsey Hightower  7: Organizational Impact of SRE  https://clouddamcdnprodep.azureedge.net/gdc/gdckTlBtc/original
‘Blameless PostMortems and a Just Culture’ by John Allspaw  7: Organizational Impact of SRE  https://codeascraft.com/2012/05/22/blameless-postmortems/
‘The Prime Directive’ by Norm Kerth  7: Organizational Impact of SRE  https://retrospectivewiki.org/index.php?title=The_Prime_Directive
‘Creating Antifragile Systems: Site Reliability Engineering for the Enterprise’ by Contino  7: Organizational Impact of SRE  https://www.contino.io/files/Enterprise-Site-Reliability-Engineering-Contino.pdf
‘Scaling SRE Organizations: The journey from 1 to many teams’ by Gustavo Franco  7: Organizational Impact of SRE  https://www.usenix.org/sites/default/files/conference/protected-files/sre19amer_slides_franco.pdf

‘The Convergence of DevOps’ by John Willis  8: SRE, Other Frameworks, Trends  http://itrevolution.com/the-convergence-of-devops/
‘Site Reliability Engineer (SRE) Roles and Responsibilities’ by Dan Holloran  8: SRE, Other Frameworks, Trends  https://victorops.com/blog/site-reliability-engineer-sre-roles-and-responsibilities
‘How ITIL4 and SRE align with DevOps’ by Jayne Groll  8: SRE, Other Frameworks, Trends  https://techbeacon.com/enterprise-it/how-itil4-sre-align-devops
‘Future of Reliability Engineering’ by Michael Kehoe  8: SRE, Other Frameworks, Trends  https://michael-kehoe.io/tags/future-of-sre/
‘An Introduction to Database Reliability’ by Mackenzie Clark  8: SRE, Other Frameworks, Trends  https://softwareengineeringdaily.com/2018/10/16/an-introduction-to-databa
‘Stop the Arguments: ITIL v4 and SRE and DevOps All Are Transformation Aids’  8: SRE, Other Frameworks, Trends  https://devopsinstitute.com/2019/11/05/stop-the-arguments-itil-v4-and-sre-and-devops-all-are-transformation-a

Title  Link
Usenix  https://www.usenix.org  

Honeycomb  https://www.honeycomb.io/  

Player FM – DevOps Chat  https://player.fm/series/devops-chat  

SRE Weekly  https://sreweekly.com/ 

Netflix  https://github.com/Netflix  

Downdetector  https://downdetector.co.uk  

SRE Blogs  
Blog  Link 
AppDynamics Blog  https://www.appdynamics.com/blog 

Atlassian Blog  https://www.atlassian.com/blog  

Prometheus Blog  https://prometheus.io/blog/ 

Rundeck Blog  https://www.rundeck.com/blog  

Tech At Bloomberg  https://www.techatbloomberg.com/blog 

VictorOps Blog  https://victorops.com/blog 


Additional Videos of Interest  

Relevant Module  Title  Link 
2: SLOs & Error Budgets  ‘SLOs for Data-Intensive Services’ with Yoann Fouquet (23:47)  https://www.youtube.com/watch?v=ZdguHXglT8M&feature=youtu.be
2: SLOs & Error Budgets  ‘Latency SLOs Done Right’ with Heinrich Hartmann (27:12)  https://www.youtube.com/watch?v=ycsc2kCaJxM&feature=youtu.be
4: Monitoring & Service Level Indicators  ‘Building a Scalable Monitoring System’ with Molly Struve (26:48)  https://www.youtube.com/watch?v=vl1ecpFohZQ&feature=youtu.be

SRE Books 
Title  Author  Link 
Site Reliability Engineering  Betsy Beyer, Chris Jones, Jennifer Petoff and Niall Richard Murphy  https://landing.google.com/sre/sre-book/toc/index.html
The Site Reliability Workbook  Betsy Beyer, Niall Richard Murphy, David K. Rensin, Kent Kawahara and Stephen Thorne  https://landing.google.com/sre/workbook/toc/
Facts and Fallacies of Software Engineering  Robert L. Glass  https://www.amazon.com/Facts-Fallacies-Software-Engineering
Chaos Engineering  Ali Basiri, Nora Jones, Aaron Blohowiak, Lorin Hochstein, Casey Rosenthal  https://www.oreilly.com/library/view/chaos-engineering/9781491988459/

Case Stories Featured in the Course 
Company  Module  Link 
Accenture  3: Reducing Toil  https://techbeacon.com/devops/how-accenture-retrofitted-s
Bloomberg  1: SRE Principles & Practices
● https://player.fm/series/devops-chat/site-reliability-engineering-sre-bloomberg-w-stig-sorenson
● https://www.techatbloomberg.com/blog/bloomb
● https://www.ca.com/us/modern-software-factory/
Evernote  2: SLOs & Error Budgets  https://landing.google.com/sre/workbook/chapters/slo-engineering-case-studies/
Home Depot  2: SLOs & Error Budgets  https://landing.google.com/sre/workbook/chapters/slo-engineering-case-studies/
Netflix  6: Antifragility and Learning from Failure  https://github.com/Netflix/SimianArmy
Sage Group  7: Organizational Impact of SRE  https://www.meetup.com/DevOpsNorthEast/events/262263231/
Standard Chartered  5: SRE Tools & Automation  https://www.youtube.com/watch?v=d5IMvK0YHTg
Trivago  4: Monitoring & SLIs  https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=11&cad=rja&uact=8&ved=2ahUKEwj4m6HJ9qXj
VictorOps (Splunk)  8: SRE, Other Frameworks, Trends  https://victorops.com/blog/site-reliability-engineer-sre-roles-and-responsibilities

