Site Reliability Engineering: How Google Runs Production Systems
April 2016
Publisher:
  • O'Reilly Media, Inc.
ISBN: 978-1-4919-2912-4
Published: 16 April 2016
Pages: 552
Abstract

The overwhelming majority of a software system's lifespan is spent in use, not in design or implementation. So why does conventional wisdom insist that software engineers focus primarily on the design and development of large-scale computing systems? In this collection of essays and articles, key members of Google's Site Reliability Team explain how and why their commitment to the entire lifecycle has enabled the company to successfully build, deploy, monitor, and maintain some of the largest software systems in the world. You'll learn the principles and practices that enable Google engineers to make systems more scalable, reliable, and efficient: lessons directly applicable to your organization.

This book is divided into four sections:

  • Introduction: Learn what site reliability engineering is and why it differs from conventional IT industry practices
  • Principles: Examine the patterns, behaviors, and areas of concern that influence the work of a site reliability engineer (SRE)
  • Practices: Understand the theory and practice of an SRE's day-to-day work: building and operating large distributed computing systems
  • Management: Explore Google's best practices for training, communication, and meetings that your organization can use

Cited By

  1. Amaro R, Pereira R and Mira da Silva M (2024). DevOps Metrics and KPIs: A Multivocal Literature Review, ACM Computing Surveys, 56:9, (1-41), Online publication date: 31-Oct-2024.
  2. Liu Y, Li H, Cheng Y, Ray S, Huang Y, Zhang Q, Du K, Yao J, Lu S, Ananthanarayanan G, Maire M, Hoffmann H, Holtzman A and Jiang J CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving Proceedings of the ACM SIGCOMM 2024 Conference, (38-56)
  3. Pápay L, Pustelnik J, Rzadca K, Strack B, Stradomski P, Wołowiec B and Zasadzinski M An exabyte a day: throughput-oriented, large scale, managed data transfers with Effingo Proceedings of the ACM SIGCOMM 2024 Conference, (970-982)
  4. Hong A, Malinovsky P and Damodaran S (2023). Towards Attack Detection in Multimodal Cyber-Physical Systems with Sticky HDP-HMM based Time Series Analysis, Digital Threats: Research and Practice, 5:1, (1-21), Online publication date: 31-Mar-2024.
  5. Zhang L and Shi Y (2024). Sparse and semi-attention guided faults diagnosis approach for distributed online services, Applied Soft Computing, 148:C, Online publication date: 1-Nov-2023.
  6. Seemakhupt K, Stephens B, Khan S, Liu S, Wassel H, Yeganeh S, Snoeren A, Krishnamurthy A, Culler D and Levy H A Cloud-Scale Characterization of Remote Procedure Calls Proceedings of the 29th Symposium on Operating Systems Principles, (498-514)
  7. Zhu J, Huang N, Wang J and Qin X (2023). Availability Model for Data Center Networks With Dynamic Migration and Multiple Traffic Flows, IEEE Transactions on Network and Service Management, 20:3, (2975-2989), Online publication date: 1-Sep-2023.
  8. Patel P, Gregersen T and Anderson T An Agile Pathway Towards Carbon-aware Clouds Proceedings of the 2nd Workshop on Sustainable Computer Systems, (1-8)
  9. Amaro R, Pereira R and da Silva M (2023). Capabilities and metrics in DevOps, Information and Management, 60:5, Online publication date: 1-Jul-2023.
  10. Vandersanden M A Holistic Approach to Understand HTTP Adaptive Streaming Proceedings of the 14th Conference on ACM Multimedia Systems, (333-337)
  11. Chakraborty S, Garg S, Agarwal S, Chauhan A and Saini S CausIL: Causal Graph for Instance Level Microservice Data Proceedings of the ACM Web Conference 2023, (2905-2915)
  12. Dias A, Correia L and Malheiros N (2021). A Systematic Literature Review on Virtual Machine Consolidation, ACM Computing Surveys, 54:8, (1-38), Online publication date: 30-Nov-2022.
  13. Ikram A, Chakraborty S, Mitra S, Saini S, Bagchi S and Kocaoglu M Root cause analysis of failures in microservices through causal discovery Proceedings of the 36th International Conference on Neural Information Processing Systems, (31158-31170)
  14. Kaur M, Parkin S, Janssen M and Fiebig T (2022). "I needed to solve their overwhelmness": How System Administration Work was Affected by COVID-19, Proceedings of the ACM on Human-Computer Interaction, 6:CSCW2, (1-30), Online publication date: 7-Nov-2022.
  15. Qiu H, Mao W, Patke A, Wang C, Franke H, Kalbarczyk Z, Başar T and Iyer R SIMPPO Proceedings of the 13th Symposium on Cloud Computing, (306-322)
  16. Chen K, Faddi Z, Nagaraju V and Fiondella L Quantifying the Impact of Staged Rollout Policies on Software Process and Product Metrics 2022 Annual Reliability and Maintainability Symposium (RAMS), (1-6)
  17. Jalodia N, Taneja M, Davy A and Dezfouli B A Residual LSTM based Multi-Label Classification Framework for Proactive SLA Management in a Latency Critical NFV Application Use-Case 2022 IEEE 19th Annual Consumer Communications & Networking Conference (CCNC), (782-789)
  18. Hole K (2022). Tutorial on systems with antifragility to downtime, Computing, 104:1, (73-93), Online publication date: 1-Jan-2022.
  19. Leite L, Pinto G, Kon F and Meirelles P (2021). The organization of software teams in the quest for continuous delivery, Information and Software Technology, 139:C, Online publication date: 1-Nov-2021.
  20. Nokleberg C and Hawkes B (2021). Application frameworks, Communications of the ACM, 64:7, (42-49), Online publication date: 1-Jul-2021.
  21. Bronson N, Aghayev A, Charapko A and Zhu T Metastable failures in distributed systems Proceedings of the Workshop on Hot Topics in Operating Systems, (221-227)
  22. Alves I and Rocha C Qualifying software engineers undergraduates in DevOps - challenges of introducing technical and non-technical concepts in a project-oriented course Proceedings of the 43rd International Conference on Software Engineering: Joint Track on Software Engineering Education and Training, (144-153)
  23. Pereira C A functional paradigm for capacity planning of cloud computing workloads Proceedings of the 43rd International Conference on Software Engineering: Companion Proceedings, (281-283)
  24. Pope M and Sillito J Quartermaster Proceedings of the 43rd International Conference on Software Engineering: Companion Proceedings, (57-60)
  25. Aggarwal P, Gupta A, Mohapatra P, Nagar S, Mandal A, Wang Q and Paradkar A Localization of Operational Faults in Cloud Applications by Mining Causal Dependencies in Logs Using Golden Signals Service-Oriented Computing – ICSOC 2020 Workshops, (137-149)
  26. Ngo K, Sen S and Lloyd W Tolerating slowdowns in replicated state machines using copilots Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation, (583-598)
  27. Cho I, Saeed A, Fried J, Park S, Alizadeh M and Belay A Overload control for µs-scale RPCs with breakwater Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation, (299-314)
  28. Leite L, Kon F, Pinto G and Meirelles P Platform Teams Proceedings of the IEEE/ACM 42nd International Conference on Software Engineering Workshops, (505-511)
  29. Griebler D, Vogel A, De Sensi D, Danelutto M and Fernandes L (2019). Simplifying and implementing service level objectives for stream parallelism, The Journal of Supercomputing, 76:6, (4603-4628), Online publication date: 1-Jun-2020.
  30. Maas M, Andersen D, Isard M, Javanmard M, McKinley K and Raffel C Learning-based Memory Allocation for C++ Server Workloads Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, (541-556)
  31. Hauer T, Hoffmann P, Lunney J, Ardelean D and Diwan A Meaningful availability Proceedings of the 17th Usenix Conference on Networked Systems Design and Implementation, (545-558)
  32. Niedermaier S, Koetter F, Freymann A and Wagner S On Observability and Monitoring of Distributed Systems – An Industry Interview Study Service-Oriented Computing, (36-52)
  33. Wirfs-Brock R and Hvatum L Who will read my patterns? Proceedings of the 26th Conference on Pattern Languages of Programs, (1-21)
  34. Gamez-Diaz A, Fernandez P, Ruiz-Cortés A, Molina P, Kolekar N, Bhogill P, Mohaan M and Méndez F The role of limitations and SLAs in the API industry Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, (1006-1014)
  35. Masson C, Rim J and Lee H (2019). DDSketch, Proceedings of the VLDB Endowment, 12:12, (2195-2205), Online publication date: 1-Aug-2019.
  36. Wiedemann A, Wiesche M and Krcmar H Integrating Development and Operations in Cross-Functional Teams - Toward a DevOps Competency Model Proceedings of the 2019 on Computers and People Research Conference, (14-19)
  37. Lou C, Huang P and Smith S Comprehensive and Efficient Runtime Checking in System Software through Watchdogs Proceedings of the Workshop on Hot Topics in Operating Systems, (51-57)
  38. Lagar-Cavilla A, Ahn J, Souhlal S, Agarwal N, Burny R, Butt S, Chang J, Chaugule A, Deng N, Shahid J, Thelen G, Yurtsever K, Zhao Y and Ranganathan P Software-Defined Far Memory in Warehouse-Scale Computers Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, (317-330)
  39. Sloss B, Nukala S and Rau V (2019). Metrics that matter, Communications of the ACM, 62:4, (88-88), Online publication date: 20-Mar-2019.
  40. Sloss B, Nukala S and Rau V (2018). Metrics That Matter, Queue, 16:6, (86-105), Online publication date: 1-Dec-2018.
  41. Nukala S and Rau V (2018). Why SRE documents matter, Communications of the ACM, 61:12, (45-51), Online publication date: 20-Nov-2018.
  42. Andreadis G, Versluis L, Mastenbroek F and Iosup A A reference architecture for datacenter scheduling Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, (1-15)
  44. Ghirotti S, Reilly T and Rentz A (2018). Tracking and controlling microservice dependencies, Communications of the ACM, 61:11, (98-104), Online publication date: 26-Oct-2018.
  45. Veeraraghavan K, Meza J, Michelson S, Panneerselvam S, Gyori A, Chou D, Margulis S, Obenshain D, Padmanabha S, Shah A, Song Y and Xu T Maelstrom Proceedings of the 13th USENIX conference on Operating Systems Design and Implementation, (373-389)
  46. Weichbrodt L Measuring operational quality of recommendations Proceedings of the 12th ACM Conference on Recommender Systems, (485-485)
  47. Griebler D, De Sensi D, Vogel A, Danelutto M and Fernandes L Service Level Objectives via C++11 Attributes Euro-Par 2018: Parallel Processing Workshops, (745-756)
  48. Nukala S and Rau V (2018). Why SRE Documents Matter, Queue, 16:4, (66-91), Online publication date: 1-Aug-2018.
  49. Esparrachiari S, Reilly T and Rentz A (2018). Tracking and Controlling Microservice Dependencies, Queue, 16:4, (44-65), Online publication date: 1-Aug-2018.
  50. Gan E, Ding J, Tai K, Sharan V and Bailis P (2018). Moment-based quantile sketches for efficient high cardinality aggregation queries, Proceedings of the VLDB Endowment, 11:11, (1647-1660), Online publication date: 1-Jul-2018.
  51. Mekuria R, McGrath M, Riccobene V, Bayon-Molino V, Tselios C, Thomson J and Dobrodub A Automated profiling of virtualized media processing functions using telemetry and machine learning Proceedings of the 9th ACM Multimedia Systems Conference, (150-161)
  52. Zhang Q, Yu G, Guo C, Dang Y, Swanson N, Yang X, Yao R, Chintalapati M, Krishnamurthy A and Anderson T Deepview Proceedings of the 15th USENIX Conference on Networked Systems Design and Implementation, (519-532)
  53. Alvaro P and Tymon S (2017). Abstracting the geniuses away from failure testing, Communications of the ACM, 61:1, (54-61), Online publication date: 27-Dec-2017.
  54. Treynor B, Dahlin M, Rau V and Beyer B (2017). The calculus of service availability, Communications of the ACM, 60:9, (42-47), Online publication date: 23-Aug-2017.
  55. Rong K and Bailis P (2017). ASAP, Proceedings of the VLDB Endowment, 10:11, (1358-1369), Online publication date: 1-Aug-2017.
  56. Sloss B, Dahlin M, Rau V and Beyer B (2017). The Calculus of Service Availability, Queue, 15:2, (49-67), Online publication date: 1-Apr-2017.
Contributors
  • Google LLC
