- research-article, February 2024
Developer Ecosystems for Software Safety: Continuous assurance at scale
How to design and implement information systems so that they are safe and secure is a complex topic. Both high-level design principles and implementation guidance for software safety and security are well established and broadly accepted. For example, ...
- research-article, May 2023
Beyond the Repository: Best practices for open source ecosystems researchers
Much of the existing research about open source elects to study software repositories instead of ecosystems. An open source repository most often refers to the artifacts recorded in a version control system and occasionally includes interactions around ...
- research-article, December 2022
Reinventing Backend Subsetting at Google: Designing an algorithm with reduced connection churn that could replace deterministic subsetting
Backend subsetting is useful for reducing costs and may even be necessary for operating within the system limits. For more than a decade, Google used deterministic subsetting as its default backend subsetting algorithm, but although this algorithm ...
- research-article, March 2022
Distributed Latency Profiling through Critical Path Tracing: CPT can provide actionable and precise latency analysis.
Low latency is an important feature for many Google applications such as Search, and latency-analysis tools play a critical role in sustaining low latency at scale. For complex distributed systems that include services that constantly evolve in ...
- research-article, November 2021
Federated Learning and Privacy: Building privacy-preserving systems for machine learning and data science on decentralized data
Centralized data collection can expose individuals to privacy risks and organizations to legal risks if data is not properly managed. Federated learning is a machine learning setting where multiple entities collaborate in solving a machine learning ...
- research-article, July 2021
Digging into Big Provenance (with SPADE): A user interface for querying provenance
Several interfaces exist for querying provenance. Many are not flexible in allowing users to select a database type of their choice. Some provide query functionality in a data model that is different from the graph-oriented one that is natural for ...
- research-article, March 2021
WebRTC - Realtime Communication for the Open Web Platform: What was once a way to bring audio and video to the web has expanded into more use cases than we could ever imagine.
In this time of pandemic, the world has turned to Internet-based RTC (realtime communication) as never before. The number of RTC products has, over the past decade, exploded in large part because of cheaper high-speed network access and more powerful ...
- research-article, January 2021
Best Practice: Application Frameworks: While powerful, frameworks are not for everyone.
While frameworks can be a powerful tool, they have some disadvantages and may not make sense for all organizations. Framework maintainers need to provide standardization and well-defined behavior while not being overly prescriptive. When frameworks ...
- case-study, November 2020
Differential Privacy: The Pursuit of Protections by Default: A discussion with Miguel Guevara, Damien Desfontaines, Jim Waldo, and Terry Coatta
First formalized in 2006, differential privacy is an approach based on a mathematically rigorous definition of privacy that allows formalization and proof of the guarantees against re-identification offered by a system. While differential privacy has ...
- research-article, June 2020
Debugging Incidents in Google’s Distributed Systems: How experts debug production issues in complex distributed systems
This article covers the outcomes of research performed in 2019 on how engineers at Google debug production issues, including the types of tools, high-level strategies, and low-level tasks that engineers use in varying combinations to debug effectively. ...
- case-study, March 2020
To Catch a Failure: The Record-and-Replay Approach to Debugging: A discussion with Robert O’Callahan, Kyle Huey, Devon O’Dell, and Terry Coatta
When work began at Mozilla on the record-and-replay debugging tool called rr, the goal was to produce a practical, cost-effective, resource-efficient means for capturing low-frequency nondeterministic test failures in the Firefox browser. Much of the ...
- research-article, February 2018
Canary Analysis Service: Automated canarying quickens development, improves production safety, and helps prevent outages.
It is unreasonable to expect engineers working on product development or reliability to have statistical knowledge; removing this hurdle led to widespread CAS adoption. CAS has proven useful even for basic cases that don’t need configuration, and has ...
- research-article, April 2017
The Calculus of Service Availability: You’re only as available as the sum of your dependencies.
Most services offered by Google aim to offer 99.99 percent (sometimes referred to as the "four 9s") availability to users. Some services contractually commit to a lower figure externally but set a 99.99 percent target internally. This more stringent ...
- research-article, June 2016
Idle-Time Garbage-Collection Scheduling: Taking advantage of idleness to reduce dropped frames and memory consumption
Google’s Chrome web browser strives to deliver a smooth user experience. An animation will update the screen at 60 FPS (frames per second), giving Chrome around 16.6 milliseconds to perform the update. Within these 16.6 ms, all input events have to be ...
- research-article, January 2016
Borg, Omega, and Kubernetes: Lessons learned from three container-management systems over a decade
Though widespread interest in software containers is a relatively recent phenomenon, at Google we have been managing Linux containers at scale for more than ten years and built three different container-management systems in that time. Each system was ...