Talk given on 2019-05-21 at KubeCon Barcelona: https://kccnceu19.sched.com/event/MPcM/kubernetes-failure-stories-and-how-to-crash-your-clusters-henning-jacobs-zalando-se
Bootstrapping a Kubernetes cluster is easy; rolling it out to nearly 200 engineering teams and operating it at scale is a challenge. In this talk, we present our approach to Kubernetes provisioning on AWS, operations, and developer experience for our growing Zalando developer base. We will walk you through our horror stories of operating 100+ clusters and share the insights we gained from incidents, failures, user reports, and general observations. Our failure stories are sourced from recent and past incidents, so the talk reflects our latest experiences.
Most of our learnings apply to other Kubernetes infrastructures (EKS, GKE, ...) as well. This talk strives to reduce the audience's unknown unknowns about running Kubernetes in production.
4. 4
ZALANDO AT A GLANCE
~ 5.4 billion EUR revenue 2018
> 250 million visits per month
> 15,000 employees in Europe
> 79% of visits via mobile devices
> 26 million active customers
> 300,000 product choices
~ 2,000 brands
17 countries
19. 19
INCIDENT #1: CONTRIBUTING FACTORS
• Shared Ingress (per cluster)
• High latency of an unrelated app (Solr) caused a high number of in-flight requests
• Skipper creates one goroutine per HTTP request; each goroutine costs 2 kB of memory plus the http.Request
• Memory limit was fixed at 500Mi (4x regular usage)
Fix for the memory issue in Skipper:
https://opensource.zalando.com/skipper/operation/operation/#scheduler
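The arithmetic behind these factors can be sketched roughly as follows. The 2 kB goroutine cost and the 500Mi limit come from the slide; the baseline memory and the size of one http.Request are hypothetical values chosen only for illustration, since the slide does not quantify them.

```python
GOROUTINE_BYTES = 2 * 1024          # from the slide: about 2 kB per goroutine
MEMORY_LIMIT_BYTES = 500 * 1024**2  # from the slide: 500Mi container limit


def max_in_flight(limit_bytes: int, baseline_bytes: int, request_bytes: int) -> int:
    """Rough upper bound on concurrent requests before the limit is hit."""
    per_request = GOROUTINE_BYTES + request_bytes
    headroom = max(limit_bytes - baseline_bytes, 0)
    return headroom // per_request


if __name__ == "__main__":
    # Assumptions: 125Mi baseline (a quarter of the limit, matching
    # "4x regular usage") and a hypothetical 8 KiB per http.Request.
    print(max_in_flight(MEMORY_LIMIT_BYTES, 125 * 1024**2, 8 * 1024))
```

The point of the sketch is that the bound is finite: once a slow backend lets in-flight requests pile up past it, a fixed memory limit gets the shared Ingress killed for every application behind it.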
29. 29
INCIDENT #2: LESSONS LEARNED
• Fix Ingress to stay “healthy” during API server problems
• Fix Ingress to retain last known set of routes
• Use quota for number of pods
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
spec:
  hard:
    pods: "1500"
NOTE: we dropped quotas recently
github.com/zalando-incubator/kubernetes-on-aws/pull/2059
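The first two lessons amount to "fail static": when the API server cannot be reached, keep serving the routes from the last successful sync instead of dropping them. A minimal sketch of that idea follows. `RouteTable` and `fetch_routes` are hypothetical names, and this is not a description of how Skipper implements it.

```python
from typing import Callable, Dict


class RouteTable:
    """Keeps the last successfully fetched routes (hypothetical helper)."""

    def __init__(self) -> None:
        self.routes: Dict[str, str] = {}
        self.stale = False  # True while serving routes from an older sync

    def refresh(self, fetch_routes: Callable[[], Dict[str, str]]) -> None:
        try:
            new_routes = fetch_routes()
        except Exception:
            # API server problem: keep the last known routes and stay healthy.
            self.stale = True
            return
        self.routes = dict(new_routes)
        self.stale = False
```

Exposing `stale` separately from health lets monitoring notice the degraded state without the load balancer taking the Ingress out of rotation.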
37. 37
INCIDENT #3: CONTRIBUTING FACTORS
• HTTP retries
• No DNS caching
• Kubernetes ndots:5 problem
• Short maximum lifetime of HTTP connections
• Fixed memory limit for CoreDNS
• Monitoring affected by DNS outage
github.com/zalando-incubator/kubernetes-on-aws/blob/dev/docs/postmortems/jan-2019-dns-outage.md
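To illustrate the ndots:5 factor, the sketch below lists the names a glibc-style resolver tries for a lookup, in order, until one succeeds. The search list is a typical one for a pod in namespace "default" and is an assumption; real pods may inherit further search domains from the node.

```python
from typing import List


def candidate_queries(name: str, search: List[str], ndots: int = 5) -> List[str]:
    """Return the fully qualified names a resolver tries, in order."""
    if name.endswith("."):
        return [name]  # already fully qualified: no search-list expansion
    absolute = name + "."
    expanded = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        return [absolute] + expanded
    return expanded + [absolute]


# Assumed search list for a pod in namespace "default".
SEARCH = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]

if __name__ == "__main__":
    for query in candidate_queries("api.example.org", SEARCH):
        print(query)
```

An external name with fewer than five dots is first tried against every search domain, so one logical lookup turns into several DNS queries (typically doubled again for A and AAAA records). Combined with retries and no caching, that multiplies the load on CoreDNS.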
47. 47
INCIDENT #5: CONNECTION ISSUES
...
Kubernetes worker and master nodes sporadically fail to connect to etcd
causing timeouts in the APIserver and disconnects in the pod network.
...
[Diagram labels: Master Node, API Server, etcd, etcd-member]
48. 48
INCIDENT #5: STOP THE BLEEDING
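#!/bin/bash
# Stopgap watchdog: probe the local API server once a minute and
# restart etcd-member when the probe fails or times out.
while true; do
    echo "sleep for 60 seconds"
    sleep 60
    # timeout kills curl if it has not returned within 5 seconds
    if timeout 5 curl http://localhost:8080/api/v1/nodes > /dev/null; then
        echo "all fine, no need to restart etcd member"
    else
        echo "restarting etcd-member"
        systemctl restart etcd-member
    fi
done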
49. 49
INCIDENT #5: CONFIRMATION FROM AWS
[...]
We can’t go into the details [...] that resulted [in] the networking problems during
the “non-intrusive maintenance”, as it relates to internal workings of EC2.
We can confirm this only affected the T2 instance types, ...
[...]
We don’t explicitly recommend against running production services on T2
[...]
50. 50
INCIDENT #5: LESSONS LEARNED
• It's never the AWS infrastructure until it is
• Treat t2 instances with care
• Kubernetes components are not necessarily "cloud native"
Cloud Native? Declarative, dynamic, resilient, and scalable
59. 59
E2E TESTS ON EVERY PR
github.com/zalando-incubator/kubernetes-on-aws
60. 60
RUNNING E2E TESTS (BEFORE)
[Diagram: one cluster (control plane, nodes), labeled branch: dev]
Create Cluster → Run e2e tests → Delete Cluster
Testing dev to alpha upgrade
61. 61
RUNNING E2E TESTS (NOW)
[Diagram: two clusters (control plane, nodes), labeled branch: alpha (base) and branch: dev (head)]
Create Cluster → Update Cluster → Run e2e tests → Delete Cluster
Testing dev to alpha upgrade
62. 62
INCIDENT #6: LESSONS LEARNED
• Automated e2e tests are pretty good, but not enough
• Test the diff/migration automatically:
  • Bootstrap new cluster with previous configuration
  • Apply new configuration
  • Run end-to-end & conformance tests
github.com/zalando-incubator/kubernetes-on-aws/tree/dev/test/e2e
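The migration test described above could be orchestrated along these lines. This is only a sketch: the four callables are hypothetical stand-ins for the real provisioning and test tooling, and the branch names are taken from the "RUNNING E2E TESTS (NOW)" slide.

```python
from typing import Callable


def run_upgrade_e2e(
    create_cluster: Callable[[str], None],
    update_cluster: Callable[[str], None],
    run_tests: Callable[[], bool],
    delete_cluster: Callable[[], None],
    base: str = "alpha",
    head: str = "dev",
) -> bool:
    """Create from base, update to head, test, and always clean up."""
    create_cluster(base)          # bootstrap with the previous configuration
    try:
        update_cluster(head)      # apply the new configuration (the migration)
        return run_tests()        # end-to-end and conformance tests
    finally:
        delete_cluster()          # tear down even if the update or tests fail
```

The difference from the earlier setup is the `update_cluster` step: tests run against a cluster that went through the same transition production clusters will go through, not against a freshly created one.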
69. 69
INCIDENT #8: IMPACT
Error during Pod creation:
MountVolume.SetUp failed for volume "outfit-delivery-api-credentials" : secrets "outfit-delivery-api-credentials" not found
⇒ All new Kubernetes deployments fail
70. 70
INCIDENT #8: CREDENTIALS QUEUE
17:30:07 | [pool-6-thread-1 ] | Current queue size: 7115, current number of active workers: 20
17:31:07 | [pool-6-thread-1 ] | Current queue size: 7505, current number of active workers: 20
17:32:07 | [pool-6-thread-1 ] | Current queue size: 7886, current number of active workers: 20
..
17:37:07 | [pool-6-thread-1 ] | Current queue size: 9686, current number of active workers: 20
..
17:44:07 | [pool-6-thread-1 ] | Current queue size: 11976, current number of active workers: 20
..
19:16:07 | [pool-6-thread-1 ] | Current queue size: 58381, current number of active workers: 20
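The growth rate of the backlog can be read straight from these log lines. The sketch below parses lines in the format shown above and computes the average growth per minute; it assumes all timestamps fall on the same day, and the embedded `LOG` is a subset of the excerpt.

```python
import re
from datetime import datetime
from typing import List, Tuple

LINE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}) \|.*Current queue size: (\d+)")


def parse_samples(log: str) -> List[Tuple[datetime, int]]:
    """Extract (timestamp, queue size) pairs; other lines are skipped."""
    samples = []
    for line in log.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group(1), "%H:%M:%S")
            samples.append((stamp, int(match.group(2))))
    return samples


def growth_per_minute(samples: List[Tuple[datetime, int]]) -> float:
    """Average queue growth per minute between first and last sample."""
    (t0, q0), (t1, q1) = samples[0], samples[-1]
    minutes = (t1 - t0).total_seconds() / 60
    return (q1 - q0) / minutes


LOG = """\
17:30:07 | [pool-6-thread-1 ] | Current queue size: 7115, current number of active workers: 20
17:31:07 | [pool-6-thread-1 ] | Current queue size: 7505, current number of active workers: 20
..
19:16:07 | [pool-6-thread-1 ] | Current queue size: 58381, current number of active workers: 20
"""

if __name__ == "__main__":
    samples = parse_samples(LOG)
    print(growth_per_minute(samples[:2]))      # first minute of the excerpt
    print(round(growth_per_minute(samples)))   # whole excerpt
```

With the worker count pinned at 20, the queue grows by several hundred entries per minute throughout the excerpt, which is the signature of a consumer that is under-provisioned rather than briefly overloaded.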
72. 72
INCIDENT #8: WHAT HAPPENED
Scaled down IAM provider to reduce Slack
+ Number of deployments increased
⇒ The provider process could not handle credentials fast enough
73. 73
CPU/memory requests "block" resources on nodes.
Difference between actual usage and requests → Slack
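As a worked example of that definition, slack per resource is simply requests minus actual usage. The numbers below are hypothetical and only illustrate the calculation.

```python
from typing import Dict


def slack(requests: Dict[str, float], usage: Dict[str, float]) -> Dict[str, float]:
    """Per-resource slack: what is requested (blocked) but not used."""
    return {name: requests[name] - usage.get(name, 0.0) for name in requests}


if __name__ == "__main__":
    # Hypothetical pod: requests 2 CPU cores and 4 GiB of memory,
    # but actually uses 0.5 cores and 1.5 GiB.
    print(slack({"cpu": 2.0, "memory_gib": 4.0}, {"cpu": 0.5, "memory_gib": 1.5}))
```

Reducing slack saves money, but as this incident shows, the removed headroom is also what absorbs a sudden increase in load.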
[Diagram "SLACK": a Node's CPU and Memory, with the unused part labeled "Slack"]
75. 75
A MILLION WAYS TO CRASH YOUR CLUSTER?
• Switch to latest Docker to fix issues with Docker daemon freezing
• Redesign of DNS setup due to high DNS latencies (5s), switch from kube-dns to node-local dnsmasq+CoreDNS
• Disabling CPU throttling (CFS quota) to avoid latency issues
• Quick fix for timeouts using etcd-proxy: client-go still seems to have issues with timeouts
• 502's during cluster updates: race condition during network setup
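Why CFS quota throttling shows up as latency can be seen in a simplified model. A CPU limit is enforced as a quota of CPU time per scheduling period (100 ms is the Linux default period). The sketch below assumes a single thread that starts working at the beginning of a period; real workloads are multi-threaded and messier, so treat the numbers as illustrative only.

```python
def wall_time_ms(cpu_ms: float, limit_cores: float, period_ms: float = 100.0) -> float:
    """Wall-clock time for one thread needing cpu_ms of CPU under a CFS quota."""
    if cpu_ms <= 0:
        return 0.0
    quota_ms = limit_cores * period_ms      # CPU time allowed per period
    if quota_ms >= period_ms:
        return float(cpu_ms)                # a single thread never exhausts the quota
    full_periods, remainder = divmod(cpu_ms, quota_ms)
    if remainder == 0:
        # The last slice ends exactly when its quota runs out.
        return (full_periods - 1) * period_ms + quota_ms
    # Each exhausted quota means waiting for the next period to start.
    return full_periods * period_ms + remainder


if __name__ == "__main__":
    print(wall_time_ms(80, 0.5))   # limit of 500m
    print(wall_time_ms(80, 1.0))   # limit of one full core
```

In this model, a request that needs 80 ms of CPU under a 500m limit is paused once its 50 ms quota is used up and only resumes in the next period, so its latency grows even if the node has idle CPU.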
76. 76
MORE TOPICS
• Graceful Pod shutdown and race conditions (endpoints, Ingress)
• Incompatible Kubernetes changes
• CoreOS Container Linux "stable" won't boot
• Kubernetes EBS volume handling
• Docker
77. 77
RACE CONDITIONS..
github.com/zalando-incubator/kubernetes-on-aws
78. 78
TIMEOUTS TO API SERVER..
github.com/zalando-incubator/kubernetes-on-aws
80. 80
WILL MANAGED K8S SAVE US?
GKE: monthly uptime percentage at 99.95% for regional clusters
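To put the percentage in perspective, an uptime figure translates into a downtime budget as sketched below. The 30-day month is an assumption made for the calculation; how a provider actually measures uptime is defined in its SLA terms.

```python
def allowed_downtime_minutes(uptime_percent: float, days: int = 30) -> float:
    """Downtime budget in minutes for a given uptime percentage."""
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - uptime_percent) / 100.0


if __name__ == "__main__":
    print(round(allowed_downtime_minutes(99.95), 1))
```

For 99.95% over 30 days this comes to roughly 21.6 minutes per month.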
81. 81
WILL MANAGED K8S SAVE US?
NO (not really)
e.g. AWS EKS uptime SLA is only for API server
82. 82
PRODUCTION PROOFING AWS EKS
List of things you might want to look at for EKS in production
https://medium.com/glia-tech/productionproofing-eks-ed52951ffd6c
83. 83
AWS EKS IN PRODUCTION
https://kubedex.com/90-days-of-aws-eks-in-production/
87. 87
KUBERNETES FAILURE STORIES
A compiled list of links to public failure stories related to Kubernetes.
k8s.af
We need more failure talks!
Istio? Anyone?