HENNING JACOBS
@try_except_
Kubernetes
Failure Stories
Kubernetes Failure Stories - KubeCon Europe Barcelona
4
ZALANDO AT A GLANCE
~ 5.4 billion EUR revenue 2018
> 250 million visits per month
> 15.000 employees in Europe
> 79% of visits via mobile devices
> 26 million active customers
> 300.000 product choices
~ 2.000 brands
17 countries
5
SCALE
118 Clusters
380 Accounts
6
DEVELOPERS USING KUBERNETES
7
47+ cluster components
INCIDENTS ARE FINE
INCIDENT
#1
10
INCIDENT #1: CUSTOMER IMPACT
11
INCIDENT #1: CUSTOMER IMPACT
12
INCIDENT #1: INGRESS ERRORS
13
INCIDENT #1: AWS ALB 502
github.com/zalando/riptide
14
INCIDENT #1: AWS ALB 502
github.com/zalando/riptide
502 Bad Gateway
Server: awselb/2.0
...
15
INCIDENT #1: ALB HEALTHY HOST COUNT
[Graphs: ALB healthy host count drops from 3 healthy hosts to zero healthy hosts; 2xx requests]
16
LIFE OF A REQUEST (INGRESS)
[Diagram: request path ALB (TLS) → Skipper (HTTP) → MyApp pods on two nodes, crossing the EC2 network and the K8s network]
17
INCIDENT #1: SKIPPER MEMORY USAGE
[Graph: Skipper memory usage vs. memory limit]
18
INCIDENT #1: SKIPPER OOM
[Diagram: same ingress path ALB → Skipper → MyApp pods; Skipper pods hit the memory limit and get OOMKilled]
19
INCIDENT #1: CONTRIBUTING FACTORS
• Shared Ingress (per cluster)
• High latency of an unrelated app (Solr) caused a high number of in-flight requests
• Skipper creates a goroutine per HTTP request; each goroutine costs 2kB of memory plus an http.Request
• Memory limit was fixed at 500Mi (4x regular usage), see the resources sketch below
Fix for the memory issue in Skipper:
https://opensource.zalando.com/skipper/operation/operation/#scheduler
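Not from the original slides, but as a hedged illustration of the last point: give the ingress proxy enough memory headroom in its Deployment instead of a tight fixed limit. All names, the image reference and the values below are assumptions, not Zalando's actual configuration.

# Hypothetical Skipper ingress Deployment excerpt (illustrative only)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: skipper-ingress
  namespace: kube-system
spec:
  replicas: 3
  selector:
    matchLabels:
      application: skipper-ingress
  template:
    metadata:
      labels:
        application: skipper-ingress
    spec:
      containers:
        - name: skipper
          image: registry.opensource.zalan.do/teapot/skipper:latest   # image reference illustrative
          resources:
            requests:
              cpu: "1"
              memory: 1Gi
            limits:
              memory: 2Gi   # headroom well above regular usage; no CPU limit to avoid throttling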
INCIDENT
#2
21
INCIDENT #2: CUSTOMER IMPACT
22
INCIDENT #2: IAM RETURNING 404
23
INCIDENT #2: NUMBER OF PODS
24
LIFE OF A REQUEST (INGRESS)
[Diagram (recap): request path ALB (TLS) → Skipper (HTTP) → MyApp pods on two nodes, crossing the EC2 network and the K8s network]
25
ROUTES FROM API SERVER
[Diagram: Skipper pods load their routes from the API server; ALB → Skipper → MyApp pods]
26
API SERVER DOWN
[Diagram: API server OOMKilled and down; Skipper pods lose their source of routes]
27
INCIDENT #2: INNOCENT MANIFEST
apiVersion: batch/v2alpha1
kind: CronJob
metadata:
  name: "foobar"
spec:
  schedule: "*/15 9-19 * * Mon-Fri"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          concurrencyPolicy: Forbid
          successfulJobsHistoryLimit: 1
          failedJobsHistoryLimit: 1
          containers:
          ...
28
INCIDENT #2: FIXED CRON JOB
apiVersion: batch/v2alpha1
kind: CronJob
metadata:
  name: "foobar"
spec:
  schedule: "7 8-18 * * Mon-Fri"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      activeDeadlineSeconds: 120
      template:
        spec:
          restartPolicy: Never
          containers:
29
INCIDENT #2: LESSONS LEARNED
• Fix Ingress to stay “healthy” during API server problems
• Fix Ingress to retain last known set of routes
• Use quota for number of pods
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
spec:
  hard:
    pods: "1500"
NOTE: we dropped quotas recently: github.com/zalando-incubator/kubernetes-on-aws/pull/2059
INCIDENT
#3
31
INCIDENT #3: INGRESS ERRORS
32
INCIDENT #3: COREDNS OOMKILL
coredns invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=994
Memory cgroup out of memory: Kill process 6428 (coredns) score 2050 or sacrifice child
oom_reaper: reaped process 6428 (coredns), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[Graph: CoreDNS pod restarts]
33
STOP THE BLEEDING: INCREASE MEMORY LIMIT
[Graph: CoreDNS memory limit raised from 200Mi to 2Gi, then to 4Gi]
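A minimal sketch (not the actual manifest) of that mitigation as a strategic-merge patch for the CoreDNS Deployment; the 200Mi/2Gi/4Gi values are from the slide, everything else is assumed.

# coredns-memory.yaml -- could be applied e.g. with:
#   kubectl -n kube-system patch deployment coredns --patch "$(cat coredns-memory.yaml)"
spec:
  template:
    spec:
      containers:
        - name: coredns
          resources:
            limits:
              memory: 4Gi   # was 200Mi; raised to 2Gi and then 4Gi to stop the OOMKills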
34
SPIKE IN HTTP REQUESTS
35
SPIKE IN DNS QUERIES
36
INCREASE IN MEMORY USAGE
37
INCIDENT #3: CONTRIBUTING FACTORS
• HTTP retries
• No DNS caching
• Kubernetes ndots:5 problem (see the dnsConfig sketch after this list)
• Short maximum lifetime of HTTP connections
• Fixed memory limit for CoreDNS
• Monitoring affected by DNS outage
github.com/zalando-incubator/kubernetes-on-aws/blob/dev/docs/postmortems/jan-2019-dns-outage.md
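Not from the talk itself, but a common mitigation for the ndots:5 problem mentioned above is to override the pod's resolver options. A minimal sketch; the pod name and image are placeholders.

apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "1"   # external names are resolved directly instead of walking the cluster search domains
  containers:
    - name: myapp
      image: myapp:1.0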
INCIDENT
#4
39
INCIDENT #4: CLUSTER DOWN
40
INCIDENT #4: MANUAL OPERATION
% etcdctl del -r /registry-kube-1/certificatesigningrequest prefix
41
INCIDENT #4: RTFM
% etcdctl del -r /registry-kube-1/certificatesigningrequest prefix
help: etcdctl del [options] <key> [range_end]
42
Junior Engineers are Features, not Bugs
https://www.youtube.com/watch?v=cQta4G3ge44
https://www.outcome-eng.com/human-error-never-root-cause/
44
INCIDENT #4: LESSONS LEARNED
• Disaster Recovery Plan?
• Backup etcd to S3 (see the CronJob sketch after this list)
• Monitor the snapshots
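One way to implement the "Backup etcd to S3" lesson is a CronJob that snapshots etcd and uploads the file; this is a hedged sketch where the image, bucket, endpoint and the omitted TLS flags are assumptions.

apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
spec:
  schedule: "0 * * * *"          # hourly snapshot
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: backup
              image: etcd-backup:example   # hypothetical image containing etcdctl + AWS CLI
              command:
                - /bin/sh
                - -c
                - |
                  set -e
                  # TLS cert/key flags omitted for brevity
                  ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
                    snapshot save /tmp/etcd-snapshot.db
                  aws s3 cp /tmp/etcd-snapshot.db \
                    s3://my-etcd-backups/$(date +%Y-%m-%dT%H-%M)/etcd-snapshot.db

Monitoring the snapshots then reduces to checking that a fresh object appears in the bucket on schedule.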
INCIDENT
#5
46
INCIDENT #5: API LATENCY SPIKES
47
INCIDENT #5: CONNECTION ISSUES
...
Kubernetes worker and master nodes sporadically fail to connect to etcd,
causing timeouts in the API server and disconnects in the pod network.
...
[Diagram: master node running the API Server and etcd (etcd-member)]
48
INCIDENT #5: STOP THE BLEEDING
#!/bin/bash
# Workaround: every 60 seconds check whether the local API server responds
# within 5 seconds; if not, restart the etcd-member unit on this master node.
while true; do
    echo "sleep for 60 seconds"
    sleep 60
    timeout 5 curl http://localhost:8080/api/v1/nodes > /dev/null
    if [ $? -eq 0 ]; then
        echo "all fine, no need to restart etcd member"
        continue
    else
        echo "restarting etcd-member"
        systemctl restart etcd-member
    fi
done
49
INCIDENT #5: CONFIRMATION FROM AWS
[...]
We can’t go into the details [...] that resulted the networking problems during
the “non-intrusive maintenance”, as it relates to internal workings of EC2.
We can confirm this only affected the T2 instance types, ...
[...]
We don’t explicitly recommend against running production services on T2
[...]
50
INCIDENT #5: LESSONS LEARNED
• It's never the AWS infrastructure until it is
• Treat t2 instances with care
• Kubernetes components are not necessarily "cloud native"
Cloud Native? Declarative, dynamic, resilient, and scalable
INCIDENT
#6
52
INCIDENT #6: IMPACT
[Graph: Ingress 5XX responses spike]
53
INCIDENT #6: CLUSTER DOWN?
54
INCIDENT #6: THE TRIGGER
https://www.outcome-eng.com/human-error-never-root-cause/
56
CLUSTER UPGRADE
FLOW
57
CLUSTER LIFECYCLE MANAGER (CLM)
github.com/zalando-incubator/cluster-lifecycle-manager
58
CLUSTER CHANNELS
github.com/zalando-incubator/kubernetes-on-aws
Channel   Description                                                      Clusters
dev       Development and playground clusters.                             3
alpha     Main infrastructure clusters (important to us).                  2
beta      Product clusters for the rest of the organization (non-prod).    57+
stable    Product clusters for the rest of the organization (prod).        57+
59
E2E TESTS ON EVERY PR
github.com/zalando-incubator/kubernetes-on-aws
60
RUNNING E2E TESTS (BEFORE)
branch: dev
Create Cluster → Run e2e tests → Delete Cluster
Testing dev to alpha upgrade
61
RUNNING E2E TESTS (NOW)
branch: alpha (base) → branch: dev (head)
Create Cluster → Update Cluster → Run e2e tests → Delete Cluster
Testing dev to alpha upgrade
62
INCIDENT #6: LESSONS LEARNED
• Automated e2e tests are pretty good, but not enough
• Test the diff/migration automatically
• Bootstrap new cluster with previous configuration
• Apply new configuration
• Run end-to-end & conformance tests
github.com/zalando-incubator/kubernetes-on-aws/tree/dev/test/e2e
INCIDENT
#7
64
#7: KERNEL OOM KILLER
⇒ all containers on this node down
65
INCIDENT #7: KUBELET MEMORY
66
UPSTREAM ISSUE REPORTED
https://github.com/kubernetes/kubernetes/issues/73587
67
INCIDENT #7: THE PATCH
https://github.com/kubernetes/kubernetes/issues/73587
INCIDENT
#8
69
INCIDENT #8: IMPACT
Error during Pod creation:
MountVolume.SetUp failed for volume
"outfit-delivery-api-credentials" :
secrets "outfit-delivery-api-credentials" not found
⇒ All new Kubernetes deployments fail
70
INCIDENT #8: CREDENTIALS QUEUE
17:30:07 | [pool-6-thread-1 ] | Current queue size: 7115, current number of active workers: 20
17:31:07 | [pool-6-thread-1 ] | Current queue size: 7505, current number of active workers: 20
17:32:07 | [pool-6-thread-1 ] | Current queue size: 7886, current number of active workers: 20
..
17:37:07 | [pool-6-thread-1 ] | Current queue size: 9686, current number of active workers: 20
..
17:44:07 | [pool-6-thread-1 ] | Current queue size: 11976, current number of active workers: 20
..
19:16:07 | [pool-6-thread-1 ] | Current queue size: 58381, current number of active workers: 20
71
INCIDENT #8: CPU THROTTLING
72
INCIDENT #8: WHAT HAPPENED
Scaled down the IAM credentials provider to reduce Slack
+ number of deployments increased
⇒ the provider could not process credentials fast enough
73
CPU/memory requests "block" resources on nodes.
Difference between actual usage and requests → "Slack"
[Diagram: node with CPU and memory bars; the reserved-but-unused part is the "Slack"]
74
DISABLING CPU THROTTLING
[Announcement] CPU limits will be disabled
⇒ Ingress Latency Improvements
kubelet … --cpu-cfs-quota=false
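The same setting can be expressed via the kubelet configuration file instead of the command-line flag (sketch; the remaining fields depend on the cluster setup).

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuCFSQuota: false   # disable CFS quota enforcement, i.e. no CPU throttling for containers with CPU limits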
75
A MILLION WAYS TO CRASH YOUR CLUSTER?
• Switch to latest Docker to fix issues with Docker daemon freezing
• Redesign of DNS setup due to high DNS latencies (5s),
switch from kube-dns to node-local dnsmasq+CoreDNS
• Disabling CPU throttling (CFS quota) to avoid latency issues
• Quick fix for timeouts using etcd-proxy: client-go still seems to have
issues with timeouts
• 502's during cluster updates: race condition during network setup
76
MORE TOPICS
• Graceful Pod shutdown and race conditions (endpoints, Ingress); see the preStop sketch after this list
• Incompatible Kubernetes changes
• CoreOS ContainerLinux "stable" won't boot
• Kubernetes EBS volume handling
• Docker
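For the graceful-shutdown race mentioned above, a frequently used workaround is to delay SIGTERM with a preStop hook so endpoints and Ingress have time to drop the pod. This is an illustrative sketch; the name, image and delay are assumptions.

apiVersion: v1
kind: Pod
metadata:
  name: graceful-shutdown-example
spec:
  terminationGracePeriodSeconds: 60
  containers:
    - name: app
      image: myapp:1.0          # placeholder image
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "20"]   # runs before SIGTERM; gives load balancers time to deregister the pod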
77
RACE CONDITIONS..
github.com/zalando-incubator/kubernetes-on-aws
78
TIMEOUTS TO API SERVER..
github.com/zalando-incubator/kubernetes-on-aws
79
MANAGED
KUBERNETES?
80
WILL MANAGED K8S SAVE US?
GKE: monthly uptime percentage at 99.95% for regional clusters
81
WILL MANAGED K8S SAVE US?
NO (not really)
e.g. AWS EKS uptime SLA is only for API server
82
PRODUCTION PROOFING AWS EKS
List of things you might want to look at for EKS in production:
https://medium.com/glia-tech/productionproofing-eks-ed52951ffd6c
83
AWS EKS IN PRODUCTION
https://kubedex.com/90-days-of-aws-eks-in-production/
84
DOCKER.. (ON GKE)
https://github.com/kubernetes/kubernetes/blob/8fd414537b5143ab039cb910590237cabf4af783/cluster/gce/gci/health-monitor.sh#L29
WELCOME TO
CLOUD NATIVE!
86
87
KUBERNETES FAILURE STORIES
A compiled list of links to public failure stories related to Kubernetes.
k8s.af
We need more failure talks!
Istio? Anyone?
88
OPEN SOURCE
Kubernetes on AWS
github.com/zalando-incubator/kubernetes-on-aws
AWS ALB Ingress controller
github.com/zalando-incubator/kube-ingress-aws-controller
Skipper HTTP Router & Ingress controller
github.com/zalando/skipper
External DNS
github.com/kubernetes-incubator/external-dns
Postgres Operator
github.com/zalando-incubator/postgres-operator
Kubernetes Resource Report
github.com/hjacobs/kube-resource-report
Kubernetes Downscaler
github.com/hjacobs/kube-downscaler
QUESTIONS?
HENNING JACOBS
HEAD OF
DEVELOPER PRODUCTIVITY
henning@zalando.de
@try_except_
Illustrations by @01k
