Presented at GDG Devfest Ukraine 2018.
Prometheus has become the de facto monitoring system for cloud native applications, with systems like Kubernetes and etcd natively exposing Prometheus metrics. In this talk Tom will explore all the moving parts of a working Prometheus-on-Kubernetes monitoring system, including kube-state-metrics, node-exporter, cAdvisor and Grafana. You will learn about the various methods for getting to a working setup: the manual approach, using CoreOS’s Prometheus Operator, or using the Prometheus Ksonnet Mixin. Tom will also share some tips and tricks for getting the most out of your Prometheus monitoring, including common pitfalls and what you should be alerting on.
4. Prometheus
• A monitoring & alerting system.
• Inspired by Google’s BorgMon
• Originally built by SoundCloud in 2012
• Open Source, now part of the CNCF
• Simple text-based metrics format
• Multidimensional data model
• Rich, concise query language
7. Prometheus’ data model is very simple:
<identifier> → [ (t0, v0), (t1, v1), ... ]
Timestamps are millisecond int64, values are float64
https://www.slideshare.net/Docker/monitoring-the-prometheus-way-julius-voltz-prometheus
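As a rough illustration of the data model on this slide, the sketch below stores each series as a list of (timestamp, value) pairs keyed by an identifier. It assumes the identifier is the metric name plus its labels; the `series_id` and `append` helpers and the in-memory `store` are hypothetical names for illustration only, not how Prometheus stores data internally.

```python
from typing import Dict, FrozenSet, List, Tuple

# A sample is (timestamp in milliseconds as int, value as float).
Sample = Tuple[int, float]
# Hypothetical encoding of an identifier: metric name plus a set of label pairs.
SeriesId = Tuple[str, FrozenSet[Tuple[str, str]]]


def series_id(name: str, **labels: str) -> SeriesId:
    """Build a hashable identifier from a metric name and its labels."""
    return (name, frozenset(labels.items()))


def append(store: Dict[SeriesId, List[Sample]], sid: SeriesId,
           t_ms: int, value: float) -> None:
    """Append one sample to a series, creating the series on first use."""
    store.setdefault(sid, []).append((int(t_ms), float(value)))


store: Dict[SeriesId, List[Sample]] = {}
home = series_id("http_requests_total", job="nginx", path="/home", status="500")
append(store, home, 1_000, 10)
append(store, home, 16_000, 12)
```

Because the labels are held in a set, the order in which they are written does not change which series a sample belongs to.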
12. And aggregate by a dimension…
PromQL: sum by (path) (rate(http_requests_total{job="nginx", status=~"5.."}[1m]))
{path=“/home”} 0.0666
{path=“/settings”} 3.3
...
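To make the query above more concrete, here is a minimal sketch of what `rate(...)` followed by `sum by (path)` computes. It is a simplification: `simple_rate` handles counter resets but skips the window extrapolation Prometheus performs, and it returns 0.0 for a window with fewer than two samples where Prometheus would return no result. All sample values are made up.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Sample = Tuple[int, float]  # (timestamp in ms, counter value)


def simple_rate(samples: List[Sample]) -> float:
    """Per-second increase of a counter over the samples in one window."""
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter was reset, so count up from zero again.
        increase += cur - prev if cur >= prev else cur
    seconds = (samples[-1][0] - samples[0][0]) / 1000.0
    return increase / seconds if seconds > 0 else 0.0


def sum_by(label: str, vector: List[Tuple[Dict[str, str], float]]) -> Dict[str, float]:
    """Sum values that share the same value for `label`, dropping other labels."""
    totals: Dict[str, float] = defaultdict(float)
    for labels, value in vector:
        totals[labels[label]] += value
    return dict(totals)


# Made-up one-minute windows for three series with 5xx status codes.
windows = [
    ({"path": "/home", "status": "500"}, [(0, 100.0), (60_000, 104.0)]),
    ({"path": "/home", "status": "503"}, [(0, 7.0), (60_000, 9.0)]),
    ({"path": "/settings", "status": "500"}, [(0, 50.0), (30_000, 2.0), (60_000, 8.0)]),
]
result = sum_by("path", [(labels, simple_rate(s)) for labels, s in windows])
```

The two `/home` series collapse into one output element because only the `path` label survives the aggregation.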
13. Do binary operations…
PromQL: sum by (path) (rate(http_requests_total{job="nginx", status=~"5.."}[1m]))
/
sum by (path) (rate(http_requests_total{job="nginx"}[1m]))
{path=“/home”} 0.001
{path=“/settings”} 1.0
...
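The division above pairs up elements from the two sides that carry the same labels. A minimal sketch of that one-to-one matching, assuming both sides are already aggregated down to a single `path` label; `divide_matching` is a hypothetical helper, and the input values are chosen only to reproduce the ratios shown on the slide.

```python
from typing import Dict


def divide_matching(numerator: Dict[str, float],
                    denominator: Dict[str, float]) -> Dict[str, float]:
    """Divide two vectors keyed by `path`, keeping only keys present on both sides.

    Simplification: zero denominators are skipped here, whereas PromQL
    would yield NaN or Inf for them.
    """
    return {
        path: numerator[path] / denominator[path]
        for path in numerator
        if path in denominator and denominator[path] != 0
    }


# Assumed per-path rates (requests per second).
errors = {"/home": 0.0666, "/settings": 3.3, "/login": 0.5}
total = {"/home": 66.6, "/settings": 3.3}

error_ratio = divide_matching(errors, total)
```

`/login` has no partner on the right-hand side, so it does not appear in the result.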
14. Kubernetes
• Platform for managing containerized
workloads and services
• “operating system for your datacenter”
• Inspired by Google’s Borg
• Also part of the CNCF
• Distributed, fault tolerant architecture
• Rich object model for your
applications
22. USE Method
CPU Utilisation:
1 - avg(rate(node_cpu{mode="idle"}[1m]))
CPU Saturation:
sum(node_load1)/ sum(node:node_num_cpu:sum)
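The arithmetic behind these two expressions can be sketched as follows. This assumes each idle rate is the fraction of a second a CPU spent idle per second, and that `node:node_num_cpu:sum` yields one CPU count per node; the function names and input numbers are hypothetical.

```python
from typing import List


def cpu_utilisation(idle_rates: List[float]) -> float:
    """1 minus the average idle fraction across all CPUs."""
    return 1.0 - sum(idle_rates) / len(idle_rates)


def cpu_saturation(load1_per_node: List[float], cpus_per_node: List[int]) -> float:
    """Total 1-minute load average divided by the total number of CPUs."""
    return sum(load1_per_node) / sum(cpus_per_node)


# Four CPUs that are idle 90%, 70%, 80% and 60% of the time.
utilisation = cpu_utilisation([0.9, 0.7, 0.8, 0.6])

# Two 4-CPU nodes with load averages of 2.0 and 6.0.
saturation = cpu_saturation([2.0, 6.0], [4, 4])
```

A saturation of 1.0 or more means that, on average, at least as many processes are runnable as there are CPUs.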
23. USE Method
• Can also look at
container level metrics
from cAdvisor…
• …and combine them
with metadata from
kube-state-metrics.
24. USE Method
Container CPU usage by “app” label
sum by (namespace, label_name) (
sum by (pod_name, namespace) (
rate(container_cpu_usage_seconds_total[5m])
)
* on (pod_name) group_left(label_name)
label_join(kube_pod_labels, "pod_name", ",", "pod")
)
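The join in this query can be read as: look up each pod's label in the kube-state-metrics metadata, then re-aggregate CPU usage by that label. A minimal sketch of that many-to-one match on `pod_name`, assuming the metadata series has the value 1 so the multiplication leaves the CPU value unchanged; `usage_by_label` and the pod names are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Tuple


def usage_by_label(
    cpu_by_pod: Dict[Tuple[str, str], float],  # (namespace, pod_name) -> CPU cores
    label_by_pod: Dict[str, str],              # pod_name -> value of label_name
) -> Dict[Tuple[str, str], float]:
    """Attach each pod's label to its CPU usage, then sum by (namespace, label)."""
    totals: Dict[Tuple[str, str], float] = defaultdict(float)
    for (namespace, pod_name), cores in cpu_by_pod.items():
        label = label_by_pod.get(pod_name)
        if label is None:
            continue  # no matching metadata series, so the pod is dropped
        totals[(namespace, label)] += cores
    return dict(totals)


cpu = {
    ("default", "web-1"): 0.2,
    ("default", "web-2"): 0.3,
    ("kube-system", "dns-1"): 0.1,
    ("default", "orphan"): 0.5,
}
labels = {"web-1": "web", "web-2": "web", "dns-1": "dns"}

by_app = usage_by_label(cpu, labels)
```

Pods without a metadata entry, such as `orphan` here, drop out of the result, which mirrors how unmatched elements disappear from a PromQL binary operation.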
26. RED Method
Most useful alert I’ve found:
100 * sum by(instance, job) (
rate(rest_client_requests_total{code!~"2.."}[5m])
)
/
sum by(instance, job) (
rate(rest_client_requests_total[5m])
)
27. ??? Method
Alert expressions are invariants that describe a healthy
system
kube_deployment_spec_replicas !=
kube_deployment_status_replicas_available
rate(kube_pod_container_status_restarts_total[15m]) > 0
28. ??? Method
Alert expressions are invariants that describe a healthy system
(kube_pod_status_phase{phase!~"Running|Succeeded"}) > 0
sum(kube_pod_container_resource_requests_cpu_cores)
/ sum(node:node_num_cpu:sum)
>
(count(node:node_num_cpu:sum) - 1)
/ count(node:node_num_cpu:sum)
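The last expression compares the fraction of cluster CPU that has been requested against (n - 1) / n, where n is the number of nodes. One way to read it, assuming `node:node_num_cpu:sum` yields one CPU count per node and the nodes are roughly equal in size: the alert fires when the requests would no longer fit if one node were lost. A small worked sketch with made-up numbers and a hypothetical function name:

```python
from typing import List


def cannot_tolerate_node_failure(requested_cores: float,
                                 cores_per_node: List[int]) -> bool:
    """True when requested CPU exceeds the share left after losing one node."""
    total_cores = sum(cores_per_node)
    nodes = len(cores_per_node)
    return requested_cores / total_cores > (nodes - 1) / nodes


cluster = [4, 4, 4]  # three nodes with 4 cores each

# 9 of 12 cores requested: 0.75 > 2/3, so the alert would fire.
overcommitted = cannot_tolerate_node_failure(9.0, cluster)

# 7 of 12 cores requested: about 0.58, below 2/3, so it would not.
healthy = cannot_tolerate_node_failure(7.0, cluster)
```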
29. Cortex
• Horizontally scalable, HA
Prometheus
• Now part of the CNCF Sandbox
• Distributed, fault tolerant architecture
• Long term storage
• Multitenant
https://github.com/cortexproject/cortex
31. Getting set up
• github.com/coreos/prometheus-operator - Job to look after running
Prometheus on Kubernetes
• github.com/coreos/kube-prometheus - Set of configs for running all the
other things you need.
• github.com/grafana/jsonnet-libs/tree/master/prometheus-ksonnet - My
configs for running Prometheus, Alertmanager, Grafana etc
• github.com/kubernetes-monitoring/kubernetes-mixin - Joint project to
unify and improve common alerts for Kubernetes.