Intro to open source observability with Grafana, Prometheus, Loki, and Tempo
This document provides an introduction to open source observability tools including Grafana, Prometheus, Loki, and Tempo. It summarizes each tool and how they work together. Prometheus is introduced as a time series database that collects metrics. Loki is described as a log aggregation system that handles logs at scale without high costs. Tempo is explained as a tracing system that allows tracing from logs, metrics, and between services. The document emphasizes that these tools can be run together to gain observability across an entire system from logs to metrics to traces.
6. | 6
Dirty secret
● What is “cloud native scale” today was “Internet scale” for networks two
decades ago
● The combination of metrics and events has been standard in
power measurement for half a century
● Modern tools with good engineering practices map very well onto
brownfield technology
7. | 7
Buzzword alert!
● Cool new term, almost meaningless by now, what does it mean?
○ Pitfall alert: Cargo culting
○ It’s about changing the behaviour, not about changing the name
● “Monitoring” has taken on a meaning of collecting, not using data
○ One extreme: Full text indexing
○ Other extreme: Data lake
● “Observability” is about enabling humans to understand complex
systems
○ Ask why it’s not working instead of just knowing that it’s not
9. | 9
If you can’t ask new questions on the fly, it’s not observability
10. | 10
Complexity
● Fake complexity, a.k.a. bad design
○ Can be reduced
● Real, system-inherent complexity
○ Can be moved (monolith vs client-server vs microservices)
○ Must be compartmentalized (service boundaries)
○ Should be distilled meaningfully
11. | 11
SRE, an instantiation of DevOps
● At its core: Align incentives across the org
○ Error budgets allow devs, ops, PMs, etc. to optimize for shared benefits
● Measure it!
○ SLI: Service Level Indicator: What you measure
○ SLO: Service Level Objective: What you need to hit
○ SLA: Service Level Agreement: When you need to pay
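The relationship between an SLI, an SLO, and the resulting error budget can be sketched in a few lines. This is a minimal illustration, assuming a request-based availability SLI; the function names and all numbers (a 99.9% SLO, one million requests, 400 failures) are hypothetical and not taken from the slides.

```python
def error_budget(slo: float, total_requests: int) -> float:
    """Number of requests that may fail while the SLO is still met."""
    return (1.0 - slo) * total_requests


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    return 1.0 - failed_requests / error_budget(slo, total_requests)


# Hypothetical month: 1,000,000 requests, 400 of them failed, SLO of 99.9%.
total, failed, slo = 1_000_000, 400, 0.999
sli = 1.0 - failed / total                        # SLI: what you measure
slo_met = sli >= slo                              # SLO: what you need to hit
remaining = budget_remaining(slo, total, failed)  # roughly 60% of the budget is left
```

A shared number like `remaining` is what lets devs, ops, and PMs argue about the same thing: while budget is left, shipping faster is affordable; once it is spent, reliability work takes priority.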
12. | 12
Shared understanding
● Everyone uses the same tools & dashboards
○ Shared incentive to invest into tooling
○ Pooling of institutional system knowledge
○ Shared language & understanding of services
13. | 13
Services
● Service?
○ Compartmentalized complexity, with an interface
○ Different owners/teams
○ Contracts define interfaces
● Why “contract”: Shared agreement which MUST NOT be broken
○ Internal and external customers rely on what you build and maintain
● Other common term: layer
○ The Internet would not exist without network layering
○ Enables innovation, parallelizes human engineering
● Other examples: CPUs, hard disks, compute nodes, your lunch
14. | 14
Alerting
● Customers care about services being up, not about individual components
● Discern between different SLIs
○ Primary: service-relevant, for alerting
○ Secondary: informational, for debugging; might be an underlying service’s primary SLI
Anything currently or imminently impacting customer service must be
alerted upon
But nothing(!) else
16. | 16
Prometheus 101
● Inspired by Google's Borgmon
● Time series database
● int64 millisecond timestamp, float64 value
● Instrumentation & exporters
● Not for event logging
● Dashboarding via Grafana
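To make the sample model concrete, here is a minimal sketch of a single sample as a 64-bit millisecond timestamp plus a float64 value. The timestamp is packed as a signed 64-bit integer and the byte order is little-endian; both are assumptions of this sketch, and the layout is illustrative only, not Prometheus's actual storage format.

```python
import struct

# "<qd": little-endian, signed 64-bit integer, 64-bit float (no padding).
SAMPLE_FORMAT = "<qd"


def pack_sample(timestamp_ms: int, value: float) -> bytes:
    """Encode one (timestamp, value) sample into raw bytes."""
    return struct.pack(SAMPLE_FORMAT, timestamp_ms, value)


def unpack_sample(raw: bytes) -> tuple:
    """Decode raw bytes back into (timestamp_ms, value)."""
    return struct.unpack(SAMPLE_FORMAT, raw)


raw = pack_sample(1_700_000_000_000, 0.25)  # hypothetical sample
sample_size = len(raw)                      # 8 + 8 bytes per uncompressed sample
```

Because every sample is just two fixed-width numbers, there is no room for free-form event payloads, which is one way to read "not for event logging".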
17. | 17
Main selling points
● Highly dynamic, built-in service discovery
● No hierarchical model, n-dimensional label set
● PromQL: for processing, graphing, alerting, and export
● Simple operation
● Highly efficient
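The "no hierarchy, n-dimensional label set" idea can be sketched as plain dictionaries: every series is identified by its labels, and any label can be used to filter or aggregate. The metric name, label names, and values below are hypothetical, and the helpers only imitate the spirit of PromQL selectors and `sum by`, not their implementation.

```python
from typing import Dict, List, Tuple

Labels = Dict[str, str]
Series = Tuple[Labels, float]

# Hypothetical current values of one metric, identified only by labels.
SERIES: List[Series] = [
    ({"__name__": "http_requests_total", "job": "api", "method": "get", "status": "200"}, 120.0),
    ({"__name__": "http_requests_total", "job": "api", "method": "post", "status": "500"}, 3.0),
    ({"__name__": "http_requests_total", "job": "web", "method": "get", "status": "200"}, 40.0),
]


def select(series: List[Series], selector: Labels) -> List[Series]:
    """Keep series whose labels contain every key/value pair of the selector."""
    return [
        (labels, value)
        for labels, value in series
        if all(labels.get(k) == v for k, v in selector.items())
    ]


def sum_by(series: List[Series], label: str) -> Dict[str, float]:
    """Aggregate values along one dimension, dropping all other labels."""
    totals: Dict[str, float] = {}
    for labels, value in series:
        key = labels.get(label, "")
        totals[key] = totals.get(key, 0.0) + value
    return totals


api_only = select(SERIES, {"job": "api"})
per_method = sum_by(SERIES, "method")
```

No dimension is privileged: slicing by `job`, `method`, or `status` uses the same two functions, which is the practical difference from a fixed `host.service.metric` hierarchy.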
18. | 18
Main selling points
● Prometheus is a pull-based system
● Black-box monitoring: Looking at a service from the outside (Does the server
respond to HTTP requests?)
● White-box monitoring: Instrumenting code from the inside (How much time
does this subroutine take?)
● Every service should have its own metrics endpoint
● Hard API commitments within major versions
19. | 19
Time series
● Time series are recorded values which change over time
● Individual events are usually merged into counters and/or histograms
● Changing values are recorded as gauges
● Typical examples
○ Requests to a webserver (counter)
○ Temperatures in a datacenter (gauge)
○ Service latency (histograms)
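The three metric types above can be sketched as small classes. This is a simplified illustration, not the API of any Prometheus client library; the class and method names are chosen for this sketch, and the bucket boundaries are hypothetical.

```python
import bisect


class Counter:
    """Only ever goes up, e.g. requests served."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters must not decrease")
        self.value += amount


class Gauge:
    """Can go up and down, e.g. a temperature."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


class Histogram:
    """Merges individual events into buckets, e.g. request latencies."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        self.bucket_counts = [0] * (len(self.bounds) + 1)  # last slot is +Inf
        self.total = 0.0
        self.count = 0

    def observe(self, value):
        # First bucket whose upper bound is >= value ("less than or equal").
        self.bucket_counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += value
        self.count += 1

    def cumulative(self):
        """Cumulative counts, one per upper bound plus +Inf."""
        running, out = 0, []
        for c in self.bucket_counts:
            running += c
            out.append(running)
        return out


latency = Histogram([0.1, 0.5, 1.0])  # hypothetical bounds, in seconds
for seconds in (0.05, 0.3, 0.3, 2.0):
    latency.observe(seconds)
```

Four separate events end up as a handful of numbers; the individual requests are gone, which is why time series are cheap and also why they are not an event log.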
21. | 21
Scale
● Kubernetes is Borg
● Prometheus is Borgmon
● Google couldn't have run Borg without Borgmon (plus Omega and Monarch)
● Kubernetes & Prometheus are designed and written with each other in mind
22. | 22
Prometheus scale
● 1,000,000+ samples/second no problem on current hardware
● ~200,000 samples/second/core
● 16 bytes/sample compressed to 1.36 bytes/sample
The highest we saw in production on a single Prometheus instance was 15
million active time series at once!
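The figures on this slide can be turned into a back-of-the-envelope storage estimate. The calculation below uses only the slide's numbers (16 bytes raw, 1.36 bytes compressed, 1,000,000 samples/second) and assumes, for simplicity, a constant ingest rate over a full day; index and metadata overhead are ignored.

```python
RAW_BYTES_PER_SAMPLE = 16.0          # 8-byte timestamp + 8-byte value
COMPRESSED_BYTES_PER_SAMPLE = 1.36   # figure quoted on the slide
SAMPLES_PER_SECOND = 1_000_000
SECONDS_PER_DAY = 86_400

compression_ratio = RAW_BYTES_PER_SAMPLE / COMPRESSED_BYTES_PER_SAMPLE
compressed_bytes_per_day = (
    COMPRESSED_BYTES_PER_SAMPLE * SAMPLES_PER_SECOND * SECONDS_PER_DAY
)
raw_bytes_per_day = RAW_BYTES_PER_SAMPLE * SAMPLES_PER_SECOND * SECONDS_PER_DAY

print(f"compression ratio: about {compression_ratio:.1f}x")
print(f"compressed: about {compressed_bytes_per_day / 1e9:.0f} GB/day")
print(f"uncompressed: about {raw_bytes_per_day / 1e12:.2f} TB/day")
```

Under these assumptions, a million samples per second needs on the order of a hundred gigabytes of sample data per day rather than more than a terabyte.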
23. | 23
Long term storage
● Two long term storage solutions have Prometheus-team members
working on them
○ Thanos
■ Historically easier to run, but slower
■ Scales storage horizontally
○ Cortex
■ Easy to run these days
■ Scales storage, ingesters, and queriers horizontally
24. | 24
Cortex @ Grafana (largest cluster, 2021-09)
● ~65 million active series (just the one cluster)
● 668 CPU cores
● 3,349 GiB RAM
One customer running at 3 billion active series
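For a feel of what these cluster numbers imply per unit of hardware, the ratios can be computed directly. This is a rough sketch using only the slide's figures; it treats all cores and RAM as if they were spent on holding active series, which is a simplifying assumption, since the same cluster also serves queries and other work.

```python
ACTIVE_SERIES = 65_000_000
CPU_CORES = 668
RAM_GIB = 3_349

series_per_core = ACTIVE_SERIES / CPU_CORES
ram_bytes_per_series = RAM_GIB * 2**30 / ACTIVE_SERIES

print(f"about {series_per_core:,.0f} active series per core")
print(f"about {ram_bytes_per_series / 1024:.0f} KiB of RAM per active series")
```

These are upper bounds on per-series cost, not a sizing guide.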
26. | 26
Loki 101
● Follows the same label-based system as Prometheus
● No full text index needed, incredible speed
● Work with logs at scale, without the massive cost
● Access logs with the same label sets as metrics
● Turn logs into metrics, to make it easier to work with them
● Make direct use of syslog data, via promtail
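The core idea, indexing only the labels and scanning log content at query time, can be sketched with a toy in-memory store. `TinyLogStore` and its methods are hypothetical names for this illustration and do not reflect Loki's actual implementation or API; the label sets and log lines are made up.

```python
from collections import defaultdict


class TinyLogStore:
    """Toy log store: streams are keyed by label set, lines are not indexed."""

    def __init__(self):
        # frozenset of (label, value) pairs -> list of (timestamp, line)
        self._streams = defaultdict(list)

    def push(self, labels, timestamp, line):
        self._streams[frozenset(labels.items())].append((timestamp, line))

    def query(self, selector, contains=""):
        """Pick streams by label, then filter lines by brute-force scan."""
        wanted = set(selector.items())
        for key, entries in self._streams.items():
            if wanted <= key:  # the label index narrows the search first
                for timestamp, line in entries:
                    if contains in line:
                        yield dict(key), timestamp, line

    def count_over_time(self, selector, contains, start, end):
        """Turn matching log lines into a number, i.e. a metric."""
        return sum(
            1
            for _, timestamp, _ in self.query(selector, contains)
            if start <= timestamp < end
        )


store = TinyLogStore()
store.push({"job": "api", "env": "prod"}, 10, "GET /users 200")
store.push({"job": "api", "env": "prod"}, 12, "GET /orders 500 error")
store.push({"job": "api", "env": "dev"}, 13, "GET /orders 500 error")
store.push({"job": "web", "env": "prod"}, 14, "error: upstream timeout")

prod_api_errors = store.count_over_time({"job": "api", "env": "prod"}, "error", 0, 60)
```

The selector `{"job": "api", "env": "prod"}` is the same shape as the label set used to query metrics for that service, which is what makes jumping between the two straightforward.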
28. | 28
Loki @ Grafana Labs
● Queries regularly see 40 GiB/s
● Query terabytes of data in under a minute
○ Including complex processing of result sets
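The two figures on this slide are consistent with each other, as a quick calculation shows. This assumes the 40 GiB/s rate is sustained for the whole query, which is a simplification.

```python
QUERY_RATE_GIB_PER_S = 40
GIB_PER_TIB = 1024

seconds_per_tib = GIB_PER_TIB / QUERY_RATE_GIB_PER_S
tib_per_minute = QUERY_RATE_GIB_PER_S * 60 / GIB_PER_TIB

print(f"1 TiB scanned in about {seconds_per_tib:.1f} s")
print(f"about {tib_per_minute:.2f} TiB scanned per minute")
```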
30. | 30
Tempo
● Exemplars: Jump to traces from relevant logs & metrics
○ Native to Prometheus, Cortex, Thanos, and Loki
○ Exemplars work at Google scale, with the ease of Grafana
● Index and search by labelsets available for those who need it
● Object store only: No Cassandra, Elastic, etc.
● 100% compatible with OpenTelemetry Tracing, Zipkin, Jaeger
● 100% of your traces, no sampling
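An exemplar is, in essence, a trace ID attached to a metric sample, so that a suspicious data point leads straight to one concrete trace. The sketch below renders a histogram bucket with an exemplar in the style of the OpenMetrics text format; the class names, the metric name, and the trace ID are hypothetical, and optional parts such as exemplar timestamps are omitted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Exemplar:
    trace_id: str   # ID of one trace that fell into this bucket
    value: float    # the observed value of that request


@dataclass
class BucketSample:
    le: str                             # upper bound of the bucket
    count: int                          # cumulative count
    exemplar: Optional[Exemplar] = None


def render_bucket(name: str, sample: BucketSample) -> str:
    """Render one histogram bucket line, with its exemplar if present."""
    line = f'{name}_bucket{{le="{sample.le}"}} {sample.count}'
    if sample.exemplar is not None:
        line += f' # {{trace_id="{sample.exemplar.trace_id}"}} {sample.exemplar.value}'
    return line


slow = BucketSample(le="0.5", count=12, exemplar=Exemplar("abc123", 0.43))
line = render_bucket("http_request_duration_seconds", slow)
```

A dashboard that sees a latency spike in this bucket can take the `trace_id` and look up that exact trace in the tracing backend, without having searched through traces first.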