Observability for Application Developers (1)-1.pptx

Observability for Application
Developers

About me
Abhinav Kapoor
• Principal Engineer, Nagarro
• Work as a Technical Architect for .NET Solutions.
• 15+ years of technical experience in architecting & developing solutions for Banking, Supply chain &
Embedded domains.
• I enjoy writing about software architecture, communication patterns, modernization & more on
Medium.com https://medium.com/@kapoorabhinav.
Accreditations
I‘m available on Linked at
https://www.linkedin.com/in/abhinav-kapoor
0394456/

What we’ll cover
1. Why do we need Observability?
2. How does it fit in a cloud native landscape?
3. Application Instrumentation
1. Logs
2. Traces
3. metrics
4. Health Check
4. OpenTelemetry
1. What is it & why is it needed?
2. Its components.
3. Its signals

Why do we need Observability?
● We want to avoid these reactions

Why do we need Observability?
● Issue Timeline in Operation Support
Detect Identify Fix Deploy
Issue
Start
Issue
End
Mean Time To Repair (MTTR)
Mean Time To Identify
(MTTI)
Mean Time To Detect (MTTD)

Why is Observability more critical now?
● Evolution to smaller but more services - high cohesion and low coupling.
“Complexities arises when the dependencies among the elements become important”
- We have reduced code complexity but introduced much more moving parts

Why is Observability more critical now?
Technology transformation trends
Microservices Polyglot Applications Ephemeral & cloud native environments

How does it fit in a cloud native
landscape?

How does it fit in a cloud native landscape?
Instrument Capture Transport Analyze Act
SDK, Agent
Scrapping/
Agent
Dashboard Alerts / Notifications

How does it fit in a cloud native landscape?
Active Directory / IAM
Certificate Manager
Resources – DB, Broker, API Gateway
Service A
Sidecar
Platform / Orchestrator
Service B
Sidecar
Shared
Services,
Infrastructure
Applications
Audit logs,
Activity logs,
Metrics
Central Aggregation
& Collection
Application logs,
metrics, traces Monitoring & Alerts
Access Control

Application Instrumentation –
What should I know & do as a
developer ?

Application Instrumentation
● Logs – What happened & who did it ?
● Traces – How did it happen ?
● Metrics – How much happened ?
● Health – How is the system doing ?

Big picture
● We don’t just want to have connected black boxes.
● While infrastructure & sidecars can give a good overview, there are always
blind spots.
● It speeds up development.
Logs, metrics,
Traces, Health
Logs, metrics,
Traces, Health
Logs, metrics,
Traces, Health
Logs, metrics,
Traces, Health

Logs – Components of Logging
Logging Framework
Log Message
- Severity, tags
Application
SDK
Filter based on Severity
- Destination Specific
Log Providers
ILogger

Types of Logs
● Based on Information they contain – Governs where they are stored, how long
they are stored, who can access them
○ Technical – Startup configurations, system event that the application receives
○ Business - Business and User Information
● Based on the schema
○ Unstructured – Simple statements with severity.
○ Structured / Schematic logs - Logs with Metadata to provide context. Pod, Service, IP address
& more.

Duality of Logs - Readable messages &
Quarriable metrics

Traces – Types of traces
● Tracing – This is similar to logging. But noisier and collects more information
from deeper parts of the application.
● Distributed Trancing – It’s the method to observe requests as they propagate
through different services.
○ Its structured log which comes from different processes, nodes and can be stitched together
to give an end to end view.

Distributed Tracing
Trace Collector
Services Web API Broker Background Services
Traces
HTTP Request
with tracing
context
TCP message
with tracing
context
TCP message
with tracing
context
DB
Store tracing
context with
data in db
Services
Read tracing
context
• Correlation IDs are passed in the remote calls/messages.
• The receiver regenerates context proceeds with its unit of work.
• Depending upon business transaction, traces can be of few seconds to hours or days.

Metrics–How much is happening?

Metrics – Components
Vendor specific
Instrumentation &
collection
Metrics-
Name, description, Labels &
Value
Application
Backend – time series database
- Adds timestamp
- Supports aggregation (eg PromQL)
SDK
/Metrics
scrape
Visualization
PromQL expressions
Alerts

Types of Prometheus Metrics - Counters
Track the occurrence of an event in application.
● The absolute value may not be helpful.
● The delta between two timestamps or the rate of change over time is helpful. Functions like Increase(),
Rate().
● Consider the case when applications restart.
● Examples - Orders processed counter, unsuccessful counters, business error, technical errors.
● Sample output of meta-endpoint (/metrics)
# HELP http_requests_total Total number of http api requests
# TYPE http_requests_total
counter http_requests_total{api=“add_Product“, company=“ITC”} 3433
counter http_requests_total{api=“add_Product“, company =“TATA”} 2000
PromQL
increase(http_requests_total{api="add_product“, company=“ITC”}[5m])

Types of Prometheus Metrics - Gauges
Snapshots of a metric at a single point in time. Not an event.
● Example is message queue size, CPU utilization at a given time.
● Useful functions - avg_over_time, max_over_time,
min_over_time, and quantile_over_time

Types of Prometheus Metrics - Histograms
● Histograms divide the entire range of measurements into a set of intervals—
named buckets—and count how many measurements fall into each bucket.
● These buckets are defined at compile time, they are upper inclusive (le).
● Histogram looks like the following
http_request_duration_seconds_sum{api="add_product" instance=" host1 "} 8953.332
http_request_duration_seconds_count{api="add_product" instance=" host1"} 27892
http_request_duration_seconds_bucket{api="add_product", instance=" host1", le="0.05"}
1672
http_request_duration_seconds_bucket{api="add_product", instance=" host1", le="0.1"} 8954
http_request_duration_seconds_bucket{api="add_product", instance=" host1", le="0.25"}
14251
sum by (le) (rate(http_request_duration_seconds_bucket[5m]))

Types of Prometheus Metrics - Summaries
● Similar to Histogram but they add quantile label to bound the summary.
● OpenTelemetry has been marked it legacy.
● Summary looks like the following
http_request_duration_seconds_sum{api="add_product" instance="host1.domain.com"} 8953.332
http_request_duration_seconds_count{api="add_product" instance="host1.domain.com"} 27892
http_request_duration_seconds{api="add_product" instance=" host1 " quantile="0"}
http_request_duration_seconds{api="add_product" instance="host1" quantile="0.5"} 0.232227334
● Considerations due to quantile.

Metrics - What happens after instrumentation?
● Enrichthe application instrumentation with infrastructure information such as
Service Name, Pod, IP address.
● The backend saves it with timestamp
● Add Alerts & dashboards. Help available at
https://grafana.com/grafana/dashboards/

Health Check– How is the system
doing ?

Health check
● Basic use is liveness probe.
● More useful when we add dependencies.
● Prebuilt libraries depending upon the language
● Considerations
○ Orchestrator may restart the service while the issue is in a
dependency.
○ Checking all downstream services could be time consuming.
○ Meta-end points (/health) may not be exposed. Instead of
writing to the response metrics may be used to indicate
issue.

What is OpenTelemetry ?
● It is an open-source observability framework for infrastructure & application
instrumentation.
● It’s the second-highest ranked CNCF project by activity and contributors after
K8s
● It aims to be vendor agnostic & the SDK is produced by OpenTelemetry.
● Its still young so not all features are available for all languages.
● As Open Telemetry doesn’t provide a backend implementation (its concern is
creating, collecting, and sending signals), the data will flow to another system
or systems for storage and querying.

Why OpenTelemetry?
● Before Open Telemetry
○ APM vendors had their own things – SDKs,
Agents, Grammer
○ Vendor lock-in.
○ Application developers, built wrappers,
event hooks to keep application isolated.
○ Competing Open standards – OpenTracing
(Traces), OpenCensus (Metrics, Traces)

OpenTelemetry Library -
Instrumentation (Automatic & Manual)
Exporters
Exporter
- Backends/APM
Application
Trace, metrics, Logs
SDK
Components of OpenTelemetry
Collector
Multiple receivers
Multiple exporters

Signals of OpenTelemetry - Metrics
● Differences from Prometheus metrics
○ Summary is marked deprecated
○ Some more functions & data type flexibility

Signals of OpenTelemetry -Tracing
Data model of an OpenTelemetry Span
Example tracing for a news article.
Source - https://www.timescale.com/blog/what-are-traces-and-how-sql-yes-sql-
and-opentelemetry-can-help-us-get-more-value-out-of-traces-to-build-better-
software/

Observability for Application Developers (1)-1.pptx

More Related Content

Observability for Application Developers (1)-1.pptx

Editor's Notes