Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
SlideShare a Scribd company logo
Observability for Application
Developers
About me
Abhinav Kapoor
• Principal Engineer, Nagarro
• Work as a Technical Architect for .NET Solutions.
• 15+ years of technical experience in architecting & developing solutions for Banking, Supply chain &
Embedded domains.
• I enjoy writing about software architecture, communication patterns, modernization & more on
Medium.com https://medium.com/@kapoorabhinav.
Accreditations
I‘m available on Linked at
https://www.linkedin.com/in/abhinav-kapoor
0394456/
What we’ll cover
1. Why do we need Observability?
2. How does it fit in a cloud native landscape?
3. Application Instrumentation
1. Logs
2. Traces
3. metrics
4. Health Check
4. OpenTelemetry
1. What is it & why is it needed?
2. Its components.
3. Its signals
Why do we need Observability?
● We want to avoid these reactions
Why do we need Observability?
● Issue Timeline in Operation Support
Detect Identify Fix Deploy
Issue
Start
Issue
End
Mean Time To Repair (MTTR)
Mean Time To Identify
(MTTI)
Mean Time To Detect (MTTD)
Why is Observability more critical now?
● Evolution to smaller but more services - high cohesion and low coupling.
“Complexities arises when the dependencies among the elements become important”
- We have reduced code complexity but introduced much more moving parts
Why is Observability more critical now?
Technology transformation trends
Microservices Polyglot Applications Ephemeral & cloud native environments
How does it fit in a cloud native
landscape?
How does it fit in a cloud native landscape?
Instrument Capture Transport Analyze Act
SDK, Agent
Scrapping/
Agent
Dashboard Alerts / Notifications
How does it fit in a cloud native landscape?
Active Directory / IAM
Certificate Manager
Resources – DB, Broker, API Gateway
Service A
Sidecar
Platform / Orchestrator
Service B
Sidecar
Shared
Services,
Infrastructure
Applications
Audit logs,
Activity logs,
Metrics
Central Aggregation
& Collection
Application logs,
metrics, traces Monitoring & Alerts
Access Control
Application Instrumentation –
What should I know & do as a
developer ?
Application Instrumentation
● Logs – What happened & who did it ?
● Traces – How did it happen ?
● Metrics – How much happened ?
● Health – How is the system doing ?
Big picture
● We don’t just want to have connected black boxes.
● While infrastructure & sidecars can give a good overview, there are always
blind spots.
● It speeds up development.
Logs, metrics,
Traces, Health
Logs, metrics,
Traces, Health
Logs, metrics,
Traces, Health
Logs, metrics,
Traces, Health
Logs – What Happened ?
Logs – Components of Logging
Logging Framework
Log Message
- Severity, tags
Application
SDK
Filter based on Severity
- Destination Specific
Log Providers
ILogger
Types of Logs
● Based on Information they contain – Governs where they are stored, how long
they are stored, who can access them
○ Technical – Startup configurations, system event that the application receives
○ Business - Business and User Information
● Based on the schema
○ Unstructured – Simple statements with severity.
○ Structured / Schematic logs - Logs with Metadata to provide context. Pod, Service, IP address
& more.
Duality of Logs - Readable messages &
Quarriable metrics
Traces - How did it happen ?
Traces – Types of traces
● Tracing – This is similar to logging. But noisier and collects more information
from deeper parts of the application.
● Distributed Trancing – It’s the method to observe requests as they propagate
through different services.
○ Its structured log which comes from different processes, nodes and can be stitched together
to give an end to end view.
Distributed Tracing
Trace Collector
Services Web API Broker Background Services
Traces
HTTP Request
with tracing
context
TCP message
with tracing
context
TCP message
with tracing
context
DB
Store tracing
context with
data in db
Services
Read tracing
context
• Correlation IDs are passed in the remote calls/messages.
• The receiver regenerates context proceeds with its unit of work.
• Depending upon business transaction, traces can be of few seconds to hours or days.
Metrics–How much is happening?
Metrics – Components
Vendor specific
Instrumentation &
collection
Metrics-
Name, description, Labels &
Value
Application
Backend – time series database
- Adds timestamp
- Supports aggregation (eg PromQL)
SDK
/Metrics
scrape
Visualization
PromQL expressions
Alerts
Types of Prometheus Metrics - Counters
Track the occurrence of an event in application.
● The absolute value may not be helpful.
● The delta between two timestamps or the rate of change over time is helpful. Functions like Increase(),
Rate().
● Consider the case when applications restart.
● Examples - Orders processed counter, unsuccessful counters, business error, technical errors.
● Sample output of meta-endpoint (/metrics)
# HELP http_requests_total Total number of http api requests
# TYPE http_requests_total
counter http_requests_total{api=“add_Product“, company=“ITC”} 3433
counter http_requests_total{api=“add_Product“, company =“TATA”} 2000
PromQL
increase(http_requests_total{api="add_product“, company=“ITC”}[5m])
Types of Prometheus Metrics - Gauges
Snapshots of a metric at a single point in time. Not an event.
● Example is message queue size, CPU utilization at a given time.
● Useful functions - avg_over_time, max_over_time,
min_over_time, and quantile_over_time
Types of Prometheus Metrics - Histograms
● Histograms divide the entire range of measurements into a set of intervals—
named buckets—and count how many measurements fall into each bucket.
● These buckets are defined at compile time, they are upper inclusive (le).
● Histogram looks like the following
http_request_duration_seconds_sum{api="add_product" instance=" host1 "} 8953.332
http_request_duration_seconds_count{api="add_product" instance=" host1"} 27892
http_request_duration_seconds_bucket{api="add_product", instance=" host1", le="0.05"}
1672
http_request_duration_seconds_bucket{api="add_product", instance=" host1", le="0.1"} 8954
http_request_duration_seconds_bucket{api="add_product", instance=" host1", le="0.25"}
14251
sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
Types of Prometheus Metrics - Summaries
● Similar to Histogram but they add quantile label to bound the summary.
● OpenTelemetry has been marked it legacy.
● Summary looks like the following
http_request_duration_seconds_sum{api="add_product" instance="host1.domain.com"} 8953.332
http_request_duration_seconds_count{api="add_product" instance="host1.domain.com"} 27892
http_request_duration_seconds{api="add_product" instance=" host1 " quantile="0"}
http_request_duration_seconds{api="add_product" instance="host1" quantile="0.5"} 0.232227334
● Considerations due to quantile.
Metrics - What happens after instrumentation?
● Enrichthe application instrumentation with infrastructure information such as
Service Name, Pod, IP address.
● The backend saves it with timestamp
● Add Alerts & dashboards. Help available at
https://grafana.com/grafana/dashboards/
Health Check– How is the system
doing ?
Health check
● Basic use is liveness probe.
● More useful when we add dependencies.
● Prebuilt libraries depending upon the language
● Considerations
○ Orchestrator may restart the service while the issue is in a
dependency.
○ Checking all downstream services could be time consuming.
○ Meta-end points (/health) may not be exposed. Instead of
writing to the response metrics may be used to indicate
issue.
OpenTelemetry
What is OpenTelemetry ?
● It is an open-source observability framework for infrastructure & application
instrumentation.
● It’s the second-highest ranked CNCF project by activity and contributors after
K8s
● It aims to be vendor agnostic & the SDK is produced by OpenTelemetry.
● Its still young so not all features are available for all languages.
● As Open Telemetry doesn’t provide a backend implementation (its concern is
creating, collecting, and sending signals), the data will flow to another system
or systems for storage and querying.
Why OpenTelemetry?
● Before Open Telemetry
○ APM vendors had their own things – SDKs,
Agents, Grammer
○ Vendor lock-in.
○ Application developers, built wrappers,
event hooks to keep application isolated.
○ Competing Open standards – OpenTracing
(Traces), OpenCensus (Metrics, Traces)
OpenTelemetry Library -
Instrumentation (Automatic & Manual)
Exporters
Exporter
- Backends/APM
Application
Trace, metrics, Logs
SDK
Components of OpenTelemetry
Collector
Multiple receivers
Multiple exporters
Signals of OpenTelemetry - Metrics
● Differences from Prometheus metrics
○ Summary is marked deprecated
○ Some more functions & data type flexibility
Signals of OpenTelemetry -Tracing
Data model of an OpenTelemetry Span
Example tracing for a news article.
Source - https://www.timescale.com/blog/what-are-traces-and-how-sql-yes-sql-
and-opentelemetry-can-help-us-get-more-value-out-of-traces-to-build-better-
software/
Questions ?
Thank you!

More Related Content

Observability for Application Developers (1)-1.pptx

  • 2. About me Abhinav Kapoor • Principal Engineer, Nagarro • Work as a Technical Architect for .NET Solutions. • 15+ years of technical experience in architecting & developing solutions for Banking, Supply chain & Embedded domains. • I enjoy writing about software architecture, communication patterns, modernization & more on Medium.com https://medium.com/@kapoorabhinav. Accreditations I‘m available on Linked at https://www.linkedin.com/in/abhinav-kapoor 0394456/
  • 3. What we’ll cover 1. Why do we need Observability? 2. How does it fit in a cloud native landscape? 3. Application Instrumentation 1. Logs 2. Traces 3. metrics 4. Health Check 4. OpenTelemetry 1. What is it & why is it needed? 2. Its components. 3. Its signals
  • 4. Why do we need Observability? ● We want to avoid these reactions
  • 5. Why do we need Observability? ● Issue Timeline in Operation Support Detect Identify Fix Deploy Issue Start Issue End Mean Time To Repair (MTTR) Mean Time To Identify (MTTI) Mean Time To Detect (MTTD)
  • 6. Why is Observability more critical now? ● Evolution to smaller but more services - high cohesion and low coupling. “Complexities arises when the dependencies among the elements become important” - We have reduced code complexity but introduced much more moving parts
  • 7. Why is Observability more critical now? Technology transformation trends Microservices Polyglot Applications Ephemeral & cloud native environments
  • 8. How does it fit in a cloud native landscape?
  • 9. How does it fit in a cloud native landscape? Instrument Capture Transport Analyze Act SDK, Agent Scrapping/ Agent Dashboard Alerts / Notifications
  • 10. How does it fit in a cloud native landscape? Active Directory / IAM Certificate Manager Resources – DB, Broker, API Gateway Service A Sidecar Platform / Orchestrator Service B Sidecar Shared Services, Infrastructure Applications Audit logs, Activity logs, Metrics Central Aggregation & Collection Application logs, metrics, traces Monitoring & Alerts Access Control
  • 11. Application Instrumentation – What should I know & do as a developer ?
  • 12. Application Instrumentation ● Logs – What happened & who did it ? ● Traces – How did it happen ? ● Metrics – How much happened ? ● Health – How is the system doing ?
  • 13. Big picture ● We don’t just want to have connected black boxes. ● While infrastructure & sidecars can give a good overview, there are always blind spots. ● It speeds up development. Logs, metrics, Traces, Health Logs, metrics, Traces, Health Logs, metrics, Traces, Health Logs, metrics, Traces, Health
  • 14. Logs – What Happened ?
  • 15. Logs – Components of Logging Logging Framework Log Message - Severity, tags Application SDK Filter based on Severity - Destination Specific Log Providers ILogger
  • 16. Types of Logs ● Based on Information they contain – Governs where they are stored, how long they are stored, who can access them ○ Technical – Startup configurations, system event that the application receives ○ Business - Business and User Information ● Based on the schema ○ Unstructured – Simple statements with severity. ○ Structured / Schematic logs - Logs with Metadata to provide context. Pod, Service, IP address & more.
  • 17. Duality of Logs - Readable messages & Quarriable metrics
  • 18. Traces - How did it happen ?
  • 19. Traces – Types of traces ● Tracing – This is similar to logging. But noisier and collects more information from deeper parts of the application. ● Distributed Trancing – It’s the method to observe requests as they propagate through different services. ○ Its structured log which comes from different processes, nodes and can be stitched together to give an end to end view.
  • 20. Distributed Tracing Trace Collector Services Web API Broker Background Services Traces HTTP Request with tracing context TCP message with tracing context TCP message with tracing context DB Store tracing context with data in db Services Read tracing context • Correlation IDs are passed in the remote calls/messages. • The receiver regenerates context proceeds with its unit of work. • Depending upon business transaction, traces can be of few seconds to hours or days.
  • 21. Metrics–How much is happening?
  • 22. Metrics – Components Vendor specific Instrumentation & collection Metrics- Name, description, Labels & Value Application Backend – time series database - Adds timestamp - Supports aggregation (eg PromQL) SDK /Metrics scrape Visualization PromQL expressions Alerts
  • 23. Types of Prometheus Metrics - Counters Track the occurrence of an event in application. ● The absolute value may not be helpful. ● The delta between two timestamps or the rate of change over time is helpful. Functions like Increase(), Rate(). ● Consider the case when applications restart. ● Examples - Orders processed counter, unsuccessful counters, business error, technical errors. ● Sample output of meta-endpoint (/metrics) # HELP http_requests_total Total number of http api requests # TYPE http_requests_total counter http_requests_total{api=“add_Product“, company=“ITC”} 3433 counter http_requests_total{api=“add_Product“, company =“TATA”} 2000 PromQL increase(http_requests_total{api="add_product“, company=“ITC”}[5m])
  • 24. Types of Prometheus Metrics - Gauges Snapshots of a metric at a single point in time. Not an event. ● Example is message queue size, CPU utilization at a given time. ● Useful functions - avg_over_time, max_over_time, min_over_time, and quantile_over_time
  • 25. Types of Prometheus Metrics - Histograms ● Histograms divide the entire range of measurements into a set of intervals— named buckets—and count how many measurements fall into each bucket. ● These buckets are defined at compile time, they are upper inclusive (le). ● Histogram looks like the following http_request_duration_seconds_sum{api="add_product" instance=" host1 "} 8953.332 http_request_duration_seconds_count{api="add_product" instance=" host1"} 27892 http_request_duration_seconds_bucket{api="add_product", instance=" host1", le="0.05"} 1672 http_request_duration_seconds_bucket{api="add_product", instance=" host1", le="0.1"} 8954 http_request_duration_seconds_bucket{api="add_product", instance=" host1", le="0.25"} 14251 sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
  • 26. Types of Prometheus Metrics - Summaries ● Similar to Histogram but they add quantile label to bound the summary. ● OpenTelemetry has been marked it legacy. ● Summary looks like the following http_request_duration_seconds_sum{api="add_product" instance="host1.domain.com"} 8953.332 http_request_duration_seconds_count{api="add_product" instance="host1.domain.com"} 27892 http_request_duration_seconds{api="add_product" instance=" host1 " quantile="0"} http_request_duration_seconds{api="add_product" instance="host1" quantile="0.5"} 0.232227334 ● Considerations due to quantile.
  • 27. Metrics - What happens after instrumentation? ● Enrichthe application instrumentation with infrastructure information such as Service Name, Pod, IP address. ● The backend saves it with timestamp ● Add Alerts & dashboards. Help available at https://grafana.com/grafana/dashboards/
  • 28. Health Check– How is the system doing ?
  • 29. Health check ● Basic use is liveness probe. ● More useful when we add dependencies. ● Prebuilt libraries depending upon the language ● Considerations ○ Orchestrator may restart the service while the issue is in a dependency. ○ Checking all downstream services could be time consuming. ○ Meta-end points (/health) may not be exposed. Instead of writing to the response metrics may be used to indicate issue.
  • 31. What is OpenTelemetry ? ● It is an open-source observability framework for infrastructure & application instrumentation. ● It’s the second-highest ranked CNCF project by activity and contributors after K8s ● It aims to be vendor agnostic & the SDK is produced by OpenTelemetry. ● Its still young so not all features are available for all languages. ● As Open Telemetry doesn’t provide a backend implementation (its concern is creating, collecting, and sending signals), the data will flow to another system or systems for storage and querying.
  • 32. Why OpenTelemetry? ● Before Open Telemetry ○ APM vendors had their own things – SDKs, Agents, Grammer ○ Vendor lock-in. ○ Application developers, built wrappers, event hooks to keep application isolated. ○ Competing Open standards – OpenTracing (Traces), OpenCensus (Metrics, Traces)
  • 33. OpenTelemetry Library - Instrumentation (Automatic & Manual) Exporters Exporter - Backends/APM Application Trace, metrics, Logs SDK Components of OpenTelemetry Collector Multiple receivers Multiple exporters
  • 34. Signals of OpenTelemetry - Metrics ● Differences from Prometheus metrics ○ Summary is marked deprecated ○ Some more functions & data type flexibility
  • 35. Signals of OpenTelemetry -Tracing Data model of an OpenTelemetry Span Example tracing for a news article. Source - https://www.timescale.com/blog/what-are-traces-and-how-sql-yes-sql- and-opentelemetry-can-help-us-get-more-value-out-of-traces-to-build-better- software/

Editor's Notes

  1. We’ll take the questions at the end.
  2. Monitoring is collecting metrics & logs Observability is the ability to observe what is going on inside the system. Consider it an NFR.
  3. MTTD -> metrics or synthetic tests MTTI -> Traces
  4. Emergence for micro-services – Lot more Story of a certificate expiring somewhere instead of a single server.
  5. Everything writing to one file. Or few files is lot more simple. Lot more moving parts. Speed of development
  6. Conceptual View
  7. Its happening at multiple levels
  8. Logs – important events. State changes, who did. Trances – Are flow within the program or outside the program. Matrics – are numbers which can be aggregated to have formulas – SLOs, KPI & alerts Health – current state of health
  9. Automatic & manual – instrumentation
  10. Cost!!
  11. All services should have it.
  12. Do consider the situation when it turns to zero. https://grafana.com/grafana/dashboards/ including ones from OPSTREE
  13. May be used in cases where strict KPIs are enforced on duration for example
  14. https://grafana.com/grafana/dashboards/ Which help me detect issue & fix the issues faster. Which help management find datapoints. Which can report compliance & SLO.
  15. Alternatives 1. Have a monitoring on solution 2. cache the results
  16. Receivers can be used in migration
  17. As a developer, you don’t really need to worry about the models—but having a basic understanding helps.