
What Is OpenTelemetry? The Ultimate Guide

OpenTelemetry is not an observability platform, but rather a set of best practices and standards that can be integrated into platform engineering or DevOps.

What Is OpenTelemetry?

OpenTelemetry offers a standardized way of capturing observability data. It is vendor-neutral and is used to make sense of telemetry data consisting of metrics, logs, traces and standardized metadata, with more types of signals still being developed. But OpenTelemetry is more than just vendor-neutral: it is designed to let users integrate the observability tools of their choice into a common, unified approach. It is not a tool or a platform, but rather an approach, akin to a set of best practices and standards integrated into platform engineering or DevOps.

One of the most exciting features of OpenTelemetry is that it provides a standard data format and protocol for observability data. Standardization is crucial because it gives observability vendors, and those building systems from scratch, a consistent framework, allowing different observability tools or processes to be used interchangeably or together in a single solution or view. If you want to change observability providers without upgrading everything or starting over, another tool or platform can integrate seamlessly, thanks to the standardization OpenTelemetry was designed to provide.
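As a minimal sketch of what that standardization looks like in practice, the following Python snippet (using the opentelemetry-api and opentelemetry-sdk packages) instruments a span once and exports it to the console; switching to a different OTLP-compatible backend means swapping only the exporter in the setup code, never the instrumentation. The service and span names here are hypothetical.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# SDK wiring, done once at startup; the exporter is the only backend-specific piece.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

# Instrumentation stays the same regardless of which backend receives the data.
tracer = trace.get_tracer("checkout-service")  # hypothetical service name
with tracer.start_as_current_span("charge-card") as span:
    span.set_attribute("order.id", "1234")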

What Is Observability?

Telemetry data consisting of logs, metrics and, more recently, traces provides the raw material to be collected and scrutinized. But collecting it and watching it through monitoring does not mean much if the data has not been parsed or channeled appropriately to filter out irrelevant telemetry.

What this means in practice is that users look to observability to understand how their systems behave and to act on that understanding. Usually, you can ask your live collection of system data questions such as why a server ran out of memory, why a trace is slow, why a request is slow or why there are error logs, and get an answer without instrumenting anything new.

At the same time, observing events or performance as an operator using different telemetry data is certainly useful to a point. But it falls short of observability, which involves analyzing the data collected through monitoring and drawing actionable insights from it.

What Are the Components of OpenTelemetry Architecture?

Source: “Learning OpenTelemetry.” 

OpenTelemetry supports observability. It uses libraries instrumented with the OpenTelemetry API to collect telemetry data from various sources, such as databases, clients, HTTP and other systems. Users can choose between the API/SDKs and the automatic instrumentation agents that OpenTelemetry provides for most languages; most choose the instrumentation agents, said Morgan McLean, an OpenTelemetry co-founder and senior director of product management for Splunk. Most users also run the OpenTelemetry Collector, which captures system metrics; system and application logs; metrics from third-party applications such as databases and message queues; and telemetry from other sources such as OpenTelemetry SDKs, OpenTelemetry language agents, and Prometheus and Telegraf agents. “The Collector is the most widely used OpenTelemetry component, by far,” McLean said.
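As a hedged sketch of how an application hands its telemetry to the Collector, the snippet below exports spans over OTLP to a Collector assumed to be listening on localhost at the default OTLP/gRPC port 4317; the endpoint, service name and span name are illustrative assumptions, not requirements.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Assumes an OpenTelemetry Collector is reachable at localhost:4317 (the default OTLP/gRPC port).
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

# The Collector decides where the data goes next (another backend, storage, etc.).
tracer = trace.get_tracer("inventory-service")  # hypothetical service name
with tracer.start_as_current_span("reserve-stock"):
    pass  # business logic would run here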

The API and SDK are decoupled. The main benefit of decoupling the API and SDK is that custom instrumentation driven by service owners can be made sustainable, said Cedric Ziel, senior product manager at Grafana Labs. “Should they ever need to switch to a different OpenTelemetry-compatible SDK provider, they can be certain that given they relied on the contract of the API, the custom instrumentation is portable,” Ziel said.
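A minimal sketch of that contract, assuming a hypothetical payments library: the code below depends only on the OpenTelemetry API, so it is a no-op until the host application installs an SDK, and it keeps working unchanged if that SDK is later swapped for another OpenTelemetry-compatible one.

# Library code: depends only on opentelemetry-api, never on a specific SDK.
from opentelemetry import trace

tracer = trace.get_tracer("payments-lib")  # hypothetical library name

def charge(order_id: str, amount_cents: int) -> None:
    # Produces a real span only if the host application installed an SDK; otherwise a no-op.
    with tracer.start_as_current_span("payments.charge") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("charge.amount_cents", amount_cents)
        # ... business logic would run here ...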

What Are the Golden Signals?

The bedrock of OpenTelemetry, as well as observability, consists of the so-called golden signals. As defined in the Google SRE handbook, these golden signals include latency, traffic, errors and saturation:

Latency measures how long it takes for requests to be completed.

Traffic gauges the load on the system, indicating whether it is handling an appropriate number of requests or becoming saturated.

Errors quantify the number and types of errors that have occurred, ranging from backend server errors to user errors.

Saturation assesses a system’s capacity and to what extent it has become saturated. This might include RAM consumption, CPU consumption and network bandwidth, essentially any factor that could tax a system to the point of producing error messages.

The manifestation of these golden signals is evident in the three pillars of observability: traces, logs, and metrics. Metrics are the quantification of the various golden signals and can be applied to each of these signals either separately or concurrently. Logs provide detailed records and timestamps of activities, whether errors occurred or not, and are essential for pinpointing issues and troubleshooting. Tracing is arguably the most critical aspect, as it involves capturing all events that occur along the path of a transaction. Through tracing, one can analyze the interactions of different signals and metrics to derive inferences about past outages, predict future issues, and improve overall system performance.
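As a rough sketch of how these signals map onto OpenTelemetry instruments, the Python snippet below records latency as a histogram and traffic and errors as counters; the meter name, instrument names and attributes are illustrative choices, not a prescribed schema, and saturation signals such as CPU and memory usage would typically come from host or runtime metrics gathered by the Collector.

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import ConsoleMetricExporter, PeriodicExportingMetricReader

# SDK wiring: periodically flush metrics to the console, just for this example.
reader = PeriodicExportingMetricReader(ConsoleMetricExporter())
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("http-server")  # hypothetical meter name
request_counter = meter.create_counter("http.server.requests", unit="1")        # traffic
error_counter = meter.create_counter("http.server.errors", unit="1")            # errors
latency_histogram = meter.create_histogram("http.server.duration", unit="ms")   # latency

# Record one hypothetical request.
request_counter.add(1, {"http.route": "/checkout"})
latency_histogram.record(42.0, {"http.route": "/checkout"})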

Source: “Learning OpenTelemetry.” 

OpenTelemetry, as mentioned above, facilitates this process by standardizing the collection of metrics, traces and logs. As Ted Young and Austin Parker describe in their book “Learning OpenTelemetry,” metrics, traces and logs are often misleadingly referred to as the “three pillars of observability.” They are more accurately intertwined and correlated to make proper observability possible.

“While the term ‘three pillars’ does explain the way traditional observability is architected, it is also problematic — it makes this architecture sound like a good idea! Which it isn’t,” Young and Parker write. “It’s cheeky, but I prefer a different turn of phrase — the ‘three browser tabs of observability.’ Because that’s what you’re actually getting.”

Indeed, integrating metrics, logs and traces is necessary in order to “achieve full visibility by using them interchangeably,” Yrieix Garnier, vice president of products for Datadog, said. “When you dissociate them, it doesn’t work. You end up with three different silos, lacking real traceability and root-cause analysis.”

OpenTelemetry offers a standardized process for observability. It can be thought of as three main components: the standards, the SDKs and the Collector. The standards ensure interoperability, the SDKs simplify application instrumentation and the Collector acts as a vendor-neutral agent. Again, OpenTelemetry is more than just vendor-neutral: it is designed to let users integrate the observability tools of their choice into a common, unified approach.

What’s Next?

OpenTelemetry has already emerged as an essential component of the observability experience, as its continued development increasingly covers DevOps needs for both developers and operations teams. However, as one of the major open source projects of the moment and arguably already essential to observability, it remains a work in progress. Its success hinges on the continued support and hard work of the community.

OpenTelemetry’s profiling capabilities should prove useful because they deepen observability analysis by extending it to the code level. Profiling complements the metrics, traces and logs pulled together in a unified telemetry stream, reaching down to the code of applications throughout the network, where code-level behavior is analyzed and stored.

In practice, this means that when a problem arises, or when examining performance aspects surfaced by an observability data stream, such as a CPU becoming saturated or an end user’s request taking too long, the profile pinpoints the code at issue. With the right additional observability tools, fixes should come faster, as users can locate problem code more easily through their queries.

The OpenTelemetry Profiler should be finalized this year. It represents the project’s latest milestone following the completion of logs capabilities with OpenTelemetry in 2023.

As stated in the OpenTelemetry project’s documentation, the creators are finalizing client instrumentation to extend OpenTelemetry’s capabilities and offer end-to-end observability for all stakeholders. This extension aims to capture end-to-end (E2E) latency, the chain of backend service events and infrastructure-side performance statistics initiated by a single user interaction.

To achieve comprehensive observability, OpenTelemetry needs to support various platforms, including web pages, JavaScript (JS), mobile applications, and desktop applications. Although OpenTelemetry JS has technically supported capturing spans from web browsers since its initial releases, this behavior was largely unspecified. Additionally, there were no equivalent functionalities for client applications such as Android, iOS, or Windows, according to the documentation.

Following the establishment of the client instrumentation Special Interest Group (SIG) in 2021, the project’s creators are now working to specify client instrumentation behavior. This effort aims to ensure consistency in data capture across developer-facing telemetry interfaces and different types of client applications, thereby providing a unified approach to monitoring and observability, according to the documentation.

Splunk is donating its Android instrumentation to OpenTelemetry, and work is also ongoing for iOS and web, McLean said.

“OpenTelemetry has historically been focused on capturing data from backend services and infrastructure, and we are expanding this to include client telemetry from web browsers, iOS, Android apps and more,” McLean said. “This will allow developers and operators to understand true system performance all the way from local application performance, to internet connection issues, to backend service performance, to backend infrastructure performance.”

How Can Platforms and Tool Providers Support OpenTelemetry?

Leading observability providers offer support, maintenance and development of OpenTelemetry. These include organizations such as Grafana, Honeycomb, Datadog, Splunk and others, which have a collective interest in making OpenTelemetry better. To that end, the future of OpenTelemetry depends primarily on the community and its contributions. As DevOps observability needs change, not least because of AI’s growing role, OpenTelemetry must adapt with them.

Its usefulness also depends on the observability tools and platforms used in conjunction with OpenTelemetry. In other words, OpenTelemetry is not designed to replace observability platforms.

“I think it’s correct to assume that for every reasonably big codebase as soon as people would stop contributing, it would immediately start to rot, while we can say the same thing about almost any kind of project,” Ziel said. “We have to acknowledge that OpenTelemetry has had enough momentum to have vendors create entirely new offers on top of it and incentivize established vendors who had their own models to adapt the new formats.”
