OpenTelemetry for the First Timers

Introduction

Let us start by asking this question:

What is observability and why do we need telemetry?

Observability and telemetry might sound like interchangeable words, but they are not. Observability is the whole pipeline, from instrumenting the code to viewing the resulting data.

Telemetry is the instrumentation part, and it is the part this post mostly talks about.

In my words, telemetry provides data about a running system that can help the business make better-informed decisions. Telemetry enables software developers to proactively identify issues, understand the root cause of problems, and optimize performance in real time.

Logs are the most basic kind of information we can gather from a running system, provided we have put proper log statements in the codebase. The act of writing console.log or log.Println in the code to print out the state of the system can be called instrumentation too. In fact, logging is one of the three mechanisms we can instrument with. We’ll use the word instrumentation a lot in this post. ;)

Later in this post, we will see that logs are not the only type of information we can do analytics on.

Let us first see what OpenTelemetry is.

What is OpenTelemetry?

OpenTelemetry is an open standard for telemetry in software systems.

As per the OpenTelemetry landing page:

OpenTelemetry is a collection of tools, APIs, and SDKs. Use it to instrument, generate, collect, and export telemetry data (metrics, logs, and traces) to help you analyze your software’s performance and behavior.

A more formal introduction can be found at: https://opentelemetry.io/docs/concepts/what-is-opentelemetry/

Why OpenTelemetry?

It’s open source
It’s a standard, and implementations are available in a wide range of programming languages.

What to expect from this post

I was not an OTel wizard when I started writing this post. OpenTelemetry has been a recent encounter for me at my workplace. I had heard about it previously but had never had any hands-on experience with it.

This article is nothing more than an introduction to OpenTelemetry from my perspective.

But what I can say is, if you have previously used AWS CloudWatch, you might be able to relate.

Understanding the Concepts

Telemetry refers to data collected from a running system. This data can come in these forms:

Logging
Metrics
Tracing

Logging: Logs are timestamped pieces of information emitted from a service. If logs are your only tool for observing a system, there are better alternatives for some questions.
Metrics: Metrics are measurements collected over a period of time. They can include the error rate, CPU utilization, request rate, etc. for a given service.
Tracing: Trace is the new Log. Plain logs are not tied to any particular user request or transaction unless you add that correlation yourself, while traces are tied to one by design. That is very useful for following a request through the code.

Each of these comes with its own concepts, so let’s look at them.

Logs

Data that is not part of a metric or a trace is known as a log.

In the context of OTel, an Event is a type of log.

Metrics

Metrics are important indicators of availability and performance. The collected data can be used to alert on an outage or to trigger scheduling decisions that scale up a deployment automatically under high demand.

This is the part you might be able to relate to AWS CloudWatch and ASG.

Frankly speaking, I don’t have first-hand experience with Metrics on OTel. But when I went through the docs, I found some interesting information:

OpenTelemetry defines three metric instruments today:
counter: a value that is summed over time – you can think of this like an odometer on a car; it only ever goes up.
measure: a value that is aggregated over time. This is more akin to the trip odometer on a car, it represents a value over some defined range.
observer: captures a current set of values at a particular point in time, like a fuel gauge in a vehicle.

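Newer versions of the specification rename and extend these instruments (Counter, UpDownCounter, Histogram, and Gauge, plus asynchronous variants). The docs then go on to say this: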

Unlike request tracing, which is intended to capture request lifecycles and provide context to the individual pieces of a request, metrics are intended to provide statistical information in the aggregate. Some examples of use cases for metrics include:
Reporting the total number of bytes read by a service, per protocol type.
Reporting the total number of bytes read and the bytes per request.
Reporting the duration of a system call.
Reporting request sizes in order to determine a trend.
Reporting CPU or memory usage of a process.
Reporting average balance values from an account.
Reporting current active requests being handled.

You can read more about it here: https://opentelemetry.io/docs/concepts/signals/metrics/
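Going back to the three instruments quoted above, here is how I picture them in Go. This is a toy model in plain Go, not the OpenTelemetry metrics API. The Counter, Measure, and Observer types are names I made up to mirror the quote.

```go
package main

import "fmt"

// Counter only ever goes up, like the odometer on a car.
type Counter struct{ total int64 }

func (c *Counter) Add(n int64) {
	if n < 0 {
		return // ignore negative increments so the counter stays monotonic
	}
	c.total += n
}

// Measure aggregates recorded values over some range, like a trip odometer.
type Measure struct {
	count int
	sum   float64
}

func (m *Measure) Record(v float64) {
	m.count++
	m.sum += v
}

// Mean returns the average of the recorded values, or 0 if there are none.
func (m *Measure) Mean() float64 {
	if m.count == 0 {
		return 0
	}
	return m.sum / float64(m.count)
}

// Observer reads the current value on demand, like a fuel gauge.
type Observer struct{ read func() float64 }

func (o Observer) Observe() float64 { return o.read() }

func main() {
	var requests Counter
	requests.Add(1)
	requests.Add(2)

	var latencyMs Measure
	latencyMs.Record(10)
	latencyMs.Record(30)

	fuel := 0.75
	gauge := Observer{read: func() float64 { return fuel }}

	fmt.Println(requests.total, latencyMs.Mean(), gauge.Observe())
}
```

The difference to notice is who drives the reading: the code pushes values into the counter and the measure, while the observer is asked for its value whenever someone wants to know.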

Traces

To understand what a trace is, imagine you have a system that has 3 microservices and when you make a request, the request goes through 2 of them depending on the flow of the app.

The idea of tracing is to be able to relate information about the execution of two or more different services/processes to a single request. From the caller's point of view, that is what those 3 services as a whole are: a single entity.

Traces are our main area of focus today. As we read above, traces are often preferred over conventional logs for following a request. Here are some questions to ask if you are new to the OpenTelemetry world.

What is a Tracer?

A Tracer is the object that creates spans. It does not store the trace itself; each span it creates records which trace it belongs to. Suppose you have a system that has 3 microservices and when you make a request, the request goes through 2 of them depending on the flow of the app. Each of those 2 processes has its own Tracer, yet the spans they create end up as part of a single trace.

One Tracer can create multiple Spans.

Tracers are created by TracerProviders.

What is a TracerProvider?

A TracerProvider is a factory for Tracers.

TracerProvider initialization also includes Resource and Exporter initialization. It is typically the first step in tracing with OpenTelemetry.

What is a Resource?

A Resource describes the application being instrumented, for example by its service name. This identifying information travels with the telemetry and is used later, in the visualization phase, to tell where the data came from.

What is an Exporter?

Let’s talk about Exporter or Trace Exporters now.

To visualize and analyze your traces and metrics, you will need to export them to a backend.

Exporters are packages that allow telemetry data to be emitted somewhere: to the console, to a remote system, or to a collector for further analysis and/or enrichment. OpenTelemetry supports a variety of exporters through its ecosystem, including ones for popular open-source tools like Jaeger, Zipkin, and Prometheus, as well as the OTLP exporter, whose protocol is understood by many community packages.
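To see how the last four pieces fit together, here is a toy model in plain Go. It is not the OpenTelemetry SDK: every type and method below (Resource, Exporter, ConsoleExporter, MemoryExporter, TracerProvider, Tracer, RecordSpan) is a simplified stand-in I wrote for illustration. It only mirrors the shape of the setup, which is to build a Resource and an Exporter, hand them to a TracerProvider, and ask the provider for a Tracer.

```go
package main

import "fmt"

// Resource identifies the application being instrumented.
type Resource struct{ ServiceName string }

// Exporter receives finished spans and sends them somewhere.
type Exporter interface {
	Export(line string)
}

// ConsoleExporter prints each span to standard output.
type ConsoleExporter struct{}

func (ConsoleExporter) Export(line string) { fmt.Println(line) }

// MemoryExporter keeps spans in a slice, which is handy in tests.
type MemoryExporter struct{ Lines []string }

func (m *MemoryExporter) Export(line string) { m.Lines = append(m.Lines, line) }

// TracerProvider holds the shared Resource and Exporter and hands out Tracers.
type TracerProvider struct {
	res Resource
	exp Exporter
}

func NewTracerProvider(res Resource, exp Exporter) *TracerProvider {
	return &TracerProvider{res: res, exp: exp}
}

// Tracer returns a named Tracer that shares the provider's configuration.
func (p *TracerProvider) Tracer(name string) Tracer {
	return Tracer{name: name, provider: p}
}

// Tracer creates spans; it does not store the trace itself.
type Tracer struct {
	name     string
	provider *TracerProvider
}

// RecordSpan stands in for starting and ending a span in one step.
func (t Tracer) RecordSpan(span string) {
	t.provider.exp.Export(fmt.Sprintf("service=%s tracer=%s span=%s",
		t.provider.res.ServiceName, t.name, span))
}

func main() {
	tp := NewTracerProvider(Resource{ServiceName: "checkout"}, ConsoleExporter{})
	tracer := tp.Tracer("payments")
	tracer.RecordSpan("charge-card")
}
```

Swapping ConsoleExporter for another Exporter changes where the data goes without touching the code that creates spans, which is the main reason the exporter is configured on the provider.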

What is a Span?

Continuing the example of 3 microservices: we made only one request to the entry point of the system, but several services got invoked. When the request reaches the first service, that service starts a Span. When it realizes it has a dependency and calls the second service, it passes the trace ID and its own span ID along with the call, and the second service records its work as a child Span in the same trace.

Spans are part of a Trace. One Trace can have multiple spans and one span can have other sub-spans. A Span can consist of execution time data, logs, and attributes, all of which can be set through the SDK.

When we create a child span, we have to pass it the context of its parent so that the two can be linked and visualized together later in the observability pipeline.

Each span can have multiple kinds of data associated with it. You can read more at Spans in OpenTelemetry.

What is a Collector?

The main function of an OpenTelemetry Collector is to collect, process, and export telemetry data from applications and services. The collector serves as a central hub for collecting telemetry data from a variety of sources, including distributed tracing data, logs, and metrics.

The Collector then forwards this data to backends such as Jaeger, SigNoz, Prometheus, etc.
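The collect, process, and export flow can be pictured as a small pipeline. The Go sketch below is only a mental model and not how the Collector is implemented or configured. The Record, Processor, and Pipeline types, the two processors, and the attribute value are assumptions made for the example.

```go
package main

import "fmt"

// Record is a hypothetical telemetry item received from an application.
type Record struct {
	Service string
	Name    string
	Attrs   map[string]string
}

// Processor transforms a record; returning false drops it.
type Processor func(Record) (Record, bool)

// Pipeline models the flow: receive -> process -> export.
type Pipeline struct {
	Processors []Processor
	Export     func([]Record)
}

// Receive runs every record through the processors and exports the survivors.
func (p Pipeline) Receive(batch []Record) {
	var out []Record
	for _, r := range batch {
		keep := true
		for _, proc := range p.Processors {
			if r, keep = proc(r); !keep {
				break
			}
		}
		if keep {
			out = append(out, r)
		}
	}
	p.Export(out)
}

// dropHealthChecks filters out noisy health check requests.
func dropHealthChecks(r Record) (Record, bool) {
	return r, r.Name != "GET /healthz"
}

// addEnvironment enriches a record without mutating the caller's map.
func addEnvironment(r Record) (Record, bool) {
	attrs := map[string]string{"deployment.environment": "staging"}
	for k, v := range r.Attrs {
		attrs[k] = v
	}
	r.Attrs = attrs
	return r, true
}

func main() {
	p := Pipeline{
		Processors: []Processor{dropHealthChecks, addEnvironment},
		Export: func(rs []Record) {
			for _, r := range rs {
				fmt.Printf("%s %s %v\n", r.Service, r.Name, r.Attrs)
			}
		},
	}
	p.Receive([]Record{
		{Service: "cart", Name: "GET /healthz"},
		{Service: "cart", Name: "GET /cart"},
	})
}
```

Doing the filtering and enrichment in one central place means the applications themselves can stay simple and just send everything they have.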

Instrumenting with OpenTelemetry

Instrumentation for most popular frameworks is already available. So you just need to modify your code in one place, where the framework is loaded, instead of instrumenting every handler. This is known as automatic instrumentation.

However, it is not always possible to use automatic instrumentation. Maybe the framework is not supported yet, or you want to observe something that is outside the scope of the current implementations. In that case, you go back to the good old days and write the instrumentation manually.

In the next post, we will see an example of instrumenting a Go application, using the Go SDK to export traces to the console. More info: https://opentelemetry.io/docs/instrumentation/go/getting-started/