Tracing and Metrics: Getting the Most Out of Istio

Are you considering or using a service mesh to help manage your microservices infrastructure? If so, here are some basics on how a service mesh can help, the different architectural options, and tips and tricks on using some key CNCF tools that integrate well with Istio to get the most out of it.

The beauty of a service mesh is that it bundles so many capabilities together, freeing engineering teams from having to spend inordinate amounts of time managing microservices architectures. Kubernetes has solved many build and deploy challenges, but it is still time consuming and difficult to ensure reliability and security at runtime. A service mesh handles the difficult, error-prone parts of cross-service communication such as latency-aware load balancing, connection pooling, service-to-service encryption, instrumentation, and request-level routing.

Once you have decided a service mesh makes sense to help manage your microservices, the next step is deciding what service mesh to use. There are several architectural options, from the earliest model of a library approach, the node agent architecture, and the model which seems to be gaining the most traction – the sidecar model. We have also seen an evolution from data plane proxies like Envoy, to service meshes such as Istio which provide distributed control and data planes. We're active users of Istio, and believers in the sidecar architecture striking the right balance between a robust set of features and a lightweight footprint, so let’s take a look at how to get the most out of tracing and metrics with Istio.

Tracing

One of the capabilities Istio provides is distributed tracing. Tracing provides service dependency analysis for different microservices and it provides tracking for requests as they are traced through multiple microservices. It’s also a great way to identify performance bottlenecks and zoom into a particular request to define things like which microservice contributed to the latency of a request or which service created an error.

We use and recommend Jaeger for tracing as it has several advantages:

  • OpenTracing compatible API
  • Flexible & scalable architecture
  • Multiple storage backends
  • Advanced sampling
  • Accepts Zipkin spans
  • Great UI
  • CNCF project and active OS community

Metrics

Another powerful thing you gain with Istio is the ability to collect metrics. Metrics are key to understanding historically what has happened in your applications, and when they were healthy compared to when they were not. A service mesh can gather telemetry data from across the mesh and produce consistent metrics for every hop. This makes it easier to quickly solve problems and build more resilient applications in the future.

We use and recommend Prometheus for gathering metrics for several reasons:

  • Pull model
  • Flexible query API
  • Efficient storage
  • Easy integration with Grafana
  • CNCF project and active OS community

We also use Cortex, which is a powerful tool to enhance Prometheus. Cortex provides:

  • Long term durable storage
  • Scalable Prometheus query API
  • Multi-tenancy

Check out this webinar for a deeper look into what you can do with these tools and more.


Observability, or "Knowing What Your Microservices Are Doing"

Microservicin’ ain’t easy, but it’s necessary. Breaking your monolith down into smaller pieces is a must in a cloud native world, but it doesn’t automatically make everything easier. Some things actually become more difficult. An obvious area where it adds complexity is communications between services; observability into service to service communications can be hard to achieve, but is critical to building an optimized and resilient architecture.

The idea of monitoring has been around for a while, but observability has become increasingly important in a cloud native landscape. Monitoring aims to give an idea of the overall health of a system, while observability aims to provide insights into the behavior of systems. Observability is about data exposure and easy access to information which is critical when you need a way to see when communications fail, do not occur as expected or occur when they shouldn’t. The way services interact with each other at runtime needs to be monitored, managed and controlled. This begins with observability and the ability to understand the behavior of your microservice architecture.

A primary microservices challenges is trying to understand how individual pieces of the overall system are interacting. A single transaction can flow through many independently deployed microservices or pods, and discovering where performance bottlenecks have occurred provides valuable information.

It depends who you ask, but many considering or implementing a service mesh say that the number one feature they are looking for is observability. There are many other features a mesh provides, but those are for another blog. Here, I’m going to cover the top observability features provided by a service mesh.

Tracing

An overwhelmingly important things to know about your microservices architecture is specifically which microservices are involved in an end-user transaction. If many teams are deploying their dozens of microservices, all independently of one another, it’s difficult to understand the dependencies across your services. Service mesh provides uniformity which means tracing is programming-language agnostic, addressing inconsistencies in a polyglot world where different teams, each with its own microservice, can be using different programming languages and frameworks.

Distributed tracing is great for debugging and understanding your application’s behavior. The key to making sense of all the tracing data is being able to correlate spans from different microservices which are related to a single client request. To achieve this, all microservices in your application should propagate tracing headers. If you’re using a service mesh like Aspen Mesh, which is built on Istio, the ingress and sidecar proxies automatically add the appropriate tracing headers and reports the spans to a tracing collector backend. Istio provides distributed tracing out of the box making it easy to integrate tracing into your system. Propagating tracing headers in an application can provide nice hierarchical traces that graph the relationship between your microservices. This makes it easy to understand what is happening when your services interact and if there are any problems.

Metrics

A service mesh can gather telemetry data from across the mesh and produce consistent metrics for every hop. Deploying your service traffic through the mesh means you automatically collect metrics that are fine-grained and provide high level application information since they are reported for every service proxy. Telemetry is automatically collected from any service pod providing network and L7 protocol metrics. Service mesh metrics provide a consistent view by generating uniform metrics throughout. You don’t have to worry about reconciling different types of metrics emitted by various runtime agents, or add arbitrary agents to gather metrics for legacy apps. It’s also no longer necessary to rely on the development process to properly instrument the application to generate metrics. The service mesh sees all the traffic, even into and out of legacy “black box” services, and generates metrics for all of it.

Valuable metrics that a service mesh gathers and standardizes include:

  • Success Rates
  • Request Volume
  • Request Duration
  • Request Size
  • Request and Error Counts
  • Latency
  • HTTP Error Codes

These metrics make it simpler to understand what is going on across your architecture and how to optimize performance.

Most failures in the microservices space occur during the interactions between services, so a view into those transactions helps teams better manage architectures to avoid failures. Observability provided by a service mesh makes it much easier to see what is happening when your services interact with each other, making it easier to build a more efficient, resilient and secure microservice architecture.