Observability, or "Knowing What Your Microservices Are Doing"

Microservicin’ ain’t easy, but it’s necessary. Breaking your monolith down into smaller pieces is a must in a cloud native world, but it doesn’t automatically make everything easier. Some things actually become more difficult. An obvious area of added complexity is communication between services: observability into service-to-service communication can be hard to achieve, but it is critical to building an optimized and resilient architecture.

The idea of monitoring has been around for a while, but observability has become increasingly important in a cloud native landscape. Monitoring aims to give an idea of the overall health of a system, while observability aims to provide insight into the behavior of that system. Observability is about data exposure and easy access to information, which is critical when you need to see when communications fail, do not occur as expected, or occur when they shouldn’t. The way services interact with each other at runtime needs to be monitored, managed and controlled. This begins with observability and the ability to understand the behavior of your microservice architecture.

A primary microservices challenge is understanding how the individual pieces of the overall system interact. A single transaction can flow through many independently deployed microservices or pods, so knowing where performance bottlenecks occur provides valuable information.

It depends on who you ask, but many teams considering or implementing a service mesh say that the number one feature they are looking for is observability. There are many other features a mesh provides, but those are for another blog. Here, I’m going to cover the top observability features provided by a service mesh.

Tracing

One of the most important things to know about your microservices architecture is which microservices are involved in an end-user transaction. When many teams are deploying dozens of microservices, all independently of one another, it’s difficult to understand the dependencies across your services. A service mesh provides uniformity, which means tracing is programming-language agnostic, addressing inconsistencies in a polyglot world where different teams, each with its own microservice, can be using different programming languages and frameworks.

Distributed tracing is great for debugging and understanding your application’s behavior. The key to making sense of all the tracing data is being able to correlate spans from different microservices which are related to a single client request. To achieve this, all microservices in your application should propagate tracing headers. If you’re using a service mesh like Aspen Mesh, which is built on Istio, the ingress and sidecar proxies automatically add the appropriate tracing headers and report the spans to a tracing collector backend. Istio provides distributed tracing out of the box, making it easy to integrate tracing into your system. Propagating tracing headers in an application yields nice hierarchical traces that graph the relationships between your microservices. This makes it easy to understand what is happening when your services interact and whether there are any problems.

Metrics

A service mesh can gather telemetry data from across the mesh and produce consistent metrics for every hop. Routing your service traffic through the mesh means you automatically collect metrics that are fine-grained and provide high-level application information, since they are reported for every service proxy. Telemetry is automatically collected from every service pod, providing network and L7 protocol metrics. Service mesh metrics give you a consistent view because they are generated uniformly throughout the mesh. You don’t have to worry about reconciling different types of metrics emitted by various runtime agents, or adding arbitrary agents to gather metrics for legacy apps. It’s also no longer necessary to rely on the development process to properly instrument the application to generate metrics. The service mesh sees all the traffic, even into and out of legacy “black box” services, and generates metrics for all of it.

Valuable metrics that a service mesh gathers and standardizes include:

  • Success Rates
  • Request Volume
  • Request Duration
  • Request Size
  • Request and Error Counts
  • Latency
  • HTTP Error Codes

These metrics make it simpler to understand what is going on across your architecture and how to optimize performance.

Most failures in the microservices space occur during the interactions between services, so a view into those transactions helps teams better manage architectures to avoid failures. Observability provided by a service mesh makes it much easier to see what is happening when your services interact with each other, making it easier to build a more efficient, resilient and secure microservice architecture.


Tracing gRPC with Istio

At Aspen Mesh we love gRPC. Most of our public-facing and many internal APIs use it. To give you a brief background in case you haven’t heard about it (which would be difficult, given gRPC’s belle-of-the-ball status), it is a new, highly efficient and optimized Remote Procedure Call (RPC) framework. It is based on the battle-tested protocol buffers serialization format and the HTTP/2 network protocol.

Using the HTTP/2 protocol, gRPC applications benefit from request multiplexing, efficient connection utilization and a host of other enhancements over protocols like HTTP/1.1, which is very well documented here. Additionally, protocol buffers are an easy and extensible way of serializing structured data in a binary format, which in itself gives you significant performance improvements over text-based formats. Combining the two results in a low-latency and highly scalable RPC framework, which is in essence what gRPC is. Additionally, the growing ecosystem lets you write your applications in many supported languages (C++, Java, Go, etc.) and draw on an extensive set of third-party libraries.

Apart from the benefits I listed above, what I like most about gRPC is the simplicity and intuitiveness with which you can specify your RPCs (using the protobuf IDL) and how a client application can invoke methods on the server application as if it were a local function call. A lot of the code (service descriptions and handlers, client methods, etc.) gets auto-generated for you, making it very convenient to use.
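
To make that concrete, here is a minimal sketch of what calling a generated client looks like, using the canonical Greeter service from the gRPC helloworld example (the service name and package path come from that example, not from Aspen Mesh code):

  import (
    "time"

    "golang.org/x/net/context"
    "google.golang.org/grpc"

    // Generated client and message types from the canonical helloworld example
    pb "google.golang.org/grpc/examples/helloworld/helloworld"
  )

  func greet(addr string) (string, error) {
  	// Dial the gRPC server (use real credentials instead of WithInsecure in production)
  	conn, err := grpc.Dial(addr, grpc.WithInsecure())
  	if err != nil {
  		return "", err
  	}
  	defer conn.Close()

  	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
  	defer cancel()

  	// The generated client makes the remote call look like a local function call
  	client := pb.NewGreeterClient(conn)
  	resp, err := client.SayHello(ctx, &pb.HelloRequest{Name: "Aspen Mesh"})
  	if err != nil {
  		return "", err
  	}
  	return resp.Message, nil
  }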

Now that I have laid out some background, let’s turn our attention to the main topic of this blog: how to add tracing to your applications built on gRPC, especially if you’re using Istio or Aspen Mesh.

Tracing is great for debugging and understanding your application’s behavior. The key to making sense of all the tracing data is being able to correlate spans from different microservices which are related to a single client request.

To achieve this, all microservices in your application should propagate tracing headers. If you’re using a service mesh like Istio or Aspen Mesh, the ingress and sidecar proxies automatically add the appropriate tracing headers and report the spans to a tracing collector backend like Jaeger or Zipkin. The only thing left for applications to do is propagate the tracing headers from incoming requests (which the sidecar or ingress proxy adds) to any outgoing requests they make to other microservices.

Propagating Headers from gRPC to gRPC Requests

The easiest way to do tracing header propagation is to use the grpc opentracing middleware library’s client interceptors. This can be used if your application is making a new outbound request upon receiving the incoming request. Here’s the sample code to correctly propagate tracing headers from the incoming to the outgoing request:

  import (
    "github.com/golang/glog"
    grpc_opentracing "github.com/grpc-ecosystem/go-grpc-middleware/tracing/opentracing"
    ot "github.com/opentracing/opentracing-go"
    "golang.org/x/net/context"
    "google.golang.org/grpc"
  )

  // ctx is the incoming gRPC request's context
  // addr is the address for the new outbound request
  func createGRPCConn(ctx context.Context, addr string) (*grpc.ClientConn, error) {
  	var opts []grpc.DialOption
  	opts = append(opts, grpc.WithStreamInterceptor(
  		grpc_opentracing.StreamClientInterceptor(
  			grpc_opentracing.WithTracer(ot.GlobalTracer()))))
  	opts = append(opts, grpc.WithUnaryInterceptor(
  		grpc_opentracing.UnaryClientInterceptor(
  			grpc_opentracing.WithTracer(ot.GlobalTracer()))))
  	conn, err := grpc.DialContext(ctx, addr, opts...)
  	if err != nil {
  		glog.Error("Failed to connect to application addr: ", err)
  		return nil, err
  	}
  	return conn, nil
  }

Pretty simple, right?

Adding the opentracing client interceptors ensures that any new unary or streaming gRPC request made on the client connection injects the correct tracing headers. If the passed context has tracing headers present (which should be the case if you are using Aspen Mesh or Istio and passing the incoming request’s context), then the new span is created as a child of the span present in that context. On the other hand, if the context has no tracing information, a new root span is created for the outbound request.
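
To see how this fits into a server, here’s a rough sketch of a gRPC handler that reuses the incoming request’s context when dialing another service with createGRPCConn. The Backend service, its methods and the pb package here are hypothetical placeholders, not part of any real library:

  // Hypothetical handler: the Backend service, FetchWidget method and pb
  // package are illustrative placeholders.
  func (s *server) GetWidget(ctx context.Context, req *pb.GetWidgetRequest) (*pb.Widget, error) {
  	// Reuse the incoming context; it carries the span added by the
  	// Istio/Aspen Mesh sidecar, so the outbound RPC becomes a child span.
  	conn, err := createGRPCConn(ctx, "backend.default.svc.cluster.local:9090")
  	if err != nil {
  		return nil, err
  	}
  	defer conn.Close()

  	backend := pb.NewBackendClient(conn)
  	// The unary client interceptor injects the tracing headers into this call
  	return backend.FetchWidget(ctx, &pb.FetchWidgetRequest{Id: req.Id})
  }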

Propagating Headers from gRPC to HTTP Requests

Now let’s look at the scenario where your application makes a new outbound HTTP/1.1 request upon receiving a new incoming gRPC request. Here’s the sample code to accomplish header propagation in this case:

  import (
    "net/http"
    "golang.org/x/net/context"
    "golang.org/x/net/context/ctxhttp"
    "ot "github.com/opentracing/opentracing-go"
  )

  // ctx is the incoming gRPC request's context
  // addr is the address of the application being requested
  func makeNewRequest(ctx context.Context, addr string) {
    if span := ot.SpanFromContext(ctx); span != nil {
      req, _ := http.NewRequest("GET", addr, nil)

      ot.GlobalTracer().Inject(
        span.Context(),
        ot.HTTPHeaders,
        ot.HTTPHeadersCarrier(req.Header))

      resp, err := ctxhttp.Do(ctx, nil, req)
      if err != nil {
        // Handle the error from the outbound request
        return
      }
      // Do something with resp
    }
  }

This is quite standard code for serializing tracing headers from the incoming request’s (HTTP or gRPC) context into the outgoing HTTP request.
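
As a side note, if the incoming request is plain HTTP rather than gRPC, the sidecar still adds the same headers; a rough sketch of the extract side, assuming (as in the examples above) that a B3-propagating tracer has been registered as the OpenTracing global tracer, might look like this:

  // Sketch: extract the span context from an incoming HTTP request's headers
  // and attach a span to the request context, so outbound calls made with ctx
  // are recorded as children of the incoming span.
  func tracedHandler(w http.ResponseWriter, r *http.Request) {
    ctx := r.Context()
    spanCtx, err := ot.GlobalTracer().Extract(
      ot.HTTPHeaders, ot.HTTPHeadersCarrier(r.Header))
    if err == nil {
      span := ot.GlobalTracer().StartSpan("handle-request", ot.ChildOf(spanCtx))
      defer span.Finish()
      ctx = ot.ContextWithSpan(ctx, span)
    }
    // Use ctx for any outbound gRPC or HTTP requests made while handling r
    _ = ctx
  }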

Great! So far we have been able to use libraries or standard utility code to get what we want.

Propagating Headers When Using gRPC-Gateway

One of the libraries commonly used in gRPC applications is the grpc-gateway library, which exposes gRPC services as RESTful JSON APIs. This is very useful when you want to consume gRPC from clients like curl or a web browser that don’t understand it, or when you want to maintain a RESTful architecture. More details on how to expose RESTful APIs using grpc-gateway can be found in this great blog. I highly encourage you to read it if you’re new to this architecture.

When you start using grpc-gateway and want to propagate tracing headers, there are a few very interesting interactions worth mentioning. The grpc-gateway documentation states that all IANA permanent HTTP headers are prefixed with grpcgateway- and added as request headers. This is great, but since tracing headers like x-b3-traceid, x-b3-spanid, etc. are not IANA-recognized permanent HTTP headers, they are not copied over to gRPC requests when grpc-gateway proxies HTTP requests. This means that as soon as you add grpc-gateway to your application, the header propagation logic stops working.

Isn’t that typical? You add one awesome thing and it breaks the currently working setup. No worries, I have a solution for you!

Here’s a way to ensure you don’t lose the tracing information when proxying between HTTP and gRPC using grpc-gateway:

  import (
    "net/http"
    "golang.org/x/net/context"
    "google.golang.org/grpc/metadata"
    "github.com/grpc-ecosystem/grpc-gateway/runtime"
  )

  const (
  	prefixTracerState  = "x-b3-"
  	zipkinTraceID      = prefixTracerState + "traceid"
  	zipkinSpanID       = prefixTracerState + "spanid"
  	zipkinParentSpanID = prefixTracerState + "parentspanid"
  	zipkinSampled      = prefixTracerState + "sampled"
  	zipkinFlags        = prefixTracerState + "flags"
  )

  var otHeaders = []string{
  	zipkinTraceID,
  	zipkinSpanID,
  	zipkinParentSpanID,
  	zipkinSampled,
  	zipkinFlags}

  func injectHeadersIntoMetadata(ctx context.Context, req *http.Request) metadata.MD {
  	pairs := []string{}
  	for _, h := range otHeaders {
  		if v := req.Header.Get(h); len(v) > 0 {
  			pairs = append(pairs, h, v)
  		}
  	}
  	return metadata.Pairs(pairs...)
  }

  type annotator func(context.Context, *http.Request) metadata.MD

  func chainGrpcAnnotators(annotators ...annotator) annotator {
  	return func(c context.Context, r *http.Request) metadata.MD {
  		mds := []metadata.MD{}
  		for _, a := range annotators {
  			mds = append(mds, a(c, r))
  		}
  		return metadata.Join(mds...)
  	}
  }

  // Main function of your application. Insert tracing headers into gRPC
  // metadata using annotators
  func run() {
  	...
  	annotators := []annotator{injectHeadersIntoMetadata}

  	gwmux := runtime.NewServeMux(
  		runtime.WithMetadata(chainGrpcAnnotators(annotators...)),
  	)
  	...
  }

In the code above, I have used the runtime.WithMetadata API provided by the grpc-gateway library. This API is useful for reading attributes from the HTTP request and adding them to the gRPC metadata, which is exactly what we want! It’s a little bit more work, but we’re still using the APIs exposed by the library.

The injectHeadersIntoMetadata annotator looks for the tracing headers in the HTTP request and appends them to the metadata, thereby ensuring that the tracing headers can be further propagated from gRPC to outbound requests using the techniques mentioned in the previous sections.

Another interesting thing you might have noticed is the chainGrpcAnnotators wrapper function. The runtime.WithMetadata API only allows a single annotator to be added, which might not be enough for all scenarios. In our case, we had a tracing annotator (like the one shown above) and an authentication annotator which appended auth data from the HTTP request to the gRPC metadata. Using chainGrpcAnnotators allows you to add multiple annotators, and the wrapper function joins the metadata from the various annotators into a single metadata object for the request.
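
To illustrate the chaining, here’s a sketch of a second annotator being combined with the tracing one; the auth annotator below is hypothetical and only shows the shape, building on the gateway code from the previous snippet:

  // Hypothetical second annotator: copy the Authorization header from the
  // HTTP request into the gRPC metadata for backend services to consume.
  func injectAuthIntoMetadata(ctx context.Context, req *http.Request) metadata.MD {
  	if auth := req.Header.Get("Authorization"); auth != "" {
  		return metadata.Pairs("authorization", auth)
  	}
  	return metadata.MD{}
  }

  // Inside run(): both annotators chained when creating the gateway mux
  gwmux := runtime.NewServeMux(
  	runtime.WithMetadata(chainGrpcAnnotators(
  		injectHeadersIntoMetadata,
  		injectAuthIntoMetadata,
  	)),
  )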