Distributed Tracing, Istio and Your Applications

In the microservices world, distributed tracing is slowly becoming the most important tool for debugging and understanding your application dependencies. During my recent conversations at meetups and conferences, I found there was a lot of interest in how distributed tracing works, but at the same time a fair amount of confusion about how tracing interacts with service meshes like Istio and Aspen Mesh. In particular, these questions came up frequently:

  • How does tracing work with Istio? What information is collected and reported in the spans?
  • Do I have to change my applications to benefit from distributed tracing in Istio?
  • If I am currently reporting spans in my application how will it interact with spans reported from Istio?

In this blog I am going to try and answer these questions. Before we get deeper into them, a quick bit of background on how I ended up writing tracing-related blogs. If you follow the Aspen Mesh blog you may have noticed I wrote two posts related to tracing: one on tracing requests to AWS services when using Istio, and a second on tracing gRPC applications with Istio.

We have a pretty small engineering team at Aspen Mesh and, as it goes in most startups, if you work frequently on a sub-system or component you quickly become (or are labeled) the resident expert. I added tracing to our microservices and integrated it with Istio in our AWS environment, and in that process uncovered various interesting interactions which I thought were worth sharing. Over the last few months we have been using tracing very heavily to gain an understanding of our microservices, and it has become the first place we look when things break. With that, let's move on to answering the questions above.

How does tracing work with Istio?

Istio injects a sidecar proxy (Envoy) in the Pod in which your application container is running. This sidecar proxy transparently intercepts (iptables magic) all network traffic going in and out of your application. Because of this interception the sidecar proxy is in a unique position to automatically trace all network requests (HTTP/1.1, HTTP/2.0 & gRPC).

Let's see what changes the sidecar proxy makes to an incoming request to a Pod from a client (external or another microservice). From this point on I'm going to assume tracing headers are in Zipkin format for simplicity.

  • If the incoming request doesn't have any tracing headers, the sidecar proxy creates a root span (a span where the trace, parent and span IDs are all the same) before passing the request to the application container in the same Pod.
  • If the incoming request has tracing information (which should be the case if you're using Istio ingress or your microservice is being called from another microservice with a sidecar proxy injected), the sidecar proxy extracts the span context from these headers and creates a new sibling span (same trace, span and parent ID as the incoming headers) before passing the request to the application container in the same Pod.

In the reverse direction, when the application container makes outbound requests (to external services or services in the cluster), the sidecar proxy in the Pod performs the following actions before making the request to the upstream service:

  • If no tracing headers are present, the sidecar proxy creates a root span and injects the span context as tracing headers into the new request.
  • If tracing headers are present, the sidecar proxy extracts the span context from the headers and creates a child span from this context. The new context is propagated as tracing headers in the request to the upstream service.

Based on the above explanation, note that for every hop in your microservice chain you will get two spans reported from Istio: one from the client sidecar (span.kind set to client) and one from the server sidecar (span.kind set to server). All the spans created by the sidecars are automatically reported to the configured tracing backends like Jaeger or Zipkin.

Next let's look at the information reported in the spans. The spans contain the following information:

  • x-request-id: Reported as guid:x-request-id which is very useful in correlating access logs with spans.
  • upstream cluster: The upstream service to which the request is being made. If the span is tracking an incoming request to a Pod this is typically set to in.<name>. If the span is tracking an outbound request this is set to out.<name>.
  • HTTP metadata: The following HTTP attributes are reported when available:
    • URL
    • Method
    • User agent
    • Protocol
    • Request size
    • Response size
    • Response Flags
  • Start and end times for each span.
  • Tracing metadata: This includes the trace ID, span ID and the span kind (client or server). Apart from these the operation name is also reported for every span. The operation name is set to the configured virtual service (or route rule in v1alpha1) which affected the route or "default-route" if the default route was chosen. This is very useful in understanding which Istio route configuration is in effect for a span.

With that let's move on to the second question.

Do I have to change my application to benefit from tracing in Istio?

Yes, you will need to add logic in your application to propagate tracing headers from incoming to outgoing requests to gain full benefit from Istio's distributed tracing.

If the application container makes a new outbound request in the context of an incoming request and doesn't propagate the tracing headers from the incoming request, the sidecar proxy creates a root span for the outbound request. This means you will always see traces with only two microservices. On the other hand if the application container does propagate the tracing headers from incoming to outgoing requests, the sidecar proxy will create child spans as described above. Creation of the child spans gives you the ability to understand dependencies across multiple microservices.

There are a couple of options for propagating tracing headers in your application.

  1. Look for the tracing headers as mentioned in the Istio docs and copy them from incoming to outgoing requests (see the sketch after the next snippet). This method is simple and works in almost all cases. However, it has a major drawback: you cannot add custom tags (like user information) to the spans, and you cannot create child spans for events in the application which you might want to report. As you are simply transferring headers without understanding the span formats or contexts, there is limited ability to add application-specific information.
  2. The second method is setting up a tracing client in your application and using the OpenTracing APIs to propagate tracing headers from incoming to outgoing requests. I have created a sample tracing-go package which provides an easy way to set up jaeger-client-go in your applications in a way that is compatible with Istio. A snippet like the following should be called from the main function of your application (defer closing the returned tracer so buffered spans are flushed on exit):
       import (
         "io"
         "log"

         "github.com/spf13/viper"

         "github.com/aspenmesh/tracing-go"
       )

       // setupTracing configures tracing and returns a closer, which should be
       // closed (e.g. deferred in main) when the application exits. It returns
       // nil if tracing is not enabled.
       func setupTracing() io.Closer {
         // Configure tracing
         tOpts := &tracing.Options{
           ZipkinURL:     viper.GetString("trace_zipkin_url"),
           JaegerURL:     viper.GetString("trace_jaeger_url"),
           LogTraceSpans: viper.GetBool("trace_log_spans"),
         }
         if err := tOpts.Validate(); err != nil {
           log.Fatal("Invalid options for tracing: ", err)
         }
         if !tOpts.TracingEnabled() {
           return nil
         }
         tracer, err := tracing.Configure("myapp", tOpts)
         if err != nil {
           log.Fatal("Failed to configure tracing: ", err)
         }
         return tracer
       }
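
For completeness, here is what the first option (plain header copying) can look like. This is a minimal sketch: the header list is taken from the Istio docs and the forwardTracingHeaders helper is just illustrative.

    import "net/http"

    // Headers used by Istio/Envoy for trace context propagation, as listed in
    // the Istio distributed tracing docs.
    var tracingHeaders = []string{
      "x-request-id",
      "x-b3-traceid",
      "x-b3-spanid",
      "x-b3-parentspanid",
      "x-b3-sampled",
      "x-b3-flags",
      "x-ot-span-context",
    }

    // forwardTracingHeaders copies the tracing headers (when present) from an
    // incoming request to an outgoing one.
    func forwardTracingHeaders(incoming *http.Request, outgoing *http.Request) {
      for _, h := range tracingHeaders {
        if v := incoming.Header.Get(h); v != "" {
          outgoing.Header.Set(h, v)
        }
      }
    }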

The key point to note is that the tracing-go package sets the OpenTracing global tracer to the Jaeger tracer. This enables me to use the OpenTracing APIs for propagating headers from incoming to outgoing requests like this:

    import (
      "net/http"

      "golang.org/x/net/context/ctxhttp"
      ot "github.com/opentracing/opentracing-go"
    )

    func injectTracingHeaders(incomingReq *http.Request, addr string) {
      if span := ot.SpanFromContext(incomingReq.Context()); span != nil {
        outgoingReq, _ := http.NewRequest("GET", addr, nil)
        // Inject the current span context into the outgoing request's headers
        ot.GlobalTracer().Inject(
          span.Context(),
          ot.HTTPHeaders,
          ot.HTTPHeadersCarrier(outgoingReq.Header))

        resp, err := ctxhttp.Do(incomingReq.Context(), nil, outgoingReq)
        if err != nil {
          return
        }
        defer resp.Body.Close()
        // Do something with resp
      }
    }

You can also use the OpenTracing APIs to set span tags or create child spans from the tracing context added by Istio, like this:

   func SetSpanTag(incomingReq *http.Request, key string, value interface{}) {
     if span := ot.SpanFromContext(incomingReq.Context()); span != nil {
       span.SetTag(key, value)
     }
   }

An additional benefit of this approach is that you don't have to deal with tracing headers directly; the tracer (in this case Jaeger) handles that for you. I strongly recommend this approach, as it sets the foundation in your application for adding enhanced tracing capabilities without much overhead.

Now let's move on to the third question.

How do spans reported by Istio interact with spans created by applications?

If you want the spans reported by your application to be child spans of the tracing context added by Istio, use the OpenTracing API StartSpanFromContext instead of StartSpan. StartSpanFromContext creates a child span from the parent context if one is present, and otherwise creates a root span.
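
For example, here is a minimal sketch, assuming (as in the earlier snippets) that middleware has already extracted the incoming span into the request's context; handleRequest, queryDatabase and the x-user header are placeholders of mine:

    import (
      "context"
      "net/http"

      ot "github.com/opentracing/opentracing-go"
    )

    func handleRequest(w http.ResponseWriter, r *http.Request) {
      // Creates a child of the span context extracted from the Istio-propagated
      // headers (if present); otherwise starts a new root span.
      span, ctx := ot.StartSpanFromContext(r.Context(), "query-database")
      defer span.Finish()
      span.SetTag("user", r.Header.Get("x-user")) // "x-user" is a made-up header
      queryDatabase(ctx) // the child span travels to callees via ctx
    }

    func queryDatabase(ctx context.Context) {
      // Database access elided in this sketch; callees can create further
      // child spans from ctx with ot.StartSpanFromContext.
      _ = ctx
    }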

Note that in all the examples above I have used the OpenTracing Go APIs, but you should be able to use any tracing client library written in the same language as your application, as long as it is OpenTracing-compatible.


Using AWS Services from Istio Service Mesh with Go

This is a quick blog on how we use AWS services from inside of an Istio Service Mesh. Why does it matter that you’re inside the mesh? Because the service mesh wants to manage all the traffic in/out of your application. This means it needs to be able to inspect the traffic and parse it if it is HTTP. Nothing too fancy here, just writing it down in case it can save you a few keystrokes.

Our example is for programs written in Go.

Step 1: Define an Egress Rule

You need to create an egress rule to allow the application to talk to the AWS service at all. Here's an example egress rule to allow DynamoDB.
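
This is a minimal sketch, assuming the older EgressRule API that was current when this was written (newer Istio releases model external services with ServiceEntry instead); the rule name and region are placeholders:

  apiVersion: config.istio.io/v1alpha2
  kind: EgressRule
  metadata:
    name: dynamodb-egress-rule
  spec:
    destination:
      service: dynamodb.us-west-2.amazonaws.com
    ports:
      - port: 443
        protocol: https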

Step 2: Delegate Encryption to the Sidecar

This part is the trick we were missing. If you want to get maximum service mesh benefits, you need to pass unencrypted traffic to the sidecar. The sidecar will inspect it, apply policy and encrypt it before egressing to the AWS service (in our case Dynamo).

Don’t worry, your traffic is not going out on any real wires unencrypted. Only the loopback wire from your app container to the sidecar. In Kubernetes, this is its own network namespace so even other containers on the same system cannot see it unencrypted.

package awswrapper

import (
  "net/http"

  "github.com/aws/aws-sdk-go/aws"
  "github.com/aws/aws-sdk-go/aws/endpoints"
  "github.com/aws/aws-sdk-go/aws/request"
  "github.com/aws/aws-sdk-go/aws/session"
  "github.com/golang/glog"

  "github.com/you/repo/pkg/tracing"
)

type Config struct {
  InMesh   bool
  Endpoint string // e.g. http://dynamodb.us-west-2.amazonaws.com
  Label    string // Used in logging messages to identify the caller
}

// istioEgressEPResolver resolves the default AWS endpoint and appends :443 so
// that the (plain HTTP) request still matches the 443 egress rule and the
// sidecar originates TLS.
func istioEgressEPResolver(service, region string, optFns ...func(*endpoints.Options)) (endpoints.ResolvedEndpoint, error) {
  ep, err := endpoints.DefaultResolver().EndpointFor(service, region, optFns...)
  if err != nil {
    return ep, err
  }
  ep.URL = ep.URL + ":443"
  return ep, nil
}

func AwsConfig(cfg Config) *aws.Config {
  config := aws.NewConfig().
    WithEndpoint(cfg.Endpoint)
  if cfg.InMesh {
    glog.Infof("Using http for AWS for %s", cfg.Label)
    config = config.WithDisableSSL(true).
      WithEndpointResolver(endpoints.ResolverFunc(istioEgressEPResolver))
  }
  return config
}

func AwsSession(label string, cfg *aws.Config) (*session.Session, error) {
  sess, err := session.NewSession(cfg)
  if err != nil {
    return nil, err
  }
  // addTracingHeaders and removeTracingHeaders (which use net/http and the
  // tracing package imported above) share this file but are not shown here;
  // see the tracing post referenced at the end.
  //
  // This has to be the first handler, before core.SendHandler which performs
  // the operation of sending the request over the wire. The Send handlers are
  // invoked after the signing of the request is completed, which means the
  // tracing headers are not signed. Signing the tracing headers causes request
  // failures because Istio changes these headers and signature validation then
  // fails.
  sess.Handlers.Send.PushFront(addTracingHeaders)
  sess.Handlers.Send.PushBack(func(r *request.Request) {
    glog.V(6).Infof("%s: %s %s://%s%s",
      label,
      r.HTTPRequest.Method,
      r.HTTPRequest.URL.Scheme,
      r.HTTPRequest.URL.Host,
      r.HTTPRequest.URL.Path,
    )
  })
  // This handler is added after core.SendHandler so that the tracing headers
  // can be removed after the request is sent. This is required for retries:
  // the request is signed again, and if its headers still contained tracing
  // headers the retry's signature validation would fail once Istio updates
  // those headers.
  sess.Handlers.Send.PushBack(removeTracingHeaders)
  return sess, nil
}

AwsConfig() is the core - it builds the aws.Config with these options that you then use to make a new aws.Session.

The first option, WithDisableSSL(true), tells the AWS libraries to not use HTTPS and instead just speak plain HTTP. This is very bad if you are not in the mesh. But, since we are in the mesh, we’re only going to speak plain HTTP over to the sidecar, which will convert HTTP into HTTPS and send it out over the wire. In Kubernetes, the sidecar is in an isolated network namespace with your app pod, so there’s no chance for other pods or processes to snoop this plaintext traffic.

When you set the first option, the library will try to talk to http://dynamodb.us-west-2.amazonaws.com on port 80 (hey, you asked it to disable SSL). But that's not what we want - we want to act like we're talking to port 443 so that the right egress rule gets invoked and the sidecar encrypts traffic. That's what istioEgressEPResolver is for.

We do it this way for a little bit of belts-and-suspenders safety - we really want to avoid ever accidentally speaking HTTP to dynamo. Here are the various failure scenarios:

  • Our service is in Istio, and the user properly configured InMesh=true: everything works and is HTTPS via the sidecar.
  • Our service is not in Istio, and the user properly configured InMesh=false: everything works and is HTTPS via the AWS go library.
  • Our service is not in Istio, but oops! the user set InMesh=true: the initial request goes out to dynamo on port 443 as plain HTTP. Dynamo rejects it, so we know it’s broken before sending a bunch of data via plain HTTP.
  • Our service is in Istio, but oops! the user set InMesh=false: the sidecar rejects the traffic as it is already-encrypted HTTPS that it can’t make any sense of.

OK, now you’ve got an aws.Session instance ready to go. Pass it to your favorite AWS service interface and go:

import (
  "github.com/aws/aws-sdk-go/aws"
  "github.com/aws/aws-sdk-go/aws/session"
  "github.com/aws/aws-sdk-go/service/dynamodb"

  "github.com/you/repo/pkg/awswrapper"
)

type Dynamo struct {
  Session *session.Session
  Db      *dynamodb.DynamoDB
}

func NewWithConfig(cfg *aws.Config) (*Dynamo, error) {
  sess, err := awswrapper.AwsSession("Test", cfg)
  if err != nil {
    return nil, err
  }
  dyn := &Dynamo{
    Session: sess,
    Db:      dynamodb.New(sess),
  }
  return dyn, nil
}
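
Putting the pieces together, constructing the client might look something like the following sketch; the endpoint, Label and InMesh values are illustrative, and the AWS region and credentials are assumed to come from the usual environment configuration:

func newDynamoClient() (*Dynamo, error) {
  // InMesh should be true when the Istio sidecar is injected next to this
  // workload, and false when running outside the mesh.
  cfg := awswrapper.AwsConfig(awswrapper.Config{
    InMesh:   true,
    Endpoint: "http://dynamodb.us-west-2.amazonaws.com",
    Label:    "dynamo",
  })
  return NewWithConfig(cfg)
}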

p.s. What’s up with addTracingHeaders() and removeTracingHeaders()? Check out Neeraj’s post. While you’re at it, you can add just a few more lines and get great end-to-end distributed tracing.


Distributed tracing with Istio in AWS

Everybody loves tracing! Am I right? If you attended KubeCon (my bad, CloudNativeCon!) 2017 in Austin or saw any of the keynotes posted online, you would have noticed the recurring theme explaining the benefits of tracing, especially as it relates to DevOps tools. Istio and service mesh were hot topics, and many sessions discussed how Istio provides distributed tracing out of the box, making it easier for application developers to integrate tracing into their systems.

Indeed, a great benefit of using a service mesh is getting more visibility into and understanding of your applications. Since this is a tech post (I remember categorizing it as such) let's dig deeper into how Istio provides application tracing.

When using Istio, a sidecar envoy proxy is automatically injected next to your applications (in Kubernetes this means adding containers to the application Pod). This sidecar proxy intercepts all traffic and can add/augment tracing headers to the requests entering/leaving the application container. Additionally, the sidecar proxy also handles asynchronous reporting of spans to the tracing backends like Jaeger, Zipkin, etc. Sounds pretty awesome!

One thing that the applications do need to implement is propagating tracing headers from incoming to outgoing requests, as mentioned in this Istio guide. Simple enough, right? Well, it's about to get interesting.

Before we proceed further, a little background on why I'm writing this blog. We here at Aspen Mesh offer a supported enterprise service mesh built on open source Istio. Not only do we offer a service mesh product, but we also use it in our production SaaS platform hosted in AWS (isn't that something?).

I was tasked with propagating tracing headers in our applications so that we get nice hierarchical traces graphing the relationship between our microservices. As we are hosted in AWS, many of our microservices make outgoing requests to AWS services. During this exercise, I found some interesting interactions between adding tracing headers and using Istio with AWS services, so I decided to share my experience. This blog describes the various iterations I went through to get it all working together.

The application in question for this post is a simple web server. When it receives an HTTP request it makes an outbound DynamoDB query to fetch an item. As it is deployed in the Istio service mesh, the sidecar proxy automatically adds tracing headers to the incoming request. I wanted to propagate the tracing headers from the incoming request to the DynamoDB query request so that all the traces get tied together.

First Iteration

In order to achieve this I decided to pass a custom function as a request option to the AWS DynamoDB API, which allows you to augment request headers before they are transmitted over the wire. In the snippet below I'm using the AWS Go SDK's dynamo.GetItemWithContext for fetching an item and passing AddTracingHeaders as the request.Option. Note that the AddTracingHeaders method uses the standard OpenTracing API for injecting headers from an input context.

func AddTracingHeaders() awsrequest.Option {
  return func(req *awsrequest.Request) {
    if span := ot.SpanFromContext(req.Context()); span != nil {
      ot.GlobalTracer().Inject(
        span.Context(),
        ot.HTTPHeaders,
        ot.HTTPHeadersCarrier(req.HTTPRequest.Header))
    }
  }
}

// ctx is the incoming request's context as received from the mesh
func makeDynamoQuery(ctx context.Context) {
  // Note that AddTracingHeaders is passed as an awsrequest.Option
  result, err := dynamo.GetItemWithContext(ctx, ..., AddTracingHeaders())
  // Do something with result and err
}

Ok, time for testing this solution! The new version compiles, and I verified locally that it is able to fetch items from DynamoDB. After deploying the new version in production with Istio (sidecar injected), I was hoping to see the traces nicely tied together. Indeed, the traces look much better, but wait: all of the responses from DynamoDB are now HTTP status code 400. Bummer!

Looking at the error messages from aws-go-sdk, we are getting AccessDeniedException, which according to the AWS docs indicates that the signature is not valid. Adding tracing headers seems to have broken signature validation, which is odd yet interesting: I had tested in my dev environment (without the sidecar proxy) and the DynamoDB requests worked fine, but in production they stopped working. Typical developer nightmare!

Digging into the AWS SDK package, I found that the client code signs every request, including headers, with a few hardcoded exceptions. The difference between the earlier and the new version is the addition of tracing headers to the request, which are now getting signed and then handed to the sidecar proxy. Istio's sidecar proxy (in this case Envoy) changes these tracing headers (as it should!) before sending the request to the DynamoDB service, which breaks signature validation at the server.

To get this fixed we need to ensure that the tracing headers are added after the request is signed but before it is sent out by the AWS sdk. This is getting more complicated, but still doable.

Second Iteration

I couldn’t find an easy way to whitelist the tracing headers and prevent them from getting signed. But the AWS session package provides a very flexible API for adding custom handlers which get invoked at various stages of the request lifecycle. Additionally, a session handler has the benefit of being added to all AWS service requests (not just DynamoDB) which use that session. Perfect!

Here’s the AddTracingHeaders method above added as a session handler:

sess, err := session.NewSession(cfg)

// Add AddTracingHeaders() as the first Send handler. This is important as one
// of the default Send handlers does the work of sending the request.
sess.Handlers.Send.PushFront(AddTracingHeaders())

This looks promising. Testing showed that the first request to the AWS DynamoDB service is successful (200 OK!). Traces look good too! We are getting somewhere; time to test some failure scenarios.

I added an Istio fault injection rule to return an HTTP 500 error on outgoing DynamoDB requests to exercise the AWS SDK's retry logic. Snap! We're receiving HTTP status code 400 with the AccessDeniedException error again on every retry.

Looking at the AWS request send logic, it appears that on retriable errors the code makes a copy of the previous request, signs it, and then invokes the Send handlers. This means that on retries the previously added tracing headers get signed again (i.e. the earlier problem is back, hence the 400s) and then the AddTracingHeaders handler adds back the tracing headers.

Now that we understand the issue, the solution we came up with is to add the tracing headers after the request is signed and before it is sent out, just like the earlier implementation. In addition, to make retries work we now need to remove these headers after the request is sent, so that the re-signing and re-invocation of AddTracingHeaders are handled correctly.

Final Iteration

Here’s what the final working version looks like:

func injectFromContextIntoHeader(ctx context.Context, header http.Header) {
  if span := ot.SpanFromContext(ctx); span != nil {
    ot.GlobalTracer().Inject(
      span.Context(),
      ot.HTTPHeaders,
      ot.HTTPHeadersCarrier(header))
  }
}

func AddTracingHeaders() awsrequest.Option {
  return func(req *awsrequest.Request) {
    injectFromContextIntoHeader(req.Context(), req.HTTPRequest.Header)
  }
}

// This is a bit odd: inject tracing headers into an empty header map so that
// we can remove them from the request.
func RemoveTracingHeaders(req *awsrequest.Request) {
  header := http.Header{}
  injectFromContextIntoHeader(req.Context(), header)
  for k := range header {
    req.HTTPRequest.Header.Del(k)
  }
}

sess, err := session.NewSession(cfg)

// Add AddTracingHeaders() as the first Send handler.
sess.Handlers.Send.PushFront(AddTracingHeaders())

// PushBack is used here so that RemoveTracingHeaders runs after the request
// has been sent.
sess.Handlers.Send.PushBack(RemoveTracingHeaders)

Agreed, the above solution looks far from elegant, but it does work. I hope this post helps if you are in a similar situation.

If you have a better solution feel free to reach out to me at neeraj@aspenmesh.io