Protocol Sniffing Service Mesh

Protocol Sniffing in Production

Istio 1.3 introduced a new capability to automatically sniff the protocol used when two containers communicate. This makes it easy to get started with Istio, but it comes with tradeoffs. Aspen Mesh recommends that production deployments of Aspen Mesh (built on Istio) do not use protocol sniffing, and Aspen Mesh 1.3.3-am2 turns off protocol sniffing by default. This blog explains the tradeoffs and why we think turning off protocol sniffing is the better choice.

What Protocol Sniffing Is

Protocol sniffing predates Istio. For our purposes, we're going to define it as examining a communication stream and classifying it as implementing one protocol (like HTTP) or another (like SSH), without additional information. For example, here are two streams from client to server; if you've ever debugged these protocols, you won't have a hard time telling them apart:
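
For illustration (the hostnames and software versions here are just placeholders), the opening bytes of the two streams might look like this:

  Stream 1 (HTTP):
    GET /index.html HTTP/1.1
    Host: example.com
    User-Agent: curl/7.58.0

  Stream 2 (SSH):
    SSH-2.0-OpenSSH_7.9
    ...binary key-exchange data follows...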


In an active service mesh, the Envoy sidecars will be handling thousands of these streams a second.  The sidecar is a proxy, so it reads every byte in the stream from one side, examines it, applies policy to it and then sends it on.  In order to apply proper policy ("Send all PUTs to /create_* to create-handler.foo.svc.cluster.local"), Envoy needs to understand the bytes it is reading.  Without protocol sniffing, that's done by configuring Envoy:

  • Layer 7 (HTTP): "All streams with a destination port of 8000 are going to follow the HTTP protocol"
  • Layer 4 (SSH): "All streams with a destination port of 22 are going to follow the SSH protocol"

When Envoy sees a stream with destination port 8000, it reads each byte and runs its own HTTP protocol implementation to understand those bytes and then apply policy.  Port 22 has SSH traffic; Envoy doesn't have an SSH protocol implementation so Envoy treats it as opaque TCP traffic. In proxies this is often called "Layer 4 mode" or "TCP mode"; this is when the proxy doesn't understand the higher-level protocol inside, so it can only apply a simpler subset of policy or collect a subset of telemetry.

For instance, Envoy can tell you how many bytes went over the SSH stream, but it can't tell you anything about whether those bytes indicated a successful SSH session or not.  But since Envoy can understand HTTP, it can say "90% of HTTP requests are successful and get a 200 OK response".

Here's an analogy - I speak English but not Italian; however, I can read and write the Latin alphabet that covers both.  So I could copy an Italian message from one piece of paper to another without understanding what's inside. Suppose I was your proxy and you said, "Andrew, copy all mail correspondence into email for me" - I could do that whether you received letters from English-speaking friends or Italian-speaking ones.  Now suppose you say, "Copy all mail correspondence into email unless it has Game of Thrones spoilers in it."  I can detect spoilers in English correspondence because I actually understand what's being said, but not Italian, where I can only copy the letters from one page to the other.

If I were a proxy, I'm a layer 7 English proxy but I only support Italian in layer 4 mode.

In Aspen Mesh and Istio, the protocol for a stream is configured in the Kubernetes service. These are the options (an example Service follows the list):

  • Specify a layer 7 protocol: Start the name of the service with a layer 7 protocol that Istio and Envoy understand, for example "http-foo" or "grpc-bar".
  • Specify a layer 4 protocol: Start the name of the service with a layer 4 protocol, for example "tcp-foo".  (You also use this if you know the layer 7 protocol but it's not one that Istio and Envoy support; for example, you might name a port "tcp-ssh")
  • Don't specify protocol at all: Name it without a protocol prefix, e.g. "clients".
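
As a sketch (the service name, ports and prefixes here are hypothetical), a single Service using all three conventions might look like this:

  apiVersion: v1
  kind: Service
  metadata:
    name: foo
  spec:
    selector:
      app: foo
    ports:
    - name: http-api     # layer 7: Envoy parses this traffic as HTTP
      port: 8000
    - name: tcp-ssh      # layer 4: known protocol, but treated as opaque TCP
      port: 22
    - name: clients      # no prefix: Istio must choose (sniff, or fall back to layer 4)
      port: 9000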

If you don't specify a protocol at all, then Istio has to make a choice.  Before protocol sniffing was a feature, Istio chose to treat this with layer 4 mode.  Protocol sniffing is a new behavior that says, "try reading some of it - if it looks like a protocol you know, treat it like that protocol".

An important note here is that this sniffing applies for both passive monitoring and active management.  Istio both collects metrics and applies routing and policy. This is important because if a passive system has a sniff failure, it results only in a degradation of monitoring - details for a request may be unavailable.  But if an active system has a sniff failure, it may misapply routing or policy; it could send a request to the wrong service.

Benefits of Protocol Sniffing

The biggest benefit of protocol sniffing is that you don't have to specify the protocols.  Any communication stream can be sniffed without human intervention. If it happens to be HTTP, you can get detailed metrics on it.

That removes a significant amount of configuration burden and reduces time-to-value for your first service mesh install.  Drop it in and instantly get HTTP metrics.

Protocol Sniffing Failure Modes

However, as with most things, there is a tradeoff.  In some cases, protocol sniffing can produce results that might surprise you.  This happens when the sniffer classifies a stream differently than you or some other system would.

False Positive Match

This occurs when a protocol happens to look like HTTP, but the administrator doesn't want it to be treated by the proxy as HTTP.

One way this can happen is if the apps are speaking some custom protocol where the beginning of the communication stream looks like HTTP, but it later diverges.  Once it diverges and is no longer conforming to HTTP, the proxy has already begun treating it as HTTP and now must terminate the connection. This is one of the differences between passive sniffers and active sniffers - a passive sniffer could simply "cancel" sniffing.

Behavior Change:

  • Without sniffing: Stream is considered Layer 4 and completes fine.
  • With sniffing: Stream is considered Layer 7, and then when it later diverges, the proxy closes the stream.

False Negative Match

This occurs when the client and server think they are speaking HTTP, but the sniffer decides it isn't HTTP.  In our case, that means the sniffer downgrades to Layer 4 mode. The proxy no longer applies Layer 7 policy (like Istio's HTTP Authorization) or collects Layer 7 telemetry (like request success/failure counts).

One case where this occurs is when the client and server are both technically violating a specification, but in a way that they both understand.  A classic example in the HTTP space is line termination: technically, lines in HTTP must be terminated with CRLF, the two characters 0x0d 0x0a.  But most proxies and web servers will also accept HTTP where lines are only terminated with LF (just the 0x0a character), because some ancient clients and hacked-together UNIX tools just sent LFs.

That example is usually harmless but a riskier one is if a client can speak something that looks like HTTP, that the server will treat as HTTP, but the sniffer will downgrade.  This allows the client to bypass any Layer 7 policies the proxy would enforce. Istio currently applies sniffing to outbound traffic where the outbound target is unknown (often occurs for Egress traffic) or the outbound target is a service port without a protocol annotation.

Here's an example: I know of two non-standard behaviors that node.js' HTTP server framework allows.  The first is allowing extra spaces between the Request-URI and the HTTP-Version in the Request-Line. The second is allowing spaces in a Header field-name.  Here's an example with the weird parts highlighted:
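
Something along these lines, where the weird parts are the extra spaces between the Request-URI and the HTTP-Version and the space inside the header field-name (the host and values are placeholders):

  GET /index.html      HTTP/1.1
  Host: example.com
  x-foo   bar: baz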

If I send this to a node.js server, it accepts it as a valid HTTP request (for the curious, the extra whitespace in the request line is dropped, and the whitespace in the Header field-name is included so the header is named "x-foo   bar"). Node.js' HTTP parser is taken from nginx which also accepts the extra spaces. Nginx is pretty darn popular so other web frameworks and a lot of servers accept this. Interestingly, so does the HTTP parser in Envoy (but not the HTTP inspector).

Suppose I have a situation like this:  We just added a new capability to delete in-progress orders to the beta version of our service, so we want all DELETE requests to be routed to "foo-beta" and all other normal requests routed to "foo".  We might write an Istio VirtualService to route DELETE requests like this:
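
A sketch of such a VirtualService, assuming the beta and stable versions are exposed as services named foo-beta and foo:

  apiVersion: networking.istio.io/v1alpha3
  kind: VirtualService
  metadata:
    name: foo
  spec:
    hosts:
    - foo
    http:
    - match:
      - method:
          exact: DELETE
      route:
      - destination:
          host: foo-beta
    - route:
      - destination:
          host: foo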

If I send a well-formed DELETE request, it is properly routed to foo-beta.
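
For instance, something like this (the path is just a placeholder):

  DELETE /orders/12345 HTTP/1.1
  Host: foo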

But if I send a request that exploits the parser quirks described above, I bypass the route and go to foo.  Oops!
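
A request shaped like the one below, with the non-standard extra spaces in the request line, could be classified as non-HTTP by the sniffer (so it falls back to Layer 4 and skips the DELETE route) while still being parsed as a DELETE by the server:

  DELETE  /orders/12345      HTTP/1.1
  Host: foo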

This means that clients can choose to "step around" routing if they can find requests that trick the sniffer. If those requests aren't accepted by the server at the other end, it should be OK.  However, if they are accepted by the server, bad things can happen. Additionally, you won't be able to audit or detect this case because you won't have Jaeger traces or access logs from the proxy since it thought the request wasn't HTTP.

(We investigated this particular case and ran our results past the Envoy and Istio security vulnerability teams before publishing this blog. While it didn't rise to the level of security issue, we want it to be obvious to our users what the tradeoffs are. While the benefits of protocol sniffing may be worthwhile in many cases, most users will want to avoid protocol sniffing in security-sensitive applications.)

Behavior Change:

  • Without sniffing: Stream is Layer 7 and invalid requests are consistently rejected.
  • With sniffing: Some streams may be classified as Layer 4 and bypass Layer 7 routing or policy.

Recommendation

Protocol sniffing lessens the configuration burden to get started with Istio, but creates uncertainty about behaviors.  Because this uncertainty can be controlled by the client, it can be surprising or potentially hazardous. In production, I'd prefer to tell the proxy everything I know and have the proxy reject everything that doesn't look as expected.  Personally, I like my test environments to look like my prod environments ("Test Like You Fly") so I'm going to also avoid sniffing in test.

I would use protocol sniffing when I first dropped a service mesh into an evaluation scenario, when I'm at the stage of, "Let's kick the tires and see what this thing can tell me about my environment."

For this reason, Aspen Mesh recommends users don't rely on protocol sniffing in production.  All service ports should be declared with a name that specifies the protocol (things like "http-app" or "tcp-custom").  Our users will continue to receive "vet" warnings for service ports that don't comply, so they can be confident that their clusters will behave predictably.


Enterprise service mesh

Aspen Mesh Open Beta Makes Istio Enterprise-ready

As companies build modern applications, they are leveraging microservices to effectively build and manage them. As they do, they realize that they are increasing complexity and de-centralizing ownership and control. These new challenges require a new way to monitor, manage and control microservice-based applications at runtime.

A service mesh is an emerging pattern that is helping ensure resiliency and uptime - a way to more effectively monitor, control and secure the modern application at runtime. Companies are adopting Istio as their service mesh of choice as it provides a toolbox of different features that address various microservices challenges. Istio provides a solution to many challenges but leaves some critical enterprise challenges on the table. Enterprises require additional features that address observability, policy and security. With this in mind, we have built new enterprise features into a platform that runs on top of Istio, to provide all the functionality and flexibility of open source, plus features, support and guarantees needed to power enterprise applications.

At KubeCon North America 2018, Aspen Mesh announced open beta. With Aspen Mesh you get all the features of Istio, plus:

Advanced Policy Features
Aspen Mesh provides RBAC capabilities you don’t get with Istio.

Configuration Vets
Istio Vet (an Aspen Mesh open source contribution to help ensure correct configuration of your mesh) is built into Aspen Mesh and you get additional features as part of Aspen Mesh that don’t come with open source Istio Vet.

Analytics and Alerting
The Aspen Mesh platform provides insights into key metrics (latency, error rates, mTLS status) and immediate alerts so you can take action to minimize MTTD/MTTR.

Multi-cluster/Multi-cloud
See multiple clusters that live in different clouds, and what's going on across your microservice architecture, through a single pane of glass.

Canary Deploys
Aspen Mesh Experiments lets you quickly test new versions of microservices so you can qualify new versions in a production environment without disrupting users.

An Intuitive UI
Get at-a-glance views of performance and security posture as well as the ability to see service details.

Full Support
Our team of Istio experts makes it easy to get exactly what you need out of service mesh. 

You can take advantage of these features for free by signing up for Aspen Mesh Beta access.


Technology

Why You Should Care About Istio Gateways

If you’re breaking apart the monolith, one of the huge advantages to using Istio to manage your microservices is that it enables a configuration that leverages an ingress model that is similar to traditional load balancers and application delivery controllers. In the load balancers world, Virtual IPs and Virtual Servers have long been used as concepts that enable operators to configure ingress traffic in a flexible and scalable manner (Lori Macvittie has some relevant thoughts on this).

In Istio, Gateways control the exposure of services at the edge of the mesh. Gateways allow operators to specify L4-L6 settings like port and TLS settings. For L7 settings of the ingress traffic, Istio allows you to tie Gateways to VirtualServices. This separation makes it easy to manage traffic flow into the mesh in much the same way you would tie Virtual IPs to Virtual Servers in traditional load balancers. This enables users of legacy technologies to migrate to microservices in a seamless way. It's a natural progression for teams that are used to monoliths and edge load balancers rather than an entirely new way to think about networking configuration.
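
As a sketch of that separation (the hostnames, ports, cert paths and service names below are hypothetical), a Gateway owns the L4-L6 settings and a VirtualService bound to it owns the L7 routing:

  apiVersion: networking.istio.io/v1alpha3
  kind: Gateway
  metadata:
    name: myapp-gateway
  spec:
    selector:
      istio: ingressgateway     # run on Istio's default ingress gateway
    servers:
    - port:
        number: 443
        name: https
        protocol: HTTPS
      tls:
        mode: SIMPLE
        serverCertificate: /etc/certs/server.pem
        privateKey: /etc/certs/key.pem
      hosts:
      - "myapp.example.com"
  ---
  apiVersion: networking.istio.io/v1alpha3
  kind: VirtualService
  metadata:
    name: myapp
  spec:
    hosts:
    - "myapp.example.com"
    gateways:
    - myapp-gateway
    http:
    - match:
      - uri:
          prefix: /api
      route:
      - destination:
          host: myapp
          port:
            number: 8080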

One important thing to note is that routing traffic within a service mesh and getting external traffic into the mesh are a bit different. Within the mesh, you specify exceptions from normal traffic, since Istio by default (for compatibility with Kubernetes) allows everything to talk to everything once inside the mesh. It's necessary to add a policy if you don't want certain services communicating. Getting traffic into the mesh is a reverse proxy situation (similar to traditional load balancers) where you have to specify exactly what you want allowed in.

Earlier versions of Istio leveraged the Kubernetes Ingress resource, but Istio's v1alpha3 APIs leverage Gateways for richer functionality, as Kubernetes Ingress has proven insufficient for Istio applications. The Kubernetes Ingress API merges the specification for L4-L6 and L7, which makes it difficult for different teams in organizations with separate trust domains (like SecOps and NetOps) to own Ingress traffic management. Additionally, the Ingress API is less expressive than the routing capability that Istio provides with Envoy. The only way to do advanced routing in the Kubernetes Ingress API is to add annotations for different ingress controllers. Separate concerns and trust domains within an organization warrant the need for a more capable way to manage ingress, which is provided by Istio Gateways and VirtualServices.

Once the traffic is in the mesh, it's good to be able to provide a separation of concerns for VirtualServices so different teams can manage traffic routing for their services. L4-L6 specifications are generally something that SecOps or NetOps might be concerned with. L7 specifications are of most concern to cluster operators or application owners. So it's critical to have the right separation of concerns. As we believe in the power of team-based responsibilities, we think this is an important capability. And since we believe in the power of Istio, we are submitting an RFE in the Istio community which we hope will help enable ownership semantics for traffic management within the mesh.

We’re excited Istio has hit 1.0 and are excited to keep contributing to the project and community as it continues to evolve.

Originally published at https://thenewstack.io/why-you-should-care-about-istio-gateways/

Istio 0.8 Updates

It’s here! Istio 0.8 landed on May 31, 2018. It brought a slew of new features, stability and performance improvements, and new APIs. I thought it would be helpful to share some of our favorites and experience with them.

Pilot Scalability

The first big change we are excited about in 0.8 is related to pilot scalability. Before 0.8, whenever we used Istio in clusters with more than a dozen services and more than 40-50 pods, we started seeing catastrophically bad pilot performance. The earliest bug we were aware of around this was #2728, but there have been a bunch of different manifestations. The root cause of the pain is computing the new pilot configuration when something in the cluster changes (like a new service, or a pod going away), combined with the fact that every sidecar needed to poll pilot for changes. The new v1alpha3 config APIs were added to pilot using the new Envoy v2 config APIs, which use a push model (documented here and here).  And that work has paid off!

We were really happy when we deployed 0.8 in our test cluster and scaled services up and down, nuked and paved large apps, and did everything we could think of. If you previously tried Istio and got a bit of a queasy feeling around behaviors when you grew your service mesh beyond an example app, now might be the time to give it another shot. (Istio 1.0 isn't too far off, either.)

Sidecar Injection

Istio is built on the sidecar model - every pod in the mesh has a dataplane proxy running right alongside it to add the service mesh smarts. All traffic to and from the application container goes through the sidecar first. The application container should be unaware - the less the application has to be coupled to the sidecar, the better.

But something has to make sure the sidecar (and its friend, the init container) get started alongside the app. Previous versions of Istio let you inject manually or at the time that the Kubernetes Deployment was created. The drawback of Deployment-time injection comes when you want to upgrade your service mesh to run a newer version of the sidecar.  You either have to patch the Deployments in your Kubernetes cluster (“manually” injecting the upgrade), or delete and recreate them.

With 0.8, we can use a MutatingWebhook Pod Admission Controller - although the title is a mouthful, the end result is a pretty neat example of Kubernetes extensibility. Whenever Kubernetes is about to create a pod, it lets its pod admission controllers take a look and decide to allow, reject or allow-with-changes that pod. Every (properly authenticated and configured) admission controller gets a shot.  Istio provides one of these admission controllers that injects the current sidecar and init container. (It’s called a MutatingWebhook because it can mutate the pod to add the sidecar instead of only accept/reject, and it runs in some other service reachable via webhook aside from the Kubernetes API server.)

We like this because it makes it easier for us to upgrade our service mesh. We’re happiest if all of our pods are running the most current version of the sidecar and that version correlates to the release of the Istio control plane version we’ve deployed.  We don’t like to think about different pods using different (older) sidecars if we don’t have to. Now, once we install an upgraded control plane, we trigger some sort of rolling upgrade on the pods behind each service and they pick up the new sidecar.

There are multiple ways to say “Hey Istio, please inject a sidecar into this” or “Hey Istio, please leave this alone”. We use namespace annotations to turn it on and then pod metadata annotations to disable it for particular pods if needed.
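
In upstream Istio the usual mechanism is a label on the namespace to opt in and a pod-template annotation to opt individual pods out; the exact keys can vary by version and distribution, but a sketch looks like this:

  # Opt the namespace in to automatic sidecar injection
  apiVersion: v1
  kind: Namespace
  metadata:
    name: myapp
    labels:
      istio-injection: enabled
  ---
  # Opt one workload out via its pod template
  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: legacy-worker
    namespace: myapp
  spec:
    replicas: 1
    selector:
      matchLabels:
        app: legacy-worker
    template:
      metadata:
        labels:
          app: legacy-worker
        annotations:
          sidecar.istio.io/inject: "false"   # skip the sidecar for just these pods
      spec:
        containers:
        - name: worker
          image: legacy-worker:1.0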

v1alpha3 APIs and composability

Finally, a bit about new APIs. Istio 0.8 introduces a bunch of new Kubernetes resources for configuration. Before, we had what are called “v1alpha1” resources like RouteRules. Now, we have “v1alpha3” resources like DestinationPolicies and VirtualServices. 0.8 supports both v1alpha1 and v1alpha3 resources as a migration point from v1alpha1 to v1alpha3. The future is v1alpha3 config resources so we should all be porting any of our existing config.

We found this process to be not too difficult for our existing configs. Minor note - we all think of incoming traffic first but don’t forget to also port Egress config to ServiceEntries or opt-out IP ranges. Personally, even though it can be a bit surprising at first to have to configure external access, we really like the control and visibility we get from managing Egresses with ServiceEntries for production environments.

One of the big changes in the new v1alpha3 APIs is that you define the routing config for a service all in one place like the VirtualService custom resource.  Before, you could define this config in multiple RouteRules that were ordered by precedence. A downside of the multiple RouteRules approach is that if you want to model what’s going to happen in your head, you’ve got to read and sort different RouteRule resources. The new approach is definitely simpler, but the tradeoff for that simplicity is some difficulty if you actually want your config to come from multiple resources.

We think this kind of thing can happen when you have multiple teams contributing to a larger service mesh and the apps on it (think a Platform Ops team contributing “base” policy, a Platform Security team contributing some policy on top, and finally an individual app team fine-tuning their particular routes). We’re not sure what the solution is, but my colleague Neeraj started the conversation on the istio-dev list and we’re looking forward to seeing where it goes in the future.

Better external service support

It used to be the case that if you wanted a service in the mesh to communicate with a TLS service outside the mesh, you had to modify your service to speak HTTP over port 443 so that Istio could route it correctly. Now that Istio can use SNI to route traffic, you can leave your service alone and configure Istio to allow it to communicate with that external service by hostname. Since this is TLS passthrough, you don't get L7 visibility of the egress traffic, but since you don't need to modify your service, it allows you to add services to the mesh that you might not have been able to before.
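
For example, a ServiceEntry along these lines (the hostname is hypothetical, and depending on the Istio version the port protocol is written as HTTPS or TLS) lets in-mesh services reach an external TLS endpoint by hostname, with the TLS connection passed through and routed on SNI:

  apiVersion: networking.istio.io/v1alpha3
  kind: ServiceEntry
  metadata:
    name: external-payments-api
  spec:
    hosts:
    - payments.example.com
    ports:
    - number: 443
      name: https
      protocol: HTTPS    # TLS passthrough; Envoy routes on the SNI hostname
    resolution: DNS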

We see great potential with the added features and increased stability in 0.8. We’re looking forward to seeing what is in store with 1.0 and are excited to see how teams use Istio now that it seems ready for production deployments.


Tracing and Metrics: Getting the Most Out of Istio

Are you considering or using a service mesh to help manage your microservices infrastructure? If so, here are some basics on how a service mesh can help, the different architectural options, and tips and tricks on using some key CNCF tools that integrate well with Istio to get the most out of it.

The beauty of a service mesh is that it bundles so many capabilities together, freeing engineering teams from having to spend inordinate amounts of time managing microservices architectures. Kubernetes has solved many build and deploy challenges, but it is still time consuming and difficult to ensure reliability and security at runtime. A service mesh handles the difficult, error-prone parts of cross-service communication such as latency-aware load balancing, connection pooling, service-to-service encryption, instrumentation, and request-level routing.

Once you have decided a service mesh makes sense to help manage your microservices, the next step is deciding what service mesh to use. There are several architectural options, from the earliest model of a library approach, the node agent architecture, and the model which seems to be gaining the most traction – the sidecar model. We have also seen an evolution from data plane proxies like Envoy, to service meshes such as Istio which provide distributed control and data planes. We're active users of Istio, and believers in the sidecar architecture striking the right balance between a robust set of features and a lightweight footprint, so let’s take a look at how to get the most out of tracing and metrics with Istio.

Tracing

One of the capabilities Istio provides is distributed tracing. Tracing provides service dependency analysis for different microservices, and it tracks requests as they flow through multiple microservices. It's also a great way to identify performance bottlenecks and zoom into a particular request to determine things like which microservice contributed to the latency of a request or which service created an error.

We use and recommend Jaeger for tracing as it has several advantages:

  • OpenTracing compatible API
  • Flexible & scalable architecture
  • Multiple storage backends
  • Advanced sampling
  • Accepts Zipkin spans
  • Great UI
  • CNCF project and active OS community

Metrics

Another powerful thing you gain with Istio is the ability to collect metrics. Metrics are key to understanding historically what has happened in your applications, and when they were healthy compared to when they were not. A service mesh can gather telemetry data from across the mesh and produce consistent metrics for every hop. This makes it easier to quickly solve problems and build more resilient applications in the future.

We use and recommend Prometheus for gathering metrics for several reasons:

  • Pull model
  • Flexible query API
  • Efficient storage
  • Easy integration with Grafana
  • CNCF project and active OS community

We also use Cortex, which is a powerful tool to enhance Prometheus. Cortex provides:

  • Long term durable storage
  • Scalable Prometheus query API
  • Multi-tenancy

Check out this webinar for a deeper look into what you can do with these tools and more.


Tracing gRPC with Istio

At Aspen Mesh we love gRPC. Most of our public facing and many internal APIs use it. To give you a brief background in case you haven't heard about it (which would be difficult given gRPC's belle-of-the-ball status), it is a new, highly efficient and optimized Remote Procedure Call (RPC) framework. It is based on the battle-tested protocol buffers serialization format and the HTTP/2 network protocol.

Using the HTTP/2 protocol, gRPC applications can benefit from multiplexing requests, efficient connection utilization and a host of other enhancements over protocols like HTTP/1.1, which is very well documented here. Additionally, protocol buffers are an easy and extensible way of serializing structured data in binary format, which in itself gives you significant performance improvements over text-based formats. Combining these two results in a low-latency and highly scalable RPC framework, which is in essence what gRPC is. Additionally, the growing ecosystem gives you the ability to write your applications in many supported languages (C++, Java, Go, etc.) and an extensive set of third-party libraries to use.

Apart from the benefits I listed above, what I like most about gRPC is the simplicity and intuitiveness with which you can specify your RPCs (using the protobufs IDL) and how a client application can invoke methods on the server application as if it was a local function call. A lot of the code (service descriptions and handlers, client methods, etc.) gets auto generated for you making it very convenient to use.

Now that I have laid out some background, let’s turn our attention to the main topic of this blog. Here I’m going to cover how to add tracing in your applications built on gRPC, especially if you’re using Istio or Aspen Mesh.

Tracing is great for debugging and understanding your application’s behavior. The key to making sense of all the tracing data is being able to correlate spans from different microservices which are related to a single client request.

To achieve this, all microservices in your application should propagate tracing headers. If you’re using a service mesh like Istio or Aspen Mesh, the ingress and sidecar proxies automatically add the appropriate tracing headers and report the spans to the tracing collector backend like Jaeger or Zipkin. The only thing left for applications to do is propagate tracing headers from incoming requests (which sidecar or ingress proxy adds) to any outgoing requests it makes to other microservices.

Propagating Headers from gRPC to gRPC Requests

The easiest way to do tracing header propagation is to use the grpc opentracing middleware library’s client interceptors. This can be used if your application is making a new outbound request upon receiving the incoming request. Here’s the sample code to correctly propagate tracing headers from the incoming to outgoing request:

  import (
    "golang.org/x/net/context"

    "github.com/golang/glog"
    grpc_opentracing "github.com/grpc-ecosystem/go-grpc-middleware/tracing/opentracing"
    ot "github.com/opentracing/opentracing-go"
    "google.golang.org/grpc"
  )

  // ctx is the incoming gRPC request's context
  // addr is the address for the new outbound request
  func createGRPCConn(ctx context.Context, addr string) (*grpc.ClientConn, error) {
  	var opts []grpc.DialOption
  	opts = append(opts, grpc.WithStreamInterceptor(
  		grpc_opentracing.StreamClientInterceptor(
  			grpc_opentracing.WithTracer(ot.GlobalTracer()))))
  	opts = append(opts, grpc.WithUnaryInterceptor(
  		grpc_opentracing.UnaryClientInterceptor(
  			grpc_opentracing.WithTracer(ot.GlobalTracer()))))
  	conn, err := grpc.DialContext(ctx, addr, opts...)
  	if err != nil {
  		glog.Error("Failed to connect to application addr: ", err)
  		return nil, err
  	}
  	return conn, nil
  }

Pretty simple right?

Adding the opentracing client interceptors ensures that making any new unary or streaming gRPC request on the client connection injects the correct tracing headers. If the passed context has the tracing headers present (which should be the case if you are using Aspen Mesh or Istio and passing the incoming request’s context), then the new span is created as the child span of the span present in the passed context. On the other hand if the context has no tracing information, a new root span is created for the outbound request.

Propagating Headers from gRPC to HTTP Requests

Now let’s look at the scenario if your application makes a new outbound HTTP/1.1 request upon receiving a new incoming gRPC request. Here’s the sample code to accomplish header propagating in this case:

  import (
    "net/http"

    "golang.org/x/net/context"
    "golang.org/x/net/context/ctxhttp"

    ot "github.com/opentracing/opentracing-go"
  )

  // ctx is the incoming gRPC request's context
  // addr is the address of the application being requested
  func makeNewRequest(ctx context.Context, addr string) {
    if span := ot.SpanFromContext(ctx); span != nil {
      req, _ := http.NewRequest("GET", addr, nil)

      ot.GlobalTracer().Inject(
        span.Context(),
        ot.HTTPHeaders,
        ot.HTTPHeadersCarrier(req.Header))

      resp, err := ctxhttp.Do(ctx, nil, req)
      // Do something with resp
    }
  }

This is quite standard code for serializing tracing headers from the incoming request's (HTTP or gRPC) context.

Great! So far we have been able to use libraries or standard utility code to get what we want.

Propagating Headers When Using gRPC-Gateway

One of the libraries commonly used in gRPC applications is the grpc-gateway library to expose services as RESTful JSON APIs. This is very useful when you want to consume gRPC from clients like curl, web browser, etc. which don’t understand it or maintain a RESTful architecture. More details on how to expose RESTful APIs using grpc-gateway can be found in this great blog. I highly encourage you to read it if you’re new to this architecture.

When you start using grpc-gateway and want to propagate tracing headers, there are a few very interesting interactions that are worth mentioning. The grpc-gateway documentation states that all IANA permanent HTTP headers are prefixed with grpcgateway- and added as request headers. This is great, but as tracing headers like x-b3-traceid, x-b3-spanid, etc. are not IANA-recognized permanent HTTP headers, they are not copied over to gRPC requests when grpc-gateway proxies HTTP requests. This means as soon as you add grpc-gateway to your application, the header propagation logic will stop working.

Isn’t that typical? You add one awesome thing which breaks the current working setup. No worries, I have a solution for you!

Here’s a way to ensure you don’t lose the tracing information when proxying between HTTP and gRPC using grpc-gateway:

  import (
    "net/http"
    "golang.org/x/net/context"
    "google.golang.org/grpc/metadata"
    "github.com/grpc-ecosystem/grpc-gateway/runtime"
  )

  const (
  	prefixTracerState  = "x-b3-"
  	zipkinTraceID      = prefixTracerState + "traceid"
  	zipkinSpanID       = prefixTracerState + "spanid"
  	zipkinParentSpanID = prefixTracerState + "parentspanid"
  	zipkinSampled      = prefixTracerState + "sampled"
  	zipkinFlags        = prefixTracerState + "flags"
  )

  var otHeaders = []string{
  	zipkinTraceID,
  	zipkinSpanID,
  	zipkinParentSpanID,
  	zipkinSampled,
  	zipkinFlags}

  func injectHeadersIntoMetadata(ctx context.Context, req *http.Request) metadata.MD {
  	pairs := []string{}
  	for _, h := range otHeaders {
  		if v := req.Header.Get(h); len(v) > 0 {
  			pairs = append(pairs, h, v)
  		}
  	}
  	return metadata.Pairs(pairs...)
  }

  type annotator func(context.Context, *http.Request) metadata.MD

  func chainGrpcAnnotators(annotators ...annotator) annotator {
  	return func(c context.Context, r *http.Request) metadata.MD {
  		mds := []metadata.MD{}
  		for _, a := range annotators {
  			mds = append(mds, a(c, r))
  		}
  		return metadata.Join(mds...)
  	}
  }

  // Main function of your application. Insert tracing headers into gRPC
  // metadata using annotators
  func run() {
    ...
	  annotators := []annotator{injectHeadersIntoMetadata}

	  gwmux := runtime.NewServeMux(
		  runtime.WithMetadata(chainGrpcAnnotators(annotators...)),
	  )
    ...
  }

In the code above, I have used the runtime.WithMetadata API provided by the grpc-gateway library. This API is useful for reading attributes from the HTTP request and adding them to the gRPC metadata, which is exactly what we want! A little bit more work, but still using the APIs exposed by the library.

The injectHeadersIntoMetadata annotator looks for the tracing headers in the HTTP request and appends them to the metadata, thereby ensuring that the tracing headers can be further propagated from gRPC to outbound requests using the techniques mentioned in the previous sections.

Another interesting thing you might have observed is the wrapper chainGrpcAnnotators function. The runtime.WithMetadata API only allows a single annotator to be added, which might not be enough for all scenarios. In our case, we had a tracing annotator (like the one shown above) and an authentication annotator which appended auth data from the HTTP request to the gRPC metadata. Using chainGrpcAnnotators allows you to add multiple annotators, and the wrapper function joins the metadata from the various annotators into a single metadata for the request.


Distributed tracing with Istio in AWS

Everybody loves tracing! Am I right? If you attended KubeCon (my bad, CloudNativeCon!) 2017 at Austin or saw any of the keynotes posted online, you would have noticed the recurring theme explaining the benefits of tracing, especially as it relates to DevOps tools. Istio and service mesh were hot topics and many sessions discussed how Istio provides distributed tracing out of the box making it easier for application developers to integrate tracing into their system.

Indeed, a great benefit of using service mesh is getting more visibility and understanding of your applications. Since this is a tech post (I remember categorizing it as such) let’s dig deeper in how Istio provides application tracing.

When using Istio, a sidecar envoy proxy is automatically injected next to your applications (in Kubernetes this means adding containers to the application Pod). This sidecar proxy intercepts all traffic and can add/augment tracing headers to the requests entering/leaving the application container. Additionally, the sidecar proxy also handles asynchronous reporting of spans to the tracing backends like Jaeger, Zipkin, etc. Sounds pretty awesome!

One thing that the applications do need to implement is propagating tracing headers from incoming to outgoing requests as mentioned in this Istio guide. Simple enough right? Well it’s about to get interesting.

Before we proceed further, first a little background on why I’m writing this blog. We, here at Aspen Mesh offer a supported enterprise service mesh built on open source Istio. Not only do we offer a service mesh product but we also use it in our production SaaS platform hosted in AWS (isn’t that something?).

I was tasked with propagating tracing headers in our applications so that we get nice hierarchical traces graphing the relationship between our microservices. As we are hosted in AWS, many of our microservices make outgoing requests to AWS services. During this exercise, I found some interesting interactions between adding tracing headers and using Istio with AWS services, so I decided to share my experience. This blog describes the various iterations I went through to get it all working together.

The application in question for this post is a simple web server. When it receives a HTTP request it makes an outbound DynamoDB query to fetch an item. As it is deployed in the Istio service mesh, the sidecar proxy automatically adds tracing headers to the incoming request. I wanted to propagate the tracing headers from the incoming request to the DynamoDB query request for getting all the traces tied together.

First Iteration

In order to achieve this, I decided to pass a custom function as request options to the AWS DynamoDB API, which allows you to augment request headers before they are transmitted over the wire. In the snippet below I'm using the AWS go-sdk's dynamo.GetItemWithContext for fetching an item and passing AddTracingHeaders as the request.Option. Note that the AddTracingHeaders method uses the standard OpenTracing API for injecting headers from an input context.

// Imports assumed by this snippet: the AWS SDK for Go (v1) request package and OpenTracing.
import (
  "golang.org/x/net/context"

  awsrequest "github.com/aws/aws-sdk-go/aws/request"
  ot "github.com/opentracing/opentracing-go"
)

func AddTracingHeaders() awsrequest.Option {
  return func(req *awsrequest.Request) {
    if span := ot.SpanFromContext(req.Context()); span != nil {
      ot.GlobalTracer().Inject(
        span.Context(),
        ot.HTTPHeaders,
        ot.HTTPHeadersCarrier(req.HTTPRequest.Header))
    }
  }
}

// ctx is the incoming request's context as received from the mesh
func makeDynamoQuery(ctx context.Context ) {
  // Note that AddTracingHeaders is passed as awsrequest.Option
  result, err := dynamo.GetItemWithContext(ctx, ..., AddTracingHeaders())
  // Do something with result
}

Ok, time for testing this solution! The new version compiles, and I verified locally that it is able to fetch items from DynamoDB. After deploying the new version in production with Istio (sidecar injected), I'm hoping to see the traces nicely tied together. Indeed, the traces look much better, but wait: all of the responses from DynamoDB are now HTTP status code 400. Bummer!

Looking at the error messages from aws-go-sdk, we are getting AccessDeniedException, which according to the AWS docs indicates that the signature is not valid. Adding tracing headers seems to have broken signature validation, which is odd yet interesting: I had tested in my dev environment (without the sidecar proxy) and the DynamoDB requests worked fine, but in production they stopped working. Typical developer nightmare!

Digging into the AWS sdk package, I found that the client code signs every request, including headers, with a few hardcoded exceptions. The difference between the earlier and the new version is the addition of tracing headers to the request, which are now getting signed and then handed to the sidecar proxy. Istio's sidecar proxy (in this case Envoy) changes these tracing headers (as it should!) before sending them to the DynamoDB service, which breaks the signature validation at the server.

To get this fixed we need to ensure that the tracing headers are added after the request is signed but before it is sent out by the AWS sdk. This is getting more complicated, but still doable.

Second Iteration

I couldn’t find an easy way to whitelist these tracing headers and prevent them from getting them signed. But, AWS session package provides a very flexible API for adding custom handlers which get invoked in various stages of the request lifecycle. Additionally, providing a session handler has the benefit of being added in all AWS service requests (not just DynamoDB) which use that session. Perfect!

Here’s the AddTracingHeaders method above added as a session handler:

sess, err := session.NewSession(cfg)

// Add the AddTracingHeaders as the first Send handler. This is important as one
// of the default Send handlers does the work of sending the request.
sess.Handlers.Send.PushFront(AddTracingHeaders)

This looks promising. Testing showed that the first request to the AWS DynamoDB service is successful (200 Ok!) Traces look good too! We are getting somewhere, time to test some failure scenarios.

I added an Istio fault injection rule to return an HTTP 500 error on outgoing DynamoDB requests to exercise the AWS sdk's retry logic. Snap! I'm receiving HTTP status code 400 with the AccessDeniedException error again on every retry.

Looking at the AWS request send logic, it appears that on retriable errors the code makes a copy of the previous request, signs it and then invokes the Send handlers. This means that on retries the previously added tracing headers would get signed again (i.e. earlier problem is back, hence 400s) and then the AddTracingHeaders handler would add back the tracing headers.

Now that we understand the issue, the solution we came up with is to add the tracing headers after the request is signed and before it is sent out just like the earlier implementation. In addition, to make retries work we now need to remove these headers after the request is sent so that the resigning and reinvocation of AddTracingHeaders is handled correctly.

Final Iteration

Here’s what the final working version looks like:

func injectFromContextIntoHeader(ctx context.Context, header http.Header) {
  if span := ot.SpanFromContext(ctx); span != nil {
    ot.GlobalTracer().Inject(
      span.Context(),
      ot.HTTPHeaders,
      ot.HTTPHeadersCarrier(header))
  }
}

func AddTracingHeaders() awsrequest.Option {
  return func(req *awsrequest.Request) {
    injectFromContextIntoHeader(req.Context(), req.HTTPRequest.Header)
  }
}

// This is a bit odd, inject tracing headers into an empty header map so that
// we can remove them from the request.
func RemoveTracingHeaders(req *awsrequest.Request) {
  header := http.Header{}
  injectFromContextIntoHeader(req.Context(), header)
  for k := range header {
    req.HTTPRequest.Header.Del(k)
  }
}

sess, err := session.NewSession(cfg)

// Add the AddTracingHeaders as the first Send handler.
sess.Handlers.Send.PushFront(AddTracingHeaders)

// PushBack is used here so that this handler runs after the default Send
// handler, i.e. after the request has been sent.
sess.Handlers.Send.PushBack(RemoveTracingHeaders)

Agreed, the above solution looks far from elegant, but it does work. I hope this post helps if you are in a similar situation.

If you have a better solution feel free to reach out to me at neeraj@aspenmesh.io