How to Debug Istio Mutual TLS (mTLS) Policy Issues Using Aspen Mesh

Users Care About Secure Service to Service Communication

Mutual TLS (mTLS) communication between services is a key Istio feature driving adoption, as applications do not have to be altered to support it. mTLS provides client- and server-side security for service-to-service communication, enabling organizations to enhance network security with reduced operational burden (e.g. certificate management is handled by Istio). If you are interested in learning more about this, check out Istio's mTLS docs. From regulatory concerns to auditing requirements and a host of other reasons, businesses need to demonstrate that they are following sound security practices in a microservice landscape.

Many techniques have evolved to help ease this requirement and enable businesses to focus on business value. Unfortunately, many of these techniques, from IPsec to a wide range of other solutions, require expertise to develop or configure properly. Unless you are a security expert, it is challenging to implement them correctly. Managing ciphers, algorithms, key rotation and certificates, and updating system libraries when CVEs are found, is difficult for software developers, DevOps engineers and sysadmins to keep abreast of. Even seasoned security professionals can find it difficult to implement and audit such systems. As security is a core feature, this is where a service mesh like Aspen Mesh can help. A service mesh helps to alleviate these concerns with the goal of drastically lessening the burden of securing and auditing such systems, enabling users to focus on their core products.

Gradually Adopting mTLS Within Istio

At Aspen Mesh we recommend installing Istio with global mTLS enabled. However, very few deployments of Istio are in green-field environments where services are slowly adopted, created and can be monitored independently before new services are rolled out. In these cases, users will most likely adopt mTLS gradually service-by-service and will carefully monitor traffic behavior before proceeding to the next service.

A common problem that many users experience when enabling mTLS for service communication in their service mesh is inadvertently breaking traffic. A misconfigured AuthenticationPolicy or DestinationRule can affect communication unbeknownst to a user until other issues arise.

It is difficult to monitor for such specific failures because they occur at the transport layer (L4), where raw TCP connections are first established by the underlying OS, after which the TLS handshake takes place. If a problem happens during this handshake, the Envoy sidecar is not able to create detailed diagnostic metrics and messaging because the error is not at the application layer (L7). While 503 errors can surface due to misconfiguration, a 503 alone is not specific enough to determine whether the issue is due to misconfiguration or a misbehaving service. We are working with the Istio community to add telemetry for traffic failures related to mTLS misconfiguration. This requires surfacing the relevant information from Envoy, which we are collaborating on in this pull request. Until such capability exists, there are techniques and tools, which we will discuss below, to aid you in debugging traffic management issues.

At Aspen Mesh we are trying to enable our users to feel confident in their ability to manage their infrastructure. Kubernetes, Istio and Aspen Mesh are the platform, but business value is derived from software written and configured in-house, so quickly resolving issues is paramount to our customers' success.

Debugging Policy Issues With Aspen Mesh

We will now walk through debugging policy issues when using Aspen Mesh. Many of the following techniques also apply to plain Istio in case you don't have Aspen Mesh installed.

In the below example, bookinfo was installed into the bookinfo namespace using Aspen Mesh with global mTLS set to PERMISSIVE. We then created three deployments, spanning three different namespaces, that communicated with the productpage service.

A namespace-wide policy was then created to set mTLS to STRICT. However, no corresponding DestinationRules were created, and as a result the system started to experience mTLS errors.
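Such a namespace-wide policy might look like the following (illustrative sketch using the 2019-era authentication.istio.io/v1alpha1 API; the name default makes it apply to the whole namespace, matching the default/bookinfo policy that shows up in the tls-check output below):

```yaml
apiVersion: authentication.istio.io/v1alpha1
kind: Policy
metadata:
  # The name "default" makes this a namespace-wide AuthenticationPolicy.
  name: default
  namespace: bookinfo
spec:
  peers:
  - mtls:
      mode: STRICT
```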

The graph suggests that there is a problem with the traffic generator communicating with productpage. We will first inspect policy settings and logs of our services.

Determining the DestinationRule for a workload and an associated service is pretty straightforward.

$ istioctl authn tls-check -n bookinfo traffic-generator-productpage-6b88d69f-xxfkn productpage.bookinfo.svc.cluster.local

where traffic-generator-productpage-6b88d69f-xxfkn is the name of a pod within the bookinfo namespace and productpage.bookinfo.svc.cluster.local is the server. The output will be similar to the following:

HOST:PORT                                       STATUS       SERVER     CLIENT     AUTHN POLICY         DESTINATION RULE
productpage.bookinfo.svc.cluster.local:9080     CONFLICT     mTLS       HTTP       default/bookinfo     destrule-productpage/bookinfo

If no conflict is found, the STATUS column will say OK, but in this example a conflict exists between the AuthenticationPolicy and the DestinationRule. Inspecting the output closely, we see that a namespace-wide AuthenticationPolicy is in use (determined by its name, default) and what appears, by name, to be a host-specific DestinationRule.

Using kubectl we can directly inspect the contents of the DestinationRule:

$ kubectl get destinationrule -n bookinfo destrule-productpage -o yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"networking.istio.io/v1alpha3","kind":"DestinationRule","metadata":{"annotations":{},"name":"destrule-productpage","namespace":"bookinfo"},"spec":{"exportTo":["*"],"host":"productpage","trafficPolicy":{"tls":{"mode":"DISABLE"}}}}
  creationTimestamp: "2019-10-10T20:24:48Z"
  generation: 1
  name: destrule-productpage
  namespace: bookinfo
  resourceVersion: "4874298"
  selfLink: /apis/networking.istio.io/v1alpha3/namespaces/bookinfo/destinationrules/destrule-productpage
  uid: 01612af7-eb9c-11e9-a719-06457fb661c2
spec:
  exportTo:
  - '*'
  host: productpage
  trafficPolicy:
    tls:
      mode: DISABLE

A conflict does exist, and we can fix it simply by altering our DestinationRule to use a mode of ISTIO_MUTUAL instead of DISABLE. For this example it was a fairly simple fix. At times, however, you may see a DestinationRule that is different from the one you expect. Reasoning about the correct DestinationRule object can be difficult without first knowing the resolution hierarchy established by Istio. In our example, the DestinationRule above applies to the traffic-generator workloads in the other namespaces as well.
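The corrected rule keeps everything else from the manifest above the same and only changes the TLS mode:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: destrule-productpage
  namespace: bookinfo
spec:
  exportTo:
  - '*'
  host: productpage
  trafficPolicy:
    tls:
      # ISTIO_MUTUAL tells clients to originate mTLS using Istio-managed
      # certificates, matching the STRICT AuthenticationPolicy.
      mode: ISTIO_MUTUAL
```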

DestinationRule Hierarchy Resolution

When Istio configures the sidecars for service to service communication, it must make a determination on which DestinationRule, if any, should be used to handle communication between each service. When a client attempts to contact a server, the client's request is first routed to its sidecar and that sidecar inspects its configuration to determine the method by which it should establish a communication with the server's sidecar.

The rules by which Istio creates these sidecar configurations are as follows: clients first look for DestinationRules in their own namespace that match the FQDN of the requested server. If no DestinationRule is found then the server's namespace is checked for a DestinationRule; again, if no DestinationRule is found then the Istio root namespace (default is istio-system) is checked for a matching DestinationRule.
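As a concrete illustration of that first step (the names here are hypothetical), a rule in the client's own namespace that matches the server's FQDN wins, for that client, over any rule in the server's namespace or the root namespace:

```yaml
# Illustrative only: because this DestinationRule lives in the client's
# namespace (traffic-generator) and matches the server's FQDN, Istio uses it
# for that client before consulting bookinfo or istio-system.
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: productpage-client-override
  namespace: traffic-generator
spec:
  host: productpage.bookinfo.svc.cluster.local
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL
```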

DestinationRules that use wildcards, specific ports and/or exportTo can make it even more arduous to determine which DestinationRule applies. Istio provides a set of guidelines in its documentation to help users safely adopt rule changes.

It is also worth noting that when a new DestinationRule is created to adhere to an AuthenticationPolicy change, it is important to preserve any previously applied traffic rules; otherwise service communication within your system may change behavior. For instance, suppose load balancing for service-to-service communication was previously LEAST_CONN due to a client-namespace DestinationRule targeting another namespace. The new DestinationRule should inherit that load-balancing setting; otherwise, load balancing for that service falls back to the default, ROUND_ROBIN, and the user will see a behavioral change in traffic patterns within their service mesh.
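A replacement rule that carries the old setting forward might look like this (illustrative; the name is hypothetical):

```yaml
# Illustrative only: adds ISTIO_MUTUAL for the new STRICT policy while
# preserving the pre-existing LEAST_CONN load-balancing behavior.
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: productpage-mtls
  namespace: bookinfo
spec:
  host: productpage
  trafficPolicy:
    loadBalancer:
      simple: LEAST_CONN   # inherited from the rule being replaced
    tls:
      mode: ISTIO_MUTUAL
```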

Our product helps simplify this by respecting the rules set by Istio in 1.1.0+ and inspecting existing AuthenticationPolicies and DestinationRules when creating new ones.

Even so, it is best to use the fewest number of DestinationRules possible in a service mesh. While it is an incredibly powerful feature, it's best used with discretion and intent.

Debugging Traffic Issues

Besides globally enabling mTLS and setting the outbound traffic policy to be more restrictive, we also recommend setting global.proxy.accessLogFile to log to /dev/stdout instead of /dev/null. This will enable you to view the access logs from the Envoy sidecar within your cluster when debugging Istio configuration and policy issues.
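If you install via Helm, this corresponds to a values override like the following (the exact chart layout depends on your Istio or Aspen Mesh distribution, so treat this as a sketch):

```yaml
# values override: send Envoy access logs to the container's stdout
global:
  proxy:
    accessLogFile: "/dev/stdout"
```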

After applying an AuthenticationPolicy or a DestinationRule it is possible that 503 HTTP Status codes will start appearing. Here are a couple of checks to aid you in diagnosing the issue to see if it is related to an mTLS issue.

First, we will repeat what we did above, with PODNAME being the name of the pod seeing the 503 HTTP status return codes:

$ istioctl authn tls-check <PODNAME> <DESTINATION SERVICE FQDN FORMAT>

In most cases this will be all of the debugging you will have to do. However, we can also dig deeper to understand the issue and it never hurts to know more about the underlying infrastructure of your system.
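If you want to sweep many services at once rather than checking one at a time, the CONFLICT rows can be filtered out of the tls-check output with a small shell helper. The filter_conflicts function below is an illustrative sketch, not Aspen Mesh tooling:

```shell
# filter_conflicts keeps only tls-check rows whose STATUS column reads
# CONFLICT, skipping the header line.
filter_conflicts() {
  awk 'NR > 1 && $2 == "CONFLICT"'
}

# Against a live cluster you would pipe real output through it, e.g.:
#   istioctl authn tls-check -n bookinfo <PODNAME> | filter_conflicts
# Here we demonstrate with captured output from the example above:
printf '%s\n' \
  'HOST:PORT STATUS SERVER CLIENT AUTHN POLICY DESTINATION RULE' \
  'productpage.bookinfo.svc.cluster.local:9080 CONFLICT mTLS HTTP default/bookinfo destrule-productpage/bookinfo' \
  'details.bookinfo.svc.cluster.local:9080 OK mTLS mTLS default/bookinfo -' \
  | filter_conflicts
```

Only the productpage row survives the filter, immediately pointing at the misconfigured service.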

Remember that in a distributed system, changes may take a while to propagate; Pilot and Mixer are responsible for passing configuration and enforcing policy, respectively. Let's start by looking at some logs and sidecar configuration.

By enabling proxy access logs we can view them directly:

$ kubectl logs -n <POD NAMESPACE> <PODNAME> -c istio-proxy 

where you may see logs similar to the following:

[2019-10-07T21:54:37.175Z] "GET /productpage HTTP/1.1" 503 UC "-" "-" 0 95 1 - "-" "curl/7.35.0" "819c2e8b-ddad-4579-8508-794ab7de5a55" "productpage:9080" "XXX.XXX.XXX.XXX:9080" outbound|9080||productpage.bookinfo.svc.cluster.local - XXX.XXX.XXX.XXX:9080 XXX.XXX.XXX.XXX:33834 -
[2019-10-07T21:54:38.188Z] "GET /productpage HTTP/1.1" 503 UC "-" "-" 0 95 1 - "-" "curl/7.35.0" "290b42e7-5140-4881-ae87-778b352adcad" "productpage:9080" "XXX.XXX.XXX.XXX:9080" outbound|9080||productpage.bookinfo.svc.cluster.local - XXX.XXX.XXX.XXX:9080 XXX.XXX.XXX.XXX:33840 -

It is important to note the 503 UC in the above access logs. According to Envoy's documentation, the UC response flag means "Upstream connection termination in addition to 503 response code." This helps us understand that the failure is likely an mTLS issue.

If the containers inside of your service mesh contain curl (or equivalent) you can also run the following command within a pod that is experiencing 503s:

$ kubectl exec -c <CONTAINER> <PODNAME> -it -- curl -vv http://<DESTINATION SERVICE FQDN>:<PORT>

which may then output something akin to

* Rebuilt URL to: http://productpage.bookinfo.svc.cluster.local:9080/
* Hostname was NOT found in DNS cache
*   Trying XXX.XXX.XXX.XXX...
* Connected to productpage.bookinfo.svc.cluster.local (XXX.XXX.XXX.XXX) port 9080 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.35.0
> Host: productpage.bookinfo.svc.cluster.local:9080
> Accept: */*
>
< HTTP/1.1 503 Service Unavailable
< content-length: 95
< content-type: text/plain
< date: Mon, 07 Oct 2019 22:09:23 GMT
* Server envoy is not blacklisted
< server: envoy
<
* Connection #0 to host productpage.bookinfo.svc.cluster.local left intact
upstream connect error or disconnect/reset before headers. reset reason: connection termination

The last line is what's important: the HTTP headers could not be sent before the underlying TCP connection was terminated. This is a very strong indication that the TLS handshake failed.

And lastly, you can inspect the configuration sent by Pilot to your pod's sidecar using istioctl.

$ istioctl proxy-config cluster -n <POD NAMESPACE> <PODNAME> -o json

If you search the output for the destination service name, you will see an embedded metadata JSON element naming the specific DestinationRule the pod is currently using to communicate with that service.

{
    "metadata": {
      "filterMetadata": {
        "istio": {
          "config": "/apis/networking/v1alpha3/namespaces/traffic-generator/destination-rule/named-destrule"
        }
      }
    }
}
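Since the full dump is large, a quick filter can surface just the DestinationRule references; the extract_destrules helper below is an illustrative sketch:

```shell
# extract_destrules pulls the DestinationRule config paths out of the
# proxy-config JSON dump and de-duplicates them.
extract_destrules() {
  grep -o '"config": *"[^"]*destination-rule[^"]*"' | sort -u
}

# Against a live cluster you would run:
#   istioctl proxy-config cluster -n <POD NAMESPACE> <PODNAME> -o json | extract_destrules
# Demonstrated here on a captured fragment of the dump above:
printf '%s\n' \
  '"config": "/apis/networking/v1alpha3/namespaces/traffic-generator/destination-rule/named-destrule"' \
  | extract_destrules
```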

If you look closely at that returned object you can also inspect and verify the rules being applied. The source of truth at any given moment is always your pod's Envoy sidecar configuration, so while you don't need to become an expert in all the nuances of debugging Istio, this is another tool in your debugging toolbelt.

The Future

Istio is an incredibly sophisticated and powerful tool. Similar to other such tools, it requires expertise to get the most out of it, but the rewards are greater than the challenge. Aspen Mesh is committed to enabling Istio and our customers to succeed. As our platform matures, we will continue to help users by surfacing use cases and examples like the above service graph, along with further in-depth ways to diagnose and troubleshoot issues. Lowering the mean time to detect (MTTD) and mean time to resolve (MTTR) for our users is critical to their success.

There are some exciting things that Aspen Mesh is planning to help our users tackle some of the hurdles we've found when adopting Istio. Keep an eye on our blog for future announcements.


Why Is Policy Hard?

Aspen Mesh spends a lot of time talking to users about policy, even if we don’t always start out calling it that. A common pattern we see with clients is:

  1. Concept: "Maybe I want this service mesh thing"
  2. Install: "Ok, I've got Aspen Mesh installed, now what?"
  3. Observe: "Ahhh! Now I see how my microservices are communicating.  Hmmm, what's that? That pod shouldn't be talking to that database!"
  4. Act: "Hey mesh, make sure that pod never talks to that database"

The Act phase is interesting, and there’s more to it than might be obvious at first glance. I'll propose that in this blog, we work through some thought experiments to delve into how service mesh can help you act on insights from the mesh.

First, put yourself in the shoes of the developer that just found out their test pod is accidentally talking to the staging database. (Ok, you're working from home today so you don't have to put on shoes; the cat likes sleeping on your shoeless feet better anyways.) You want to control the behavior of a narrow set of software for which you're the expert; you have local scope and focus.

Next, put on the shoes of a person responsible for operating many applications; people we talk to often have titles that include Platform, SRE, Ops, Infra. Each day they’re diving into different applications so being able to rapidly understand applications is key. A consistent way of mapping across applications, datacenters, clouds, etc. is critical. Your goal is to reduce "snowflake architecture" in favor of familiarity to make it easier when you do have to context switch.

Now let's change into the shoes of your org's Compliance Officer. You’re on the line for documenting and proving that your platform is continually meeting compliance standards. You don't want to be the head of the “Department of No”, but what’s most important to you is staying out of the headlines. A great day at work for you is when you've got clarity on what's going on across lots of apps, databases, external partners, every source of data your org touches AND you can make educated tradeoffs to help the business move fast with the right risk profile. You know it’s ridiculous to be involved in every app change, so you need separation-of-concerns.

I'd argue that all of these people have policy concerns. They want to be able to specify their goals at a suitably high level and leave the rote and repetitive portions to an automated system.  The challenging part is there's only one underlying system ("the kubernetes cluster") that has to respond to each of these disparate personas.

So, to me policy is about transforming a bunch of high-level behavioral prescriptions into much lower-level versions through progressive stages. Useful real-world policy systems do this in a way that is transparent and understandable to all users, and minimizes the time humans spend coordinating. Here's an example "day-in-the-life" of a policy:

At the top is the highest-level goal: "Devs should test new code without fear". Computers are hopeless at implementing this. At the bottom is a rule suitable for a computer like a firewall to implement.

The layers in the middle are where a bad policy framework can really hurt. Some personas (the hypothetical Devs) want to instantly jump to the bottom; they're the "4.3.2.1" in the above example. Other personas (the hypothetical Compliance Officer) are way up top, going down a few layers but not getting to the bottom on a day-to-day basis.

I think the best policy frameworks help each persona:

  • Quickly find the details for the layer they care about right now.
  • Help them understand "Where did this come from?" (connect to higher layers)
  • Help them understand "Is this doing what I want?" (trace to lower layers)
  • Know "Where do I go to change this?" (edit/create policy)

As an example, let's look at iptables, one of the firewalling/packet mangling frameworks for Linux.  This is at that bottom layer in my example stack - very low-level packet processing that I might look at if I'm an app developer and my app's traffic isn't doing what I'd expect.  Here's an example dump:


root@kafka-0:/# iptables -n -L -v --line-numbers -t nat
Chain PREROUTING (policy ACCEPT 594K packets, 36M bytes)
num   pkts  bytes  target             prot opt in  out  source      destination
1     594K  36M    ISTIO_INBOUND      tcp  --  *   *    0.0.0.0/0   0.0.0.0/0

Chain INPUT (policy ACCEPT 594K packets, 36M bytes)
num   pkts  bytes  target             prot opt in  out  source      destination

Chain OUTPUT (policy ACCEPT 125K packets, 7724K bytes)
num   pkts  bytes  target             prot opt in  out  source      destination
1     12M   715M   ISTIO_OUTPUT       tcp  --  *   *    0.0.0.0/0   0.0.0.0/0

Chain POSTROUTING (policy ACCEPT 12M packets, 715M bytes)
num   pkts  bytes  target             prot opt in  out  source      destination

Chain ISTIO_INBOUND (1 references)
num   pkts  bytes  target             prot opt in  out  source      destination
1     0     0      RETURN             tcp  --  *   *    0.0.0.0/0   0.0.0.0/0    tcp dpt:22
2     594K  36M    RETURN             tcp  --  *   *    0.0.0.0/0   0.0.0.0/0    tcp dpt:15020
3     2     120    ISTIO_IN_REDIRECT  tcp  --  *   *    0.0.0.0/0   0.0.0.0/0

Chain ISTIO_IN_REDIRECT (1 references)
num   pkts  bytes  target             prot opt in  out  source      destination
1     2     120    REDIRECT           tcp  --  *   *    0.0.0.0/0   0.0.0.0/0    redir ports 15006

Chain ISTIO_OUTPUT (1 references)
num   pkts  bytes  target             prot opt in  out  source      destination
1     12M   708M   ISTIO_REDIRECT     all  --  *   lo   0.0.0.0/0   !127.0.0.1
2     7     420    RETURN             all  --  *   *    0.0.0.0/0   0.0.0.0/0    owner UID match 1337
3     0     0      RETURN             all  --  *   *    0.0.0.0/0   0.0.0.0/0    owner GID match 1337
4     119K  7122K  RETURN             all  --  *   *    0.0.0.0/0   127.0.0.1
5     4     240    ISTIO_REDIRECT     all  --  *   *    0.0.0.0/0   0.0.0.0/0

Chain ISTIO_REDIRECT (2 references)
num   pkts  bytes  target             prot opt in  out  source      destination
1     12M   708M   REDIRECT           tcp  --  *   *    0.0.0.0/0   0.0.0.0/0    redir ports 15001


This allows me to quickly understand a lot of details about what is happening at this layer. Each rule specification is on the right-hand side and is relatively intelligible to the personas that operate at this layer. On the left, I get "pkts" and "bytes" - this is a count of how many packets have triggered each rule, helping me answer "Is this doing what I want it to?". There's even more information here if I'm really struggling: I can log the individual packets that are triggering a rule, or mark them in a way that I can capture them with tcpdump.  

Finally, furthest on the left in the "num" column is a line number, which is necessary if I want to modify or delete rules or add new ones before/after; this is a little bit of help for "Where do I go to change this?". I say a little bit because in most systems that I'm familiar with, including the one I grabbed that dump from, iptables rules are produced by some program or higher-level system; they're not written by a human. So if I just added a rule, it would only apply until that higher-level system intervened and changed the rules (in my case, until a new Pod was created, which can happen at any time). I need help navigating up a few layers to find the right place to effect the change.

iptables lets you organize groups of rules into your own chains; in this case, the name of each chain (ISTIO_*) is a hint that Istio produced these rules, so I've got a hint about which higher layer to examine.

For a much different example, how about the Kubernetes CI Robot (from Prow)? If you've ever made a PR to Kubernetes or many other CNCF projects, you likely interacted with this robot. It's an implementer of policy; in this case, the policies around changing source code for Kubernetes. One of the policies it manages is compliance with the Contributor License Agreement; contributors agree to grant some intellectual property rights surrounding their contributions. If k8s-ci-robot can't confirm that everything is alright, it will add a comment to your PR with links to help you resolve the issue.

This is much different from firewall policy, but I say it's still policy and I think the same principles apply. Let's explore. If you had to diagram the policy around this, it would start at the top with the legal principle that Kubernetes wants to make sure all the software under its umbrella has free and clear IP terms. Stepping down a layer, the Kubernetes project decided to satisfy that requirement by requiring a CLA for any contributions. And so on, until we get to the bottom layer: the code that implements the CLA check.

As an aside, the code that implements the CLA check is actually split into two halves: first there's a CI job that actually checks the commits in the PR against a database of signed CLAs, and then there's code that takes the result of that job and posts helpful information for users to resolve any issues. That's not visible or important at that top layer of abstraction (the CNCF lawyers shouldn't care).

This policy structure is easy to navigate. If your CLA check fails, the comment from the robot has great links. If you're an individual contributor you can likely skip up a layer, sign the CLA and move on. If you're contributing on behalf of a company, the links will take you to the document you need to send to your company's lawyers so they can sign on behalf of the company.

So those are two examples of policy. You probably encounter many other ones every day from corporate travel policy to policies (written, unwritten or communicated via email missives) around dirty dishes.

It's easy to focus on the technical capabilities of the lowest levels of your system. But I'd recommend that you don't lose focus on the operability of your system. It’s important that it be transparent and easy to understand. Both iptables and the k8s-ci-robot are transparent. The k8s-ci-robot has an additional feature: it knows you're probably wondering "Where did this come from?" and it answers that question for you. This helps you and your organization navigate the layers of policy.

When implementing service mesh to add observability, resilience and security to your Kubernetes clusters, it’s important to consider how to set up policy in a way that can be navigated by your entire team. With that end in mind, Aspen Mesh is building a policy framework for Istio that makes it easy to implement policy and understand how it will affect application behavior.
